Open ibnesayeed opened 7 years ago
preferably in JSON format
Why JSON?
Also mind the verbiage w/ "memento count" vs "URI-M count" per https://arxiv.org/abs/1703.03302
JSON will be directly consumable by many visualization libraries in JS or other languages. As far as "memento count" is concerned, if "URI-M count" is preferred then we might need to update the X-Memento-Count header as well.
Here is a sample output draft.
{
"original_uri": "http://example.org/index.html",
"total_mementos": 54,
"archives": {
"web.archive.org": {
"count": 53,
"first": {
"datetime": "2002-10-16T10:13:37Z",
"uri": "http://web.archive.org/web/20021016101337/http://example.org/index.html"
},
"last": {
"datetime": "2016-04-10T22:12:45Z",
"uri": "http://web.archive.org/web/20160410221245/http://example.org/index.html"
}
},
"archive.is": {
"count": 1,
"first": {
"datetime": "2013-09-16T08:37:01Z",
"uri": "http://archive.is/20130916083701/http://example.org/index.html"
},
"last": {
"datetime": "2013-09-16T08:37:01Z",
"uri": "http://archive.is/20130916083701/http://example.org/index.html"
}
},
"webarchive.org.uk": {
"count": 0
}
},
"periods": {
"2002": {
"10": 10,
"12": 6
},
"2003": {
"01": 1,
"02": 3,
"05": 2,
"09": 1,
"11": 4
},
"2005": {
"02": 3,
"04": 7,
"05": 2,
"08": 5
},
"2013": {
"07": 1,
"09": 3
},
"2016": {
"02": 5,
"04": 1
}
}
}
@ibnesayeed total_mementos seems semantically inconsistent with other quantifiers, e.g., the count fields in the above JSON and the X-Memento-Count header.
Thanks @machawk1, the point is taken. We can perhaps make it more coherent across the board. However, the primary goal of this sample output was to communicate the intended implementation to collect ideas of what other information can be provided to aid tools as well as what tools can be built if such information is available.
@ibnesayeed Right. I just wanted to encourage consistency. The temporal breakdown you have will be really useful. What are your thoughts on having that same sort of breakdown (optionally, additionally, and/or in lieu of the inter-archive) on a per-archive basis?
What are your thoughts on having that same sort of breakdown (optionally, additionally, and/or in lieu of the inter-archive) on a per-archive basis?
That is certainly doable if it seems useful for some applications/visualizations. However, it would increase the size of the response. One might also think about the possibility of breaking down data on archives within each monthly period too. So, I think we should structure it in a way that future extensions don't break the current structure while being able to add more fine-grained breakdown information.
Additionally, breakdown on http/https and www/naked can also be added. In future, if TimeMaps include status code, some stats on that can be provided as well.
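A minimal sketch of how the http/https and www/naked breakdown could be computed, assuming Wayback-style URI-Ms in which the original URI is embedded after the archive prefix. The helper names `embedded_original` and `variant_breakdown` are hypothetical and not part of MemGator.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def embedded_original(uri_m: str) -> str:
    """Return the original URI embedded in a URI-M (assumed Wayback-style layout)."""
    # Skip the first character so the archive's own scheme cannot match,
    # then look for the next scheme marker.
    match = re.search(r"https?://", uri_m[1:])
    if match is None:
        raise ValueError(f"no embedded original URI in {uri_m!r}")
    return uri_m[1 + match.start():]


def variant_breakdown(uri_ms):
    """Count scheme (http/https) and host form (www/naked) of the embedded URIs."""
    counts = Counter()
    for uri_m in uri_ms:
        parts = urlsplit(embedded_original(uri_m))
        host = parts.hostname or ""
        counts[parts.scheme] += 1
        counts["www" if host.startswith("www.") else "naked"] += 1
    return dict(counts)


# URI-Ms taken from the sample outputs in this thread.
sample = [
    "https://web.archive.org/web/20060514123511/http://www.matkelly.com:80/",
    "https://web.archive.org/web/20240413142440/https://matkelly.com/",
    "http://archive.md/20130618191814/http://matkelly.com/",
]
breakdown = variant_breakdown(sample)
print(breakdown)
```

Archives that do not embed the original URI in the URI-M would need a different extraction step, so this is only illustrative.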
Adding period information also increases the size of the response. There should probably be a way to specify the granularity of the temporal breakdown and whether per-archive information is included. The year-month choice seems arbitrary with the alternative of a by-year breakdown being a more expected default.
Adding period information also increases the size of the response.
It sure does, but the number of items is capped at number of archives + 12 * number of years since archival started (this will add at most 12 entries per year). Nesting periods under archives, or the other way around, brings that up to number of archives * 12 * number of years since archival started. The total number of mementos is not a factor here.
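To make the two bounds above concrete, here is a small worked calculation. The function names and the figures (10 archives, 25 years) are made up purely for illustration.

```python
def flat_entry_bound(num_archives: int, years: int) -> int:
    """Upper bound when archives and monthly periods are separate sections."""
    return num_archives + 12 * years


def nested_entry_bound(num_archives: int, years: int) -> int:
    """Upper bound when monthly periods are nested under each archive."""
    return num_archives * 12 * years


# Hypothetical figures: 10 archives, 25 years of archival history.
print(flat_entry_bound(10, 25))    # 310
print(nested_entry_bound(10, 25))  # 3000
```

Neither bound depends on how many mementos exist, only on the number of archives and the time span covered.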
There should probably be a way to specify the granularity of the temporal breakdown and whether per-archive information is included. The year-month choice seems arbitrary with the alternative of a by-year breakdown being a more expected default.
We can make a dedicated endpoint that accepts various parameters to let the client pick and choose what it wants. However, that would increase the complexity of the code (making it harder to explain and maintain) and yield confusing API documentation. This is perhaps the perfect opportunity to introduce GraphQL in MemGator, but I would hold off on it because it would require some serious planning to see what other endpoints can go that route.
For now, this endpoint should give enough of a high-level summary of a TimeMap to help various visualization and archival exploration applications. The choice of month granularity is a good compromise between usefulness and response size, without the complexity of parameters. A client can easily accumulate yearly stats from the monthly breakdown, but the reverse would not be possible.
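As a rough illustration of the client-side rollup, the sketch below sums the monthly counts into yearly totals, assuming the nested year/month layout of the draft output. The function name `yearly_totals` is hypothetical.

```python
def yearly_totals(periods: dict[str, dict[str, int]]) -> dict[str, int]:
    """Collapse a {year: {month: count}} mapping into {year: count}."""
    return {year: sum(months.values()) for year, months in periods.items()}


# Subset of the "periods" object from the sample draft.
periods = {
    "2002": {"10": 10, "12": 6},
    "2003": {"01": 1, "02": 3, "05": 2, "09": 1, "11": 4},
}
print(yearly_totals(periods))  # {'2002': 16, '2003': 11}
```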
I made some headway on this issue, see the issue-97 branch. The current output on that branch yields something like:
{
"original_uri": "http://matkelly.com",
"archives": {
"web.archive.org":{
"count": 208,
"first":{
"datetime": 20060514123511,
"uri": "https://web.archive.org/web/20060514123511/http://www.matkelly.com:80/",
}
"last":{
"datetime": 20240413142440,
"uri": "https://web.archive.org/web/20240413142440/https://matkelly.com/",
}
},
"archive.md":{
"count": 18,
"first":{
"datetime": 20130618191814,
"uri": "http://archive.md/20130618191814/http://matkelly.com/",
}
"last":{
"datetime": 20210406203127,
"uri": "http://archive.md/20210406203127/https://matkelly.com/",
}
},
"wayback.archive-it.org":{
"count": 3,
"first":{
"datetime": 20140210154006,
"uri": "https://wayback.archive-it.org/all/20140210154006/http://matkelly.com/",
}
"last":{
"datetime": 20160805024730,
"uri": "https://wayback.archive-it.org/all/20160805024730/http://matkelly.com/",
}
},
"arquivo.pt":{
"count": 11,
"first":{
"datetime": 20200218230719,
"uri": "https://arquivo.pt/wayback/20200218230719mp_/https://matkelly.com/",
}
"last":{
"datetime": 20230121055854,
"uri": "https://arquivo.pt/wayback/20230121055854mp_/http://matkelly.com/",
}
},
"total_mementos": 240
The temporal breakdown still needs to be done and there are likely some formatting issues and code cleanup to do.
Task:
@ibnesayeed also suggested adding entries for archives that report zero mementos for the URI-R.
Also, change count to memento_count to be consistent.
We need an API endpoint that provides a summary of the aggregated TimeMap, preferably in JSON format. The summary can group memento counts by upstream archive and also provide a nested distribution at the year and month levels.
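A minimal sketch of how such a summary could be assembled, not the code on the issue-97 branch. It assumes each memento is available as an (archive host, 14-digit datetime, URI-M) tuple, uses the memento_count key and zero-count entries suggested in the task notes, and the function name `summarize` is hypothetical.

```python
from collections import defaultdict


def summarize(original_uri, mementos, known_archives):
    """Build a summary from (archive_host, 14-digit datetime, uri_m) tuples."""
    # Seed every known archive so that archives with no mementos still appear.
    archives = {host: {"memento_count": 0} for host in known_archives}
    periods = defaultdict(lambda: defaultdict(int))
    total = 0
    # 14-digit datetimes sort chronologically as strings.
    for host, dt, uri in sorted(mementos, key=lambda m: m[1]):
        entry = archives.setdefault(host, {"memento_count": 0})
        entry["memento_count"] += 1
        entry.setdefault("first", {"datetime": dt, "uri": uri})
        entry["last"] = {"datetime": dt, "uri": uri}
        periods[dt[:4]][dt[4:6]] += 1  # year, then month
        total += 1
    return {
        "original_uri": original_uri,
        "total_mementos": total,
        "archives": archives,
        "periods": {year: dict(months) for year, months in periods.items()},
    }


mementos = [
    ("web.archive.org", "20240413142440",
     "https://web.archive.org/web/20240413142440/https://matkelly.com/"),
    ("web.archive.org", "20060514123511",
     "https://web.archive.org/web/20060514123511/http://www.matkelly.com:80/"),
    ("archive.md", "20130618191814",
     "http://archive.md/20130618191814/http://matkelly.com/"),
]
summary = summarize(
    "http://matkelly.com",
    mementos,
    ["web.archive.org", "archive.md", "webarchive.org.uk"],
)
```

Passing the result to `json.dumps` would give a response shaped like the drafts above, with the temporal breakdown kept in a separate top-level section so finer-grained fields can be added later without breaking clients.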