Aggregated TimeMap summary

ibnesayeed commented 7 years ago

We need an API endpoint that provides a summary of the aggregated TimeMap, preferably in JSON format. The summary can group memento counts by upstream archive and also include a nested distribution at the year and month levels.

machawk1 commented 7 years ago

> preferably in JSON format

Why JSON?

Also mind the verbiage w/ "memento count" vs "URI-M count" per https://arxiv.org/abs/1703.03302

ibnesayeed commented 7 years ago

JSON will be directly consumable by many visualization libraries in JS or other languages. As far as "memento count" is concerned, if "URI-M count" is preferred then we might need to update the X-Memento-Count header as well.

ibnesayeed commented 7 years ago

Here is a sample output draft.

{
    "original_uri": "http://example.org/index.html",
    "total_mementos": 54,
    "archives": {
        "web.archive.org": {
            "count": 53,
            "first": {
                "datetime": "2002-10-16T10:13:37Z",
                "uri": "http://web.archive.org/web/20021016101337/http://example.org/index.html"
            },
            "last": {
                "datetime": "2016-04-10T22:12:45Z",
                "uri": "http://web.archive.org/web/20160410221245/http://example.org/index.html"
            }
        },
        "archive.is": {
            "count": 1,
            "first": {
                "datetime": "2013-09-16T08:37:01Z",
                "uri": "http://archive.is/20130916083701/http://example.org/index.html"
            },
            "last": {
                "datetime": "2013-09-16T08:37:01Z",
                "uri": "http://archive.is/20130916083701/http://example.org/index.html"
            }
        },
        "webarchive.org.uk": {
            "count": 0
        }
    },
    "periods": {
        "2002": {
            "10": 10,
            "12": 6
        },
        "2003": {
            "01": 1,
            "02": 3,
            "05": 2,
            "09": 1,
            "11": 4
        },
        "2005": {
            "02": 3,
            "04": 7,
            "05": 2,
            "08": 5
        },
        "2013": {
            "07": 1,
            "09": 3
        },
        "2016": {
            "02": 5,
            "04": 1
        }
    }
}
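A minimal sketch of how a summary shaped like the draft above could be derived from a flat list of mementos. The `summarize` function, its `(archive, datetime, uri)` tuple input, and the `known_archives` argument are assumptions made for illustration; they are not MemGator's actual code or data structures.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def summarize(original_uri, mementos, known_archives=()):
    """Build a summary dict from (archive, datetime, uri) tuples.

    All names here are hypothetical; the output mirrors the draft layout.
    """
    # Pre-seed archives so those with no mementos still appear with count 0.
    archives = {name: {"count": 0} for name in known_archives}
    periods = defaultdict(lambda: defaultdict(int))

    # Walk mementos in chronological order so first/last fall out naturally.
    for archive, dt, uri in sorted(mementos, key=lambda m: m[1]):
        entry = archives.setdefault(archive, {"count": 0})
        point = {"datetime": dt.strftime("%Y-%m-%dT%H:%M:%SZ"), "uri": uri}
        entry["count"] += 1
        entry.setdefault("first", dict(point))  # set once, by the earliest capture
        entry["last"] = point                   # overwritten until the latest capture
        periods[f"{dt.year:04d}"][f"{dt.month:02d}"] += 1

    return {
        "original_uri": original_uri,
        "total_mementos": sum(a["count"] for a in archives.values()),
        "archives": archives,
        "periods": {year: dict(months) for year, months in periods.items()},
    }


if __name__ == "__main__":
    sample = [
        ("archive.is", datetime(2013, 9, 16, 8, 37, 1, tzinfo=timezone.utc),
         "http://archive.is/20130916083701/http://example.org/index.html"),
    ]
    print(json.dumps(summarize("http://example.org/index.html", sample,
                               known_archives=["webarchive.org.uk"]), indent=4))
```

Serializing with `json.dumps` rather than assembling the string by hand keeps the commas and braces valid regardless of how many archives are present.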

machawk1 commented 7 years ago

@ibnesayeed total_mementos seems semantically inconsistent with other quantifiers, e.g., the count fields in the above JSON and the X-Memento-Count header.

ibnesayeed commented 7 years ago

Thanks @machawk1, the point is taken. We can perhaps make it more coherent across the board. However, the primary goal of this sample output was to communicate the intended implementation to collect ideas of what other information can be provided to aid tools as well as what tools can be built if such information is available.

machawk1 commented 7 years ago

@ibnesayeed Right. I just wanted to encourage consistency. The temporal breakdown you have will be really useful. What are your thoughts on having that same sort of breakdown (optionally, additionally, and/or in lieu of the inter-archive) on a per-archive basis?

ibnesayeed commented 7 years ago

> What are your thoughts on having that same sort of breakdown (optionally, additionally, and/or in lieu of the inter-archive) on a per-archive basis?

That is certainly doable if it seems useful for some applications/visualizations. However, it would increase the size of the response. One might also think about the possibility of breaking down data on archives within each monthly period too. So, I think we should structure it in a way that future extensions don't break the current structure while being able to add more fine-grained breakdown information.

Additionally, a breakdown by http/https and www/naked domain can also be added. In the future, if TimeMaps include status codes, some stats on those can be provided as well.

machawk1 commented 7 years ago

Adding period information also increases the size of the response. There should probably be a way to specify the granularity of the temporal breakdown and whether per-archive information is included. The year-month choice seems arbitrary with the alternative of a by-year breakdown being a more expected default.

ibnesayeed commented 7 years ago

> Adding period information also increases the size of the response.

It sure does, but the number of items is capped at the number of archives + 12 * the number of years since archival started (periods add at most 12 entries per year). Nesting periods under archives, or the other way around, raises the cap to the number of archives * 12 * the number of years since archival started. The total number of mementos is not a factor here.
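As a worked illustration of those two caps, here is a small sketch. The `max_entries` helper and the example figures (3 archives, 15 years of archival) are assumptions chosen for illustration only.

```python
def max_entries(num_archives, num_years):
    """Return (flat_cap, nested_cap) for the number of summary entries.

    flat_cap:   per-archive entries plus a separate year/month breakdown.
    nested_cap: a year/month breakdown repeated for every archive.
    """
    flat_cap = num_archives + 12 * num_years
    nested_cap = num_archives * 12 * num_years
    return flat_cap, nested_cap


print(max_entries(3, 15))  # (183, 540)
```

Both caps depend only on the number of archives and the time span, so a URI-R with millions of mementos produces a summary no larger than one with a handful.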

> There should probably be a way to specify the granularity of the temporal breakdown and whether per-archive information is included. The year-month choice seems arbitrary with the alternative of a by-year breakdown being a more expected default.

We can make a dedicated endpoint that accepts various parameters to let the client pick and choose what it wants. However, that would increase the complexity of the code (making it harder to explain and maintain) and yield confusing API documentation. This is perhaps the perfect opportunity to introduce GraphQL in MemGator, but I would hold off on that because it would require some serious planning to see what other endpoints can go that route.

For now, this endpoint should give a sufficiently high-level summary of a TimeMap to help various visualization and archival exploration applications. Month granularity is a good compromise between usefulness and response size, without the complexity of parameters. A client can easily accumulate yearly stats from the monthly breakdown, but the reverse would not be possible.
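The client-side roll-up from months to years can be sketched as follows, assuming the `periods` object has the year-then-month nesting shown in the sample draft. The `yearly_totals` name is hypothetical.

```python
def yearly_totals(periods):
    """Collapse a {year: {month: count}} mapping into {year: count}."""
    return {year: sum(months.values()) for year, months in periods.items()}


periods = {
    "2002": {"10": 10, "12": 6},
    "2003": {"01": 1, "02": 3, "05": 2, "09": 1, "11": 4},
}
print(yearly_totals(periods))  # {'2002': 16, '2003': 11}
```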

machawk1 commented 6 months ago

I made some headway on this issue; see the issue-97 branch. The current output on that branch yields something like:

{
 "original_uri": "http://matkelly.com",
 "archives": {
  "web.archive.org":{
   "count": 208,
   "first":{
    "datetime": 20060514123511,
    "uri": "https://web.archive.org/web/20060514123511/http://www.matkelly.com:80/",
   }
   "last":{
    "datetime": 20240413142440,
    "uri": "https://web.archive.org/web/20240413142440/https://matkelly.com/",
   }
 },
  "archive.md":{
   "count": 18,
   "first":{
    "datetime": 20130618191814,
    "uri": "http://archive.md/20130618191814/http://matkelly.com/",
   }
   "last":{
    "datetime": 20210406203127,
    "uri": "http://archive.md/20210406203127/https://matkelly.com/",
   }
 },
  "wayback.archive-it.org":{
   "count": 3,
   "first":{
    "datetime": 20140210154006,
    "uri": "https://wayback.archive-it.org/all/20140210154006/http://matkelly.com/",
   }
   "last":{
    "datetime": 20160805024730,
    "uri": "https://wayback.archive-it.org/all/20160805024730/http://matkelly.com/",
   }
 },
  "arquivo.pt":{
   "count": 11,
   "first":{
    "datetime": 20200218230719,
    "uri": "https://arquivo.pt/wayback/20200218230719mp_/https://matkelly.com/",
   }
   "last":{
    "datetime": 20230121055854,
    "uri": "https://arquivo.pt/wayback/20230121055854mp_/http://matkelly.com/",
   }
 },
 "total_mementos": 240

The temporal breakdown still needs to be done and there are likely some formatting issues and code cleanup to do.

Task:

[ ] Change "datetime" value to long-form (RFC1123?) per @ibnesayeed's example
[ ] Create summary
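For the datetime task, a sketch of converting the 14-digit values into either candidate format. The sample draft earlier in the thread uses an ISO 8601 style string (`2002-10-16T10:13:37Z`), while RFC 1123 is the HTTP-date style, so both are shown. The helper names are hypothetical, and treating the 14-digit value as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def parse_timestamp(ts):
    """Parse a 14-digit YYYYMMDDhhmmss value (int or str), assumed to be UTC."""
    return datetime.strptime(str(ts), "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def to_iso8601(ts):
    """Format as in the sample draft, e.g. 2002-10-16T10:13:37Z."""
    return parse_timestamp(ts).strftime("%Y-%m-%dT%H:%M:%SZ")


def to_rfc1123(ts):
    """Format as an HTTP-date style string ending in GMT."""
    return format_datetime(parse_timestamp(ts), usegmt=True)


print(to_iso8601(20060514123511))  # 2006-05-14T12:35:11Z
print(to_rfc1123(20060514123511))  # Sun, 14 May 2006 12:35:11 GMT
```

The ISO 8601 form sorts lexicographically and matches the draft; the RFC 1123 form matches what Memento headers and link-format TimeMaps use for datetimes.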

machawk1 commented 6 months ago

@ibnesayeed also suggested adding entries for archives that report zero mementos for the URI-R.

machawk1 commented 6 months ago

Also, change count to memento_count to be consistent.

oduwsdl / MemGator

Aggregated TimeMap summary #97