oduwsdl / MemGator

A Memento Aggregator CLI and Server in Go
https://memgator.cs.odu.edu/api.html
MIT License

Expose configuration file (archives.json) to be web accessible #127

Open machawk1 opened 4 years ago

machawk1 commented 4 years ago

MemGator 1.0-RC8 exposes the list of aggregated archives on the /about endpoint and the primary web endpoint.

## Upstream Archives

1. [Archive.today](https://archive.today/)
2. [Portuguese Web Archive](https://arquivo.pt/)
3. [Perma Archive](https://perma.cc/)
4. [Stanford Web Archive](https://swap.stanford.edu/)
5. [BAnQ](https://waext.banq.qc.ca/)
6. [Archive-It](https://wayback.archive-it.org/)
7. [Icelandic Web Archive](https://wayback.vefsafn.is/)
8. [Bibliotheca Alexandrina Web Archive](https://web.archive.bibalex.org/)
9. [Internet Archive](https://web.archive.org/)
10. [Australian Web Archive](https://web.archive.org.au/)
11. [Library and Archives Canada](https://webarchive.bac-lac.gc.ca/)
12. [Library of Congress](https://webarchive.loc.gov/)
13. [UK National Archives Web Archive](https://webarchive.nationalarchives.gov/)
14. [National Records of Scotland](https://webarchive.nrscotland.gov.uk/)
15. [UK Web Archive](https://webarchive.org.uk/)
16. [UK Parliament Web Archive](https://webarchive.parliament.uk/)

It might be useful to expose the archives' respective endpoints. If an archive is disabled or "sleeping", it might also be useful to expose this information. From what I recall, the "disabled" status is present in the JSON file but the "sleeping" attribute that occurs after some number of failures is runtime generated, so that might be trickier.

Regardless, it would be useful to expose the archives.json file that is being used in the current instance.

ibnesayeed commented 4 years ago

> From what I recall, the "disabled" status is present in the JSON file but the "sleeping" attribute that occurs after some number of failures is runtime generated, so that might be trickier.

On the contrary, I think it will be trickier to report ignored archives (those disabled explicitly in the input archive list file) because we filter them off immediately after parsing the file and do not keep any records of those ignored archives in memory, as they will not contribute to the process for the entire uptime of the service. The runtime structure is easier to report; it simply requires marshaling the array into JSON.

I can think of adding more runtime attributes to the structure of each archive, such as:

Obviously, these counters will only keep track of the state for the uptime of the instance. If we also report the uptime of the instance and the total number of received requests under the /about endpoint, this will enable a nice time series visualization of the health of the instance and its upstream archives.

machawk1 commented 4 years ago

All of the extra information you suggested would be really useful, too.

What about also keeping a record of timestamps/ranges for which an archive was dormant? This seems like it might require a lot of bookkeeping.

> we filter them off immediately after parsing the file and do not keep any records

Couldn't we keep a record of when we filter them off to be reported later?

ibnesayeed commented 4 years ago

> What about also keeping a record of timestamps/ranges for which an archive was dormant? This seems like it might require a lot of bookkeeping.

That would be a serious memory leak, as the amount of memory needed to run the service would continue to rise indefinitely for the lifetime of the instance. Keeping counters is cheap, as the value is replaced in place. By knowing the number of dormant sessions of each archive and already knowing the configurable dormant period, one can simply multiply the two to get the overall duration for which an archive was not being aggregated from.

> Couldn't we keep a record of when we filter them off to be reported later?

We could, but I do not see a compelling reason to do so. The purpose of the option to ignore selected archives before an instance is started is to allow the service maintainers to keep a record of things that they used in the past, anticipate using in the future, or have some private entries for testing. There is not a lot to gain by exposing such private information. The ignore attribute of the archives list is a way to say, "hey aggregator, don't bother about these and act as if they don't exist".