phonedude opened this issue 3 years ago
I'd love to support more archives; however, two things make the Wayback Machine attractive for this:
Time Travel is lovely, but it takes a really long time to load (30-50 seconds for any given lookup, and many lookups simply time out), and as far as I can tell it doesn't support any form of site search or URL-prefix matching. Given the connection speed of some of the devices I'm targeting, I'd much prefer to reduce waiting as much as possible, since a lot of it will be in the network.
The Internet Archive's CDX API also allows significant optimisation of the code on my end, because it supports complex server-side filtering of snapshots. As far as I can tell the Memento APIs don't permit this, which in turn makes their responses slower, since they have to return all their data at once.
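For example, a single CDX call can push the filtering to the server. The query parameters below (filter, collapse, output, limit) are documented for IA's CDX server; the exact values are just illustrative:

$ # One 200-OK capture per year of a site's homepage, as JSON:
$ curl 'https://web.archive.org/cdx/search/cdx?url=example.com&output=json&filter=statuscode:200&collapse=timestamp:4&limit=50'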
Do you have any suggestions for mitigating these performance issues?
Yeah, there's no doubt IA WM was the first, is the biggest, etc., and if you can only support one, that's the one. For the 90s, IA is pretty much the only game in town; where other archives do have pages from that era, they're typically just copies of IA's WARCs (not always, but mostly).
Most other archives don't support prefix search, etc. yet, so there will be a trade-off between breadth and features. One solution would be to offer different branches: IA in one, and various non-IA archives in another. You don't have to go through TT; you could contact some of the other archives directly. Or you could run your own instance of MemGator and specify the non-IA archives you'd like to poll (e.g., just arquivo.pt, archive.today, perma.cc, and wayback.vefsafn.is); see the sketch after the timings below. The non-IA archives are likely to be sparse for many URLs, so the responses should be small and relatively quick. Try MemGator; it doesn't do any processing or pagination and is thus pretty fast.
$ time curl -isL memgator.cs.odu.edu/timemap/link/www.nasa.gov
[...]

real    0m18.784s
user    0m0.118s
sys     0m0.555s

$ time curl -isL memgator.cs.odu.edu/timemap/link/www.nasa.gov | wc -l
63969

real    0m6.465s
user    0m0.112s
sys     0m0.216s
The second call responded quickly (6s) because IA had cached its response. But the first call at 18s isn't too bad given the size of the response.
Other formats are similar:
$ time curl -isL memgator.cs.odu.edu/timemap/json/www.nasa.gov | wc -l
255847

real    0m10.538s
user    0m0.164s
sys     0m0.256s

$ time curl -isL memgator.cs.odu.edu/timemap/cdxj/www.nasa.gov | wc -l
63969

real    0m13.450s
user    0m0.129s
sys     0m0.251s
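If you go the self-hosted route, here's a rough sketch of a trimmed-down deployment. The --arcs flag and the archive-list fields are from my recollection of the MemGator README, and the TimeMap endpoints are illustrative, so verify both against memgator --help and MemGator's default archives.json:

$ # Hypothetical archive list naming only the archives to poll:
$ cat > archives.json <<'EOF'
[
  {"id": "arquivo", "name": "Arquivo.pt",
   "timemap": "https://arquivo.pt/wayback/timemap/link/",
   "timegate": "https://arquivo.pt/wayback/"},
  {"id": "is", "name": "Icelandic Web Archive",
   "timemap": "https://wayback.vefsafn.is/wayback/timemap/link/",
   "timegate": "https://wayback.vefsafn.is/wayback/"}
]
EOF
$ # Run the aggregator as a local service polling only those archives:
$ memgator --arcs archives.json server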
Finally, and I know you've already mentioned it in your repo, but regardless of the endpoint, some kind of caching would be a huge win for your application. It might even be worth going custom, since you're focused on data prior to a certain year (2000? 2005? 2010?): most of the updates from all archives will come from the recent past. But even a standard reverse proxy would be super speedy. If you can keep robots out of your service, you'll probably get a lot of cache hits.
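Here's a minimal sketch of the reverse-proxy idea, assuming a local MemGator on port 1208 and nginx in front of it (the directives are standard nginx; paths, ports, and cache lifetimes are illustrative):

$ cat > /etc/nginx/conf.d/memento-cache.conf <<'EOF'
# Cache TimeMap responses on disk; old captures rarely change,
# so long cache lifetimes are safe for lookups of decades-old data.
proxy_cache_path /var/cache/nginx/memento levels=1:2
                 keys_zone=memento:10m max_size=1g inactive=30d;
server {
    listen 8080;
    location /timemap/ {
        proxy_pass http://127.0.0.1:1208;
        proxy_cache memento;
        proxy_cache_valid 200 7d;
        add_header X-Cache-Status $upstream_cache_status;
    }
}
EOF
$ nginx -s reload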
I'm guessing you happened to prime some cache on their end before your testing, because timing a request to memgator.cs.odu.edu/timemap/json/www.nasa.gov takes more than a minute for me, while subsequent runs are closer to ten seconds. Neither is a particularly good time, and I intend to optimise around the cold-cache state.
mementoweb's API times out after two minutes, which isn't enough time to fetch the history for, for instance, apple.com; the result is a 504 Gateway Timeout and no usable data. On the third attempt it actually did respond, but that request came about twenty minutes later and presumably benefited from a cache primed by the first two requests. A twenty-minute request-retry cycle doesn't feel acceptable.
I just don't believe the Memento protocol as designed is fit for this purpose, given that it's missing fundamental features like filtering, date-range specifiers, or even rudimentary pagination. Those features are the only way I can see for something like this to consume the API efficiently and performantly: the CDX API lets me reduce the complexity on both ends by requesting only a small subset of the data for the initial query and, once the user has drilled down, a month's worth of less-filtered data, as shown below.
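To make that two-step pattern concrete (filter, collapse, fl, from, and to are documented CDX server parameters; the target URL and values are illustrative):

$ # Initial query: one 200-OK capture per month, trimmed to two fields:
$ curl 'https://web.archive.org/cdx/search/cdx?url=apple.com&filter=statuscode:200&collapse=timestamp:6&fl=timestamp,original&output=json'
$ # Drill-down: every capture from March 1997, unfiltered:
$ curl 'https://web.archive.org/cdx/search/cdx?url=apple.com&from=199703&to=199703&output=json'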
I've realised overnight that the site-search thing isn't really a blocker: there's no reason I couldn't use the Wayback Machine for site search and an aggregator for the actual history. But the performance problems caused by the limitations of the Memento APIs remain an issue.
I missed this bit in your prior response:
Try MemGator; it doesn't do any processing or pagination and is thus pretty fast.
The lack of processing or pagination is, IMO, exactly the cause of the trouble! It means each archive the aggregator uses has its full history queried, which can mean fetching a huge number of mementos per lookup.
There are many additional web archives that could be supported, especially if this service used Memento TimeMaps, either aggregated through TimeTravel or fetched directly from each archive's TimeMap URI (see the example after the list below).
Some lists of archives:
Memento Quick Intro
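For the direct route, fetching a TimeMap is a single HTTP GET against an archive's TimeMap endpoint. The arquivo.pt URL pattern below is illustrative; real endpoints can be taken from an archive list such as MemGator's default configuration:

$ # Link-format list of every capture arquivo.pt holds for a URL:
$ curl -sL 'https://arquivo.pt/wayback/timemap/link/http://www.nasa.gov/'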