remy / unrot.link

Un-rot your links using the archive.org

wayback lookup fails past latest snapshot date #7

Open willnorris opened 8 months ago

willnorris commented 8 months ago

I was trying to figure out why the haveamint.com link in this post of mine wasn't redirecting to the wayback url.

It appears to be due to how unrot is calling the CDX API. My blog post has a publish date of 2023-12-15, so that's being included in the CDX call. But the last snapshot that wayback has for that URL is 2023-06-07. Because unrot is specifying the date as the from parameter, this is understandably returning 0 results:

% curl "https://web.archive.org/cdx/search/cdx?output=json&filter=statuscode:200&url=https%3A%2F%2Fhaveamint.com%2F&from=20231215"
[]
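To make the effect of the date bound easier to see, here is a minimal sketch of how such a query string could be assembled. It is not unrot's actual code: `cdx_query_url` and its arguments are names made up for illustration, and only the query parameters that appear in the curl calls in this issue are used.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(url, publish_date, bound="from", limit=None):
    """Build a CDX search URL bounded by a YYYY-MM-DD publish date.

    bound="from" keeps captures on or after the date,
    bound="to" keeps captures on or before it.
    """
    if bound not in ("from", "to"):
        raise ValueError("bound must be 'from' or 'to'")
    params = {
        "output": "json",
        "filter": "statuscode:200",
        "url": url,
        # The API takes compact timestamps, so 2023-12-15 becomes 20231215.
        bound: publish_date.replace("-", ""),
    }
    if limit is not None:
        params["limit"] = str(limit)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(cdx_query_url("https://haveamint.com/", "2023-12-15", bound="from"))
print(cdx_query_url("https://haveamint.com/", "2023-06-08", bound="to", limit=-2))
```

With `bound="from"` and a publish date later than the last capture, no capture can fall inside the range, which matches the empty result above.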

You could possibly change that to use the to parameter, accepting slightly older snapshots. However, that seems to have some odd results when using limit=-1, returning a much older snapshot from 2017:

% curl "https://web.archive.org/cdx/search/cdx?output=json&filter=statuscode:200&url=https%3A%2F%2Fhaveamint.com%2F&to=20230608&limit=-1"
[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
["com,haveamint)/", "20170710024046", "https://haveamint.com/", "text/html", "200", "B6E7OAXOGHFZCQ6QW4UGO2QCWQES625M", "3313"]]

I suspect this has to do with the fastLatest=true parameter being implied, and whatever caching that is doing to try and determine the latest result. Specifying even two latest results seems to get the actual latest:

% curl "https://web.archive.org/cdx/search/cdx?output=json&filter=statuscode:200&url=https%3A%2F%2Fhaveamint.com%2F&to=20230608&limit=-2"
[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
["com,haveamint)/", "20230606093924", "https://haveamint.com/", "text/html", "200", "YEDT2GZL7TVOPCGGP6MB3EJ44A7T3D56", "3710"],
["com,haveamint)/", "20230607154300", "https://haveamint.com/", "text/html", "200", "YEDT2GZL7TVOPCGGP6MB3EJ44A7T3D56", "3709"]]

Reading the CDX docs, asking for the last N results this way does seem to be a more expensive query for the wayback API, particularly for URLs with lots of snapshots.
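If more than one row is requested, the caller then has to pick the newest one. A rough sketch of that step, assuming the JSON layout shown in the responses above (a header row followed by records); `latest_snapshot` and `wayback_url` are hypothetical helper names, not part of unrot:

```python
import json


def latest_snapshot(cdx_json):
    """Return (timestamp, original_url) of the newest row, or None if empty."""
    rows = json.loads(cdx_json)
    if len(rows) < 2:
        # Either [] or a header row with no records.
        return None
    header, *records = rows
    ts = header.index("timestamp")
    original = header.index("original")
    # Timestamps are fixed-width digit strings, so string order is date order.
    newest = max(records, key=lambda record: record[ts])
    return newest[ts], newest[original]


def wayback_url(timestamp, original):
    """Build a Wayback Machine replay URL for a given capture."""
    return f"https://web.archive.org/web/{timestamp}/{original}"


body = """[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
["com,haveamint)/", "20230606093924", "https://haveamint.com/", "text/html", "200", "YEDT2GZL7TVOPCGGP6MB3EJ44A7T3D56", "3710"],
["com,haveamint)/", "20230607154300", "https://haveamint.com/", "text/html", "200", "YEDT2GZL7TVOPCGGP6MB3EJ44A7T3D56", "3709"]]"""

found = latest_snapshot(body)
if found:
    print(wayback_url(*found))
```

Looking columns up by header name rather than position keeps the sketch working if the field list in the response changes.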

This is also somewhat of a contrived example where I'm intentionally writing a post with a known dead link, so that may not be worth optimizing for. For my case, I could always just manually make it an unrot link, or maybe specify a snapshot date to use in the link itself: <a href="https://haveamint.com" data-snapshot-date="2023-06-07"> or something. But I thought I'd bring it up and see what you think.
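For what it's worth, the per-link override could be quite small. The sketch below is only an illustration of the idea: `data-snapshot-date` is the made-up attribute from the paragraph above, `pinned_wayback_urls` is a hypothetical helper, and it assumes the Wayback Machine resolves a date-only timestamp to a nearby capture.

```python
from html.parser import HTMLParser


class SnapshotLinkParser(HTMLParser):
    """Collect (href, compact date) pairs from <a> tags that pin a date."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        date = attributes.get("data-snapshot-date")
        if href and date:
            # 2023-06-07 -> 20230607, the compact form used in Wayback URLs.
            self.links.append((href, date.replace("-", "")))


def pinned_wayback_urls(html):
    """Return a Wayback URL for every link carrying a pinned snapshot date."""
    parser = SnapshotLinkParser()
    parser.feed(html)
    parser.close()
    return [
        f"https://web.archive.org/web/{stamp}/{href}"
        for href, stamp in parser.links
    ]


print(pinned_wayback_urls(
    '<a href="https://haveamint.com" data-snapshot-date="2023-06-07">mint</a>'
))
```

Links without the attribute are left alone, so they would still go through the normal CDX lookup.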

remy commented 8 months ago

Good shout. I hadn't seen the fastLatest=true parameter, so I'll do some timing checks, but maybe I can use that (disabled) in the "backup" call.

I'll definitely fix this in the coming week (just slightly more offline during the xmas period).
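One possible shape for a primary call plus a "backup" call, purely as a sketch and not the fix that was shipped: try the `from` query first and only fall back to the `to` query with `limit=-2` when it comes back empty. `find_snapshot` and the injected `fetch` callable are hypothetical; taking the earliest capture in the primary call is an assumption about the desired behaviour.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def find_snapshot(url, publish_date, fetch):
    """Return a capture timestamp for url, or None if neither query finds one.

    fetch is any callable that takes a URL string and returns the response
    body as text, so the HTTP client can be swapped or stubbed.
    """
    stamp = publish_date.replace("-", "")
    base = {"output": "json", "filter": "statuscode:200", "url": url}
    attempts = [
        # Primary: captures on or after the publish date, take the earliest.
        ({**base, "from": stamp}, min),
        # Backup: captures on or before the publish date, take the latest.
        ({**base, "to": stamp, "limit": "-2"}, max),
    ]
    for params, pick in attempts:
        rows = json.loads(fetch(f"{CDX_ENDPOINT}?{urlencode(params)}"))
        if len(rows) < 2:
            continue  # empty result, try the next query
        header, *records = rows
        column = header.index("timestamp")
        return pick(record[column] for record in records)
    return None
```

Because the backup only runs when the primary call is empty, the more expensive query is paid for only on links that would otherwise fail.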