internetarchive / fatcat

Perpetual Access To The Scholarly Record
https://guide.fatcat.wiki

many file webarchive (wayback) URLs have only 12 of 14 timestamp digits #81

Open bnewbold opened 3 years ago

bnewbold commented 3 years ago

For example:

https://fatcat.wiki/file/rcbebk4ox5esbnnpipbnegy7si

Some file entities have two wayback URLs, one with 12 digits and one with the full 14. In the majority of cases, however, there is only a single URL with 12 digits. Informally, this seems to impact something like 10% to 30% of all file entities (!).

The cause was a bug in the old arabesque pipeline for crawl-specific imports, which was used before the sandcrawler/crawl-bot pipeline was adopted. The bot agent that created the bad metadata was fatcat_tools.ArabesqueMatchImporter, but the underlying bug was in arabesque itself, which stored only 12 digits in sqlite.

Among other problems, having only 12 digits causes an extra wayback redirect at fetch time (inefficient) and breaks exact string comparisons, which results in multiple wayback URLs being added to the same file entity.
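To make the string-comparison problem concrete, here is a minimal sketch of how the timestamp length in a wayback URL could be checked. The helper names and the example URLs are hypothetical, and the regex assumes the usual `https://web.archive.org/web/<timestamp>/<original-url>` layout; this is not the code used by fatcat or arabesque.

```python
import re
from typing import Optional

# Assumed wayback URL layout: /web/<digits>[<two-letter modifier>_]/<original URL>
WAYBACK_TS_RE = re.compile(r"^https?://web\.archive\.org/web/(\d+)(?:[a-z]{2}_)?/")


def wayback_timestamp(url: str) -> Optional[str]:
    """Return the timestamp digits of a wayback URL, or None if it is not one."""
    match = WAYBACK_TS_RE.match(url)
    return match.group(1) if match else None


def is_full_timestamp(url: str) -> bool:
    """True only when the wayback timestamp has all 14 digits (YYYYMMDDhhmmss)."""
    ts = wayback_timestamp(url)
    return ts is not None and len(ts) == 14


# Hypothetical pair of URLs for the same capture: one truncated, one complete.
short_url = "https://web.archive.org/web/201801011200/http://example.com/paper.pdf"
full_url = "https://web.archive.org/web/20180101120000/http://example.com/paper.pdf"

# An exact string comparison treats these as different URLs, which is how
# a second wayback URL can end up being added next to the truncated one.
urls_differ = short_url != full_url
```

A 12-digit timestamp is a strict prefix of the 14-digit one, so a prefix check could flag likely duplicates, but it cannot recover the missing seconds on its own.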

Cleanup jobs will need to be written, tested, and executed which:

bnewbold commented 2 years ago

The vast majority of these, more than 9.5 million file entities, have now been updated. In addition to the 12-digit problem, many URLs with 4-digit (year only) timestamps were also expanded.

See notes at:

The remaining task is to check for invalid URLs still present after the next bulk metadata export, and to investigate why a small fraction of URLs could not be fixed.
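A check like this could be sketched as a tally of wayback timestamp lengths across an export. The sketch below assumes the export is JSON lines with one file entity per line, each carrying a `urls` list of objects with a `url` field; the function name and sample data are hypothetical, and the real export layout should be confirmed before relying on it.

```python
import json
import re
from collections import Counter
from typing import Iterable

# Assumed wayback URL layout: /web/<digits>...
TS_RE = re.compile(r"^https?://web\.archive\.org/web/(\d+)")


def timestamp_length_counts(lines: Iterable[str]) -> Counter:
    """Count wayback URLs by number of timestamp digits (14 is the valid length)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        entity = json.loads(line)
        for entry in entity.get("urls") or []:
            match = TS_RE.match(entry.get("url", ""))
            if match:
                counts[len(match.group(1))] += 1
    return counts


# Hypothetical sample standing in for a few lines of an export file.
sample = [
    json.dumps({"urls": [{"url": "https://web.archive.org/web/201801011200/http://example.com/a.pdf"}]}),
    json.dumps({"urls": [{"url": "https://web.archive.org/web/20180101120000/http://example.com/b.pdf"},
                         {"url": "http://example.com/b.pdf"}]}),
    json.dumps({"urls": [{"url": "https://web.archive.org/web/2018/http://example.com/c.pdf"}]}),
]
result = timestamp_length_counts(sample)
```

Any key other than 14 in the resulting counter points at URLs that still need attention, including the 4-digit (year only) case.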