Open arthuredelstein opened 3 months ago
Did you find the original code I used to timestamp the Internet Archive? It might still work.
Hi @petertodd -- it's this one, correct? https://github.com/opentimestamps/opentimestamps-server/blob/7eb34fa94f740e47b41375eb725603fd058d4ecc/mirror.py
I will give it a try. But I had the impression from your blog post that this script doesn't download the Wayback Machine data. Am I right about that, or did I misunderstand?
Also, do you still have the original searchable database somewhere?
I'm a big fan of opentimestamps, btw!
The script didn't download any of the actual data. What it downloaded was the metadata, which (at the time) included sha1 hashes of all the actual data. At the time, at least, Wayback Machine data was also included in the general archive and its metadata was available, though the data itself was not available to the general public.
You can get the database that I made here: https://archive.org/details/opentimestamp-internetarchive-dataset It's a raw OTS calendar database, like any other, except that the commitments in it are the raw sha1 hashes of the Internet Archive items that have been timestamped.
Thanks for the support!
@petertodd -- thank you! I have a couple of follow-up questions. I posted them here, since they are somewhat off-topic for the Internet Archive: https://lists.opentimestamps.org/pipermail/ots-dev/2024-September/000122.html
I'm working on a digital timestamping project, based on https://opentimestamps.org/. I would like to digitally timestamp each day's archived content on the Wayback Machine. The way I would do this is to retrieve all content for a given date using the CDX API, and collect the SHA-1 digests for submitting to the timestamp service. Unfortunately, right now the CDX API doesn't allow me to search without providing a URL fragment. Could this restriction be lifted somehow? Thanks in advance.
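For reference, here is a rough sketch of the per-day collection step described above, as it works under the current restriction. It assumes the public CDX endpoint at `https://web.archive.org/cdx/search/cdx` with its `url`, `matchType`, `from`, `to`, `fl` and `output` parameters, and that the `digest` field is a base32-encoded SHA-1. The function names and the `example.com` prefix are made up for illustration. Note that it still has to pass a URL prefix, which is exactly the limitation this issue is about.

```python
import base64
import json
from urllib.parse import urlencode

# Assumed public CDX endpoint; not taken from this thread.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url_prefix, day):
    """Build a CDX query URL for captures under url_prefix on one day (YYYYMMDD)."""
    params = {
        "url": url_prefix,      # the API currently requires some URL here
        "matchType": "prefix",
        "from": day,
        "to": day,
        "fl": "digest",         # only return the digest column
        "output": "json",
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def digests_from_cdx_json(body):
    """Return the sorted, de-duplicated raw 20-byte SHA-1 digests in a CDX JSON response."""
    rows = json.loads(body)
    digests = set()
    for row in rows[1:]:        # the first row is the header of field names
        value = row[0]
        if len(value) != 32:    # skip anything that isn't a base32 SHA-1
            continue
        digests.add(base64.b32decode(value))
    return sorted(digests)
```

The response body could be fetched with `urllib.request.urlopen(build_cdx_query("example.com", "20240901")).read()`, and the resulting raw digests are what would be handed to the timestamp service.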