internetarchive / wayback

IA's public Wayback Machine (moved from SourceForge)

Allow retrieval of all saved items for a given day #273

Open arthuredelstein opened 3 months ago

arthuredelstein commented 3 months ago

I'm working on a digital timestamping project, based on https://opentimestamps.org/. I would like to digitally timestamp each day's archived content on the Wayback Machine. The way I would do this is to retrieve all content for a given date using the CDX API, and collect the SHA-1 digests for submitting to the timestamp service. Unfortunately, right now the CDX API doesn't allow me to search without providing a URL fragment. Could this restriction be lifted somehow? Thanks in advance.
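For context, a minimal sketch of what the CDX API allows today: querying one URL pattern for a single day and collecting the digests. It assumes the public endpoint at `https://web.archive.org/cdx/search/cdx` with its `url`, `from`, `to`, `fl` and `output=json` parameters, and that the `digest` field is a base32-encoded SHA-1. The function names are hypothetical, and the required `url` parameter is exactly the restriction this issue asks about.

```python
import base64
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url_pattern, day):
    """Build a CDX query for captures of url_pattern on one day (YYYYMMDD)."""
    params = {
        "url": url_pattern,  # still required: no date-only search
        "from": day,
        "to": day,
        "fl": "timestamp,original,digest",
        "output": "json",
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def digests_from_rows(rows):
    """Decode the digest column of a JSON CDX response into raw SHA-1 bytes.

    With output=json the first row is a header naming the columns.
    """
    if not rows:
        return []
    header, *records = rows
    idx = header.index("digest")
    return [base64.b32decode(record[idx]) for record in records]


def fetch_digests(url_pattern, day):
    """Run the query over the network and return the raw SHA-1 digests."""
    with urlopen(build_cdx_url(url_pattern, day)) as resp:
        body = resp.read().decode("utf-8")
    rows = json.loads(body) if body.strip() else []
    return digests_from_rows(rows)
```

Something like `fetch_digests("example.com/*", "20240901")` would then return the digests for that one site on that day, which is why covering a whole day of the archive is not practical this way.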

petertodd commented 3 months ago

Did you find the original code I used to timestamp the Internet Archive? It might still work.

arthuredelstein commented 2 months ago

Hi @petertodd -- it's this one, correct? https://github.com/opentimestamps/opentimestamps-server/blob/7eb34fa94f740e47b41375eb725603fd058d4ecc/mirror.py

I will give it a try. But I had the impression from your blog post that this script doesn't download the WaybackMachine data. Am I right about that or did I misunderstand?

Also, do you still have the original searchable database somewhere?

I'm a big fan of opentimestamps, btw!

petertodd commented 2 months ago

The script didn't download any data. What it downloaded was the metadata, which (at the time) included sha1 hashes of all the actual data. At the time, at least, WaybackMachine data was also included in the general archive, and its metadata was available, though the data itself was not available to the general public.

You can get the database that I made here: https://archive.org/details/opentimestamp-internetarchive-dataset It's a raw OTS calendar database like any other, except that the commitments in it are the raw sha1 hashes of the Internet Archive items that have been timestamped.
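As a rough illustration of the "metadata with sha1 hashes" idea, the sketch below turns hex sha1 values from item metadata into raw 20-byte commitments. It assumes metadata shaped like the JSON served by archive.org's item metadata endpoint (a `files` list whose entries may carry a hex `sha1` field). The function name and sample item are hypothetical, and this is not necessarily how mirror.py does it.

```python
def sha1_commitments(metadata):
    """Collect raw 20-byte SHA-1 digests from an item-metadata-shaped dict."""
    commitments = []
    for entry in metadata.get("files", []):
        hex_digest = entry.get("sha1")
        if hex_digest:  # some files may have no sha1 recorded
            commitments.append(bytes.fromhex(hex_digest))
    return commitments


# Hypothetical sample item; the digest is the SHA-1 of an empty file.
sample = {
    "files": [
        {"name": "empty.txt", "sha1": "da39a3ee5e6b4b0d3255bfef95601890afd80709"},
        {"name": "no_hash.xml"},
    ]
}
raw = sha1_commitments(sample)
```

Each element of `raw` is the kind of 20-byte value that could be submitted as a timestamp commitment.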

Thanks for the support!

arthuredelstein commented 1 month ago

@petertodd -- thank you! I have a couple of follow-up questions. I posted them here since they are somewhat off-topic for the Internet Archive: https://lists.opentimestamps.org/pipermail/ots-dev/2024-September/000122.html