internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Data dumps unavailable during generation at beginning of month #9835

Closed francispeixoto closed 2 months ago

francispeixoto commented 2 months ago

Problem

The All Types and Ratings dump file links currently return 404.

All Types: https://openlibrary.org/data/ol_dump_latest.txt.gz
Ratings: https://openlibrary.org/data/ol_dump_ratings_latest.txt.gz

Reproducing the bug

  1. Go to https://openlibrary.org/developers/dumps
  2. Click on either the All Types or Ratings dump file download link

All Types: https://ia601601.us.archive.org/27/items/ol_dump_2024-08-31/ol_dump_2024-08-31.txt.gz
Ratings: https://ia601601.us.archive.org/27/items/ol_dump_2024-08-31/ol_dump_ratings_2024-08-31.txt.gz

scottbarnes commented 2 months ago

Do these links work for you now, @francispeixoto? We may have an issue whereby the links are broken while the dumps are generating.

francispeixoto commented 2 months ago

> Do these links work for you now, @francispeixoto? We may have an issue whereby the links are broken while the dumps are generating.

Yep both are working now. Thanks!

francispeixoto commented 2 months ago

Looks like it still fails from the console, though:

$ wget https://openlibrary.org/data/ol_dump_latest.txt.gz                            
--2024-09-02 16:39:02--  https://openlibrary.org/data/ol_dump_latest.txt.gz
Resolving openlibrary.org (openlibrary.org)... 207.241.234.205
Connecting to openlibrary.org (openlibrary.org)|207.241.234.205|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://archive.org/download/ol_dump_2024-08-31/ol_dump_2024-08-31.txt.gz [following]
--2024-09-02 16:39:02--  https://archive.org/download/ol_dump_2024-08-31/ol_dump_2024-08-31.txt.gz
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://ia600800.us.archive.org/13/items/ol_dump_2024-08-31/ol_dump_2024-08-31.txt.gz [following]
--2024-09-02 16:39:03--  https://ia600800.us.archive.org/13/items/ol_dump_2024-08-31/ol_dump_2024-08-31.txt.gz
Resolving ia600800.us.archive.org (ia600800.us.archive.org)... 0.0.0.0, ::
Connecting to ia600800.us.archive.org (ia600800.us.archive.org)|0.0.0.0|:443... failed: Connection refused.
Connecting to ia600800.us.archive.org (ia600800.us.archive.org)|::|:443... failed: Connection refused.

scottbarnes commented 2 months ago

Hmm, the plot thickens. I can use wget to fetch both from https://openlibrary.org/data/ol_dump_latest.txt.gz and https://ia600800.us.archive.org/13/items/ol_dump_2024-08-31/ol_dump_2024-08-31.txt.gz, at least at the minute.

francispeixoto commented 2 months ago

> Hmm, the plot thickens. I can use wget to fetch both from https://openlibrary.org/data/ol_dump_latest.txt.gz and https://ia600800.us.archive.org/13/items/ol_dump_2024-08-31/ol_dump_2024-08-31.txt.gz, at least at the minute.

Welp, I tried again on a 5G tether to bypass my firewall and it worked. Looks like I've got an investigation on my hands. Sorry for the worry!

francispeixoto commented 2 months ago

Looks like Pi-hole doesn't like archive.org out of the box. I had to explicitly whitelist it, and now my scripts fetch the file properly.
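For anyone hitting the same symptom: in the wget log above, the storage host resolved to 0.0.0.0 and ::, which is what a DNS-level blocker typically hands back. Here is a minimal sketch of a check for that, assuming the blocker answers with the unspecified address. The helper names `resolve` and `looks_sinkholed` are made up for illustration.

```python
import ipaddress
import socket
from typing import List


def resolve(hostname: str, port: int = 443) -> List[str]:
    """Return the distinct IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # sockaddr is the last tuple element; its first field is the address string.
    return sorted({info[4][0] for info in infos})


def looks_sinkholed(addresses: List[str]) -> bool:
    """True if every address is the unspecified address (0.0.0.0 or ::).

    Assumption: the local DNS blocker replies with the unspecified address
    for blocked names, as seen in the wget output in this thread.
    """
    if not addresses:
        return False
    # Drop any IPv6 zone suffix (e.g. "%eth0") before parsing.
    parsed = [ipaddress.ip_address(addr.split("%")[0]) for addr in addresses]
    return all(ip.is_unspecified for ip in parsed)


if __name__ == "__main__":
    host = "ia600800.us.archive.org"  # host taken from the wget log above
    try:
        found = resolve(host)
    except socket.gaierror as exc:
        print(f"{host}: resolution failed ({exc})")
    else:
        verdict = "possibly blocked by local DNS" if looks_sinkholed(found) else "resolves normally"
        print(f"{host}: {found} -> {verdict}")
```

If the check reports a block, comparing against a different resolver or network (as with the 5G tether above) helps confirm that the problem is local rather than on archive.org's side.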

mekarpeles commented 2 months ago

At the beginning of each month, while a new data dump is being generated and uploaded to archive.org (which is where the downloads are served from), there is a period when the item containing the files exists but its content is not ready yet. The "latest" link therefore resolves to this in-progress item, which does not work. We can probably add some logic so that the previous month's dump is used until the latest one is ready. For now, for anyone who hits this in the future, a workaround is to search archive.org for the historical Open Library dumps and use the most recent working one until the in-progress dump is ready.

Related links for anyone who may want to explore another workaround:
https://github.com/internetarchive/openlibrary/blob/91bca06dbd23b080b827ca7a273af1eecde48353/openlibrary/plugins/upstream/data.py#L76-L85
https://github.com/internetarchive/openlibrary/blob/91bca06dbd23b080b827ca7a273af1eecde48353/openlibrary/plugins/upstream/data.py#L15-L22
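On the client side, the "use the latest working dump" workaround could be sketched roughly as follows. This is not how Open Library implements anything. It assumes dump items are named after the last day of each month (only `ol_dump_2024-08-31` is confirmed in this thread) and that a HEAD request returning 200 means the file is ready. `URL_TEMPLATE`, `month_end_dates`, `url_is_ready`, and `first_ready_dump` are hypothetical names.

```python
import urllib.error
import urllib.request
from datetime import date, timedelta
from typing import Callable, List, Optional

# Assumption: pattern copied from the redirect shown in the wget log above.
URL_TEMPLATE = "https://archive.org/download/ol_dump_{d}/ol_dump_{d}.txt.gz"


def month_end_dates(today: date, months: int) -> List[date]:
    """Last day of each of the `months` months before `today`, newest first."""
    dates = []
    first_of_month = today.replace(day=1)
    for _ in range(months):
        last_of_previous = first_of_month - timedelta(days=1)
        dates.append(last_of_previous)
        first_of_month = last_of_previous.replace(day=1)
    return dates


def url_is_ready(url: str, timeout: float = 10.0) -> bool:
    """Hypothetical readiness probe: a HEAD request that ends in HTTP 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def first_ready_dump(
    today: date,
    months: int = 3,
    is_ready: Callable[[str], bool] = url_is_ready,
) -> Optional[str]:
    """Return the newest candidate URL that passes `is_ready`, or None."""
    for dump_date in month_end_dates(today, months):
        url = URL_TEMPLATE.format(d=dump_date.isoformat())
        if is_ready(url):
            return url
    return None


if __name__ == "__main__":
    print(first_ready_dump(date.today()))
```

Passing `is_ready` as a parameter keeps the date logic testable offline. If dumps turn out not to be dated at month end, only the candidate list needs to change.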

mekarpeles commented 2 months ago

@francispeixoto if you're interested in spending a few moments looking into a solution, we'd appreciate it! For now, though, we're marking the question as resolved (i.e. problem identified, workaround offered).