openzim / gutenberg

Scraper for downloading the entire ebook repository of Project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Do not download tmp/file_on_dante_pglaf_org every time? #152

Closed: rimu closed this issue 1 year ago

rimu commented 1 year ago

Every time I run gutenberg2zip, a file called ./tmp/file_on_dante_pglaf_org is created.

This file is several hundred megabytes, and its contents are re-created each time the script is run. To make things worse, the data downloads unusually slowly (roughly 25 KB per second), making each run of the script much slower than necessary.

Should we add a check to see if the local copy of the file is stale and only download it if necessary?
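
For illustration only, here is a minimal sketch of such a staleness check, assuming a hypothetical one-week threshold and a placeholder rsync URL (the actual command and mirror are whatever the scraper currently uses):

```python
import os
import subprocess
import time

# Placeholder values for illustration; the scraper's real paths and
# rsync target may differ.
LOCAL_COPY = "tmp/file_on_dante_pglaf_org"
RSYNC_URL = "rsync://dante.pglaf.org/gutenberg"  # placeholder mirror URL
MAX_AGE_SECONDS = 7 * 24 * 3600  # treat copies older than a week as stale


def is_stale(path: str, max_age: int = MAX_AGE_SECONDS) -> bool:
    """Return True if the file is missing or older than max_age seconds."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # missing or unreadable: download again
    return (time.time() - mtime) > max_age


def ensure_listing() -> None:
    """Refresh the rsync'd listing only when the local copy looks stale."""
    if not is_stale(LOCAL_COPY):
        return
    result = subprocess.run(
        ["rsync", "--list-only", f"{RSYNC_URL}/"],
        check=True, capture_output=True, text=True,
    )
    os.makedirs(os.path.dirname(LOCAL_COPY), exist_ok=True)
    with open(LOCAL_COPY, "w") as fh:
        fh.write(result.stdout)
```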

kelson42 commented 1 year ago

@rimu One important point is that each run of the scraper starts with an empty filesystem. The reason is that we have a cluster of scraper workers, so we don't know which worker the next scrape will run on. But we do have a cache in S3. The bigger question, IMO, is: should we keep a copy of this file in S3?

I didn't develop the Gutenberg scraper, but I believe this file is the catalogue of all ebooks. Considering that a new ebook is added to the Gutenberg library every few days and that we run the scraper once a month, I suspect it was decided that there was no added value in caching that file. But this is only my guess.
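
If we did want to keep it in S3, a minimal sketch of the idea (with hypothetical bucket and key names and an assumed one-week freshness window, not the actual cache configuration our workers use) could look like this:

```python
from datetime import datetime, timedelta, timezone

import boto3
from botocore.exceptions import ClientError

# Hypothetical bucket/key names; the real cache bucket and layout used by
# the scraper workers are configured elsewhere.
BUCKET = "scraper-cache"
KEY = "gutenberg/file_on_dante_pglaf_org"
MAX_AGE = timedelta(days=7)

s3 = boto3.client("s3")


def fetch_cached_listing(dest: str) -> bool:
    """Download the cached listing from S3 if it exists and is recent enough."""
    try:
        head = s3.head_object(Bucket=BUCKET, Key=KEY)
    except ClientError:
        return False  # no cached copy yet
    if datetime.now(timezone.utc) - head["LastModified"] > MAX_AGE:
        return False  # cached copy is too old; rsync a fresh one instead
    s3.download_file(BUCKET, KEY, dest)
    return True


def store_listing(src: str) -> None:
    """Upload a freshly rsync'd listing so the next worker can reuse it."""
    s3.upload_file(src, BUCKET, KEY)
```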

@rgaudin Any concrete feedback regarding this specific content?

rgaudin commented 1 year ago

The default behavior is to run all the steps from a blank state, as @kelson42 said. If you've already run it, you can specify the steps you want to run instead; check the usage for available options. That rsync call happens during the --parse step.

kelson42 commented 1 year ago

There is a ticket about using an OPDS catalogue, which might simplify this part; see https://github.com/openzim/gutenberg/issues/97