@rimu One important point is that each run of the scraper starts with an empty filesystem. The reason is that we have a cluster of scraper workers, so we don't know on which worker the next scrape will run. But we do have a cache in S3. The question, IMO, is more: should we keep a copy of this file in S3?
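For illustration, a minimal sketch of what checking the S3 cache before re-downloading could look like. The bucket name, key, and age threshold are all hypothetical; the scraper's actual cache layout may differ.

```python
from datetime import datetime, timedelta, timezone

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "scraper-cache"                    # hypothetical bucket name
KEY = "gutenberg/file_on_dante_pglaf_org"   # hypothetical cache key

def cached_copy_is_fresh(max_age: timedelta = timedelta(days=7)) -> bool:
    """True if the cached S3 object exists and is younger than max_age."""
    try:
        head = s3.head_object(Bucket=BUCKET, Key=KEY)
    except ClientError:  # object missing (404) or inaccessible
        return False
    return datetime.now(timezone.utc) - head["LastModified"] < max_age
```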
I didn't develop the Gutenberg scraper, but I believe this file is the catalogue of all ebooks. Considering that new ebooks are added to the Gutenberg library every few days and that we run the scraper once a month, I suspect it was decided that there was no added value in caching that file. But this is only my guess.
@rgaudin Any concrete feedback regarding this specific content?
The default behavior is to run all the steps, from a blank state, as @kelson42 said. If you've already run it, you can specify the steps you want to run instead. Check usage for available options. That rsync call happens during the --parse step.
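For example, to re-run only the parsing step against existing state (the command name and the --parse flag are the ones mentioned in this thread; the other step flags are listed in the usage help):

```
gutenberg2zip --parse
```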
There is a ticket about using an OPDS catalogue, which might simplify this part: see https://github.com/openzim/gutenberg/issues/97
Every time I run gutenberg2zip, a file called ./tmp/file_on_dante_pglaf_org is created.
This file is several hundred megabytes, and its contents are re-created each time the script is run. To make things worse, the data downloads unusually slowly (roughly 25 KB per second), making each run of the script much slower than necessary.
Should we add a check to see if the local copy of the file is stale and only download it if necessary?
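A minimal sketch of such a check, assuming a simple mtime-based staleness policy (the age threshold and the rsync source URL are placeholders, not the scraper's real values). A side benefit: when a stale local copy exists, rsync only transfers the changed parts, so keeping the old file around also speeds up the refresh itself.

```python
import os
import subprocess
import time

LOCAL_COPY = "./tmp/file_on_dante_pglaf_org"
RSYNC_SOURCE = "rsync://example.org/gutenberg/listing"  # placeholder, not the real URL
MAX_AGE = 7 * 24 * 3600  # treat copies older than a week as stale (arbitrary threshold)

def is_stale(path: str, max_age: float) -> bool:
    """True if the file is missing or its mtime is older than max_age seconds."""
    try:
        return (time.time() - os.path.getmtime(path)) > max_age
    except OSError:  # file does not exist yet
        return True

if is_stale(LOCAL_COPY, MAX_AGE):
    # -t preserves modification times so the age check works on the next run;
    # rsync transfers only deltas when a (stale) local copy already exists.
    subprocess.run(["rsync", "-t", RSYNC_SOURCE, LOCAL_COPY], check=True)
```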