openzim / gutenberg

Scraper for downloading the entire ebooks repository of Project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Consider having consistent data always #127

Closed · satyamtg closed 4 years ago

satyamtg commented 4 years ago

This has the following changes:

- Fixes #126

satyamtg commented 4 years ago

This is OK and will prevent testing unwanted links, but we may want to download those files with a more robust downloader that performs extensive retries: we are talking about tens of thousands of files, so the risk of errors (mostly connection errors) is very high (although mitigated by S3). A sketch of such a retrying downloader is shown below.
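
A minimal sketch of what a more robust downloader with retries could look like, assuming the `requests` library; the function name, retry counts and timeouts are illustrative, not the project's actual code.

```python
import pathlib

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def download_with_retries(url: str, fpath: pathlib.Path, retries: int = 5) -> None:
    """Download `url` to `fpath`, retrying on connection errors and 5xx responses."""
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=1,  # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session.mount("http://", HTTPAdapter(max_retries=retry))
    session.mount("https://", HTTPAdapter(max_retries=retry))

    with session.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(fpath, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=2**20):
                fh.write(chunk)
```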

The downloads actually occur through curl, as seen in https://github.com/openzim/gutenberg/blob/825470482e30bd3ef70e79e587780504c0fc7fa4/gutenbergtozim/utils.py#L101. We might want to replace it with save_large_file() from scraperlib, as that would eventually be maintained in a much better manner there. Opening an issue for this.
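
A hedged sketch of the proposed swap, assuming `save_large_file(url, fpath)` from zimscraperlib's download module; the wrapper name and error handling below are assumptions, not the current code in gutenbergtozim/utils.py.

```python
import pathlib

from zimscraperlib.download import save_large_file


def download_file(url: str, fpath: pathlib.Path) -> bool:
    """Download `url` to `fpath` via scraperlib; return True on success."""
    try:
        # delegates retries/resumption concerns to scraperlib's maintained helper
        save_large_file(url, fpath)
        return True
    except Exception as exc:
        print(f"Error downloading {url}: {exc}")
        return False
```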