openzim / gutenberg

Scraper for downloading the entire ebooks repository of Project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Consider having consistent data always #127

Closed · satyamtg closed 4 years ago

satyamtg commented 4 years ago

This has the following changes:

- Fixes #126

satyamtg commented 4 years ago

This is OK and will prevent testing unwanted links, but we may want to download those files with a more robust downloader that performs extensive retries: we are talking about tens of thousands of files, so the risk of errors (mostly connection errors) is very high (although mitigated by S3). A sketch of such a retrying downloader is shown below.
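
A minimal sketch of what a more robust downloader with retries could look like, assuming the `requests` library; the function name, retry counts and timeouts are illustrative, not the project's actual code.

```python
import pathlib

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def download_with_retries(url: str, fpath: pathlib.Path, retries: int = 5) -> None:
    """Download `url` to `fpath`, retrying on connection errors and 5xx responses."""
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=1,  # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session.mount("http://", HTTPAdapter(max_retries=retry))
    session.mount("https://", HTTPAdapter(max_retries=retry))

    with session.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(fpath, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=2**20):
                fh.write(chunk)
```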

The downloads actually occur through curl, as seen in https://github.com/openzim/gutenberg/blob/825470482e30bd3ef70e79e587780504c0fc7fa4/gutenbergtozim/utils.py#L101. We might want to replace it with save_large_file() from scraperlib, as that would eventually be maintained in a much better manner there. Opening an issue for this.
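
A hedged sketch of the proposed swap, assuming `save_large_file(url, fpath)` from zimscraperlib's download module; the wrapper name and error handling below are assumptions, not the current code in gutenbergtozim/utils.py.

```python
import pathlib

from zimscraperlib.download import save_large_file


def download_file(url: str, fpath: pathlib.Path) -> bool:
    """Download `url` to `fpath` via scraperlib; return True on success."""
    try:
        # delegates retries/resumption concerns to scraperlib's maintained helper
        save_large_file(url, fpath)
        return True
    except Exception as exc:
        print(f"Error downloading {url}: {exc}")
        return False
```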