Closed satyamtg closed 4 years ago
This is OK and will prevent testing unwanted links, but we may want to download those files with a more robust downloader that retries extensively, as we are talking about tens of thousands of files, so the risk of errors (mostly connection errors) is very high (although mitigated by S3).
The downloads actually occur through curl, as seen in https://github.com/openzim/gutenberg/blob/825470482e30bd3ef70e79e587780504c0fc7fa4/gutenbergtozim/utils.py#L101. We might want to replace it with save_large_file() from scraperlib, as that code would eventually be maintained in a much better manner there. Opening an issue for this.
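As a sketch of the retry behaviour discussed above: a small wrapper that retries a downloader callable with exponential backoff. The wrapper itself is generic; the `save_large_file` usage shown in the trailing comment assumes it is importable from scraperlib's download module, which should be verified against the installed version.

```python
import time


def download_with_retries(download, url, dest, attempts=5, backoff=2.0):
    """Call download(url, dest), retrying transient failures.

    Waits backoff ** attempt seconds between tries and re-raises
    the last error once all attempts are exhausted.
    """
    for attempt in range(1, attempts + 1):
        try:
            return download(url, dest)
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(backoff ** attempt)


# With scraperlib installed, this could be used as (assumed API):
# from zimscraperlib.download import save_large_file
# download_with_retries(save_large_file, url, fpath)
```

This keeps the retry policy in one place regardless of which downloader (curl subprocess or scraperlib) ends up underneath.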
This includes the following changes:
This fixes #126