openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Raise error on zimwriterfs failure and update download cache always #133

Closed satyamtg closed 3 years ago

satyamtg commented 3 years ago

Runs zimwriterfs directly through subprocess, captures its output, and raises SystemExit if the return code isn't 0, logging the zimwriterfs output in that case.
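
A minimal sketch of that flow, assuming a plain subprocess call; the helper name and argument handling here are illustrative, not the PR's actual code:

```python
# Illustrative sketch only: runs zimwriterfs, captures its combined
# output, and aborts the scraper if the tool exits with a failure.
import logging
import subprocess

logger = logging.getLogger(__name__)


def run_zimwriterfs(args):
    """Run zimwriterfs, capture its output, and raise SystemExit on failure."""
    process = subprocess.run(
        ["zimwriterfs"] + list(args),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    if process.returncode != 0:
        # Log the captured zimwriterfs output so the failure is visible,
        # then stop instead of continuing with a missing or broken ZIM.
        logger.error("zimwriterfs failed:\n%s", process.stdout)
        raise SystemExit(process.returncode)
    return process.stdout
```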

This also allows updating the download cache with optimized versions of files if no S3 URL was passed, which is helpful for future runs on the same dl-cache.

satyamtg commented 3 years ago

For the upload, what you added is: if running with S3 enabled but you're not downloading a file because it's already present on disk, you'd upload it to the S3 cache? Is that it? If so, we don't want that. We want to upload to S3 only after optimizing, after downloading from the source because it was not in S3. This gutenberg-only dl-cache will be removed anyway.

Nope. Previously, if the --optimization-cache argument was not given, downloads and optimizations went fine, but the optimized files were kept only in static; the download cache was never updated to hold the optimized files in place of the unoptimized ones. So if you ran the export steps multiple times, optimization happened every time.

Now, when --optimization-cache is not given, downloads and optimizations still work as before, and after optimization the dl-cache contains the optimized file in optimized_dir instead of the unoptimized file in unoptimized_dir.
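
A hypothetical sketch of that dl-cache update, assuming the layout described above (an unoptimized_dir and an optimized_dir under the dl-cache); the function name, paths, and parameter names are illustrative, not the PR's actual code:

```python
# Illustrative sketch: after optimizing a file, keep the optimized copy
# in the dl-cache so later runs on the same cache skip re-optimizing.
import shutil
from pathlib import Path
from typing import Optional


def update_dl_cache(optimized_file: Path, dl_cache: Path, s3_url: Optional[str]):
    """Store the optimized file in the dl-cache when no S3 cache is configured."""
    if s3_url:
        # With --optimization-cache, optimized files are uploaded to S3 instead.
        return

    optimized_dir = dl_cache / "optimized_dir"
    unoptimized_dir = dl_cache / "unoptimized_dir"
    optimized_dir.mkdir(parents=True, exist_ok=True)

    # Keep the optimized version and drop the now-redundant unoptimized one.
    shutil.copy2(optimized_file, optimized_dir / optimized_file.name)
    stale = unoptimized_dir / optimized_file.name
    if stale.exists():
        stale.unlink()
```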