Add optimization cache support

openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg

https://download.kiwix.org/zim/gutenberg

GNU General Public License v3.0

Add optimization cache support #114

Closed: satyamtg closed this 4 years ago

satyamtg commented 4 years ago

This will fix #101 by introducing an S3-based optimization cache.

satyamtg commented 4 years ago

Currently, this S3 implementation mostly works. I have successfully tested both S3 upload and download with Wasabi. However, a few small things remain to be tested and implemented, one of them being covers. Currently, covers are zipped together with the HTML format and uploaded along with it. I don't think this should be a problem, as we add "html" to formats anyway if it's not there. What do you think, @rgaudin?

Also, I switched to the logger from zimscraperlib, because with the previous implementation all logs from boto3 were shown as well, which made debugging a nightmare.

Currently it works as follows: anything downloaded from the optimization cache is renamed to <optimized>_<book id>_<filename in static folder>, so while exporting we just copy these files to the static folder, create cover articles for them, and skip optimization. I did this because it seems a simple way to keep the procedures from exploding in complexity, given that we also maintain a download cache. We also associate the ETag with the book and store it in the DB so that it can be used while uploading.

satyamtg commented 4 years ago

Most of the optimized files are uploaded to the cache and it works perfectly. However, there are some HTML files that do not follow the URL pattern on aleph.gutenberg.org (I didn't do the full rsync, so all combinations were tried) and hence do not have an ETag. Those are not uploaded to the cache and are always downloaded from the site. They are very few in number. So, should we upload them to the cache with some default ETag, or should we leave them out (which is what is currently done)?

Also, I've made the following changes:

Zips are deleted after extraction (as they have no further role to play and take up unnecessary space).
Since some zips already contain covers (and so do our S3 uploads), we now check whether covers have already been downloaded before downloading them.

What do you think, @kelson42, @rgaudin, @dattaz?

satyamtg commented 4 years ago

@rgaudin thanks for that comprehensive review. I have made several changes to address it. The following has changed:

The directory structure in dl-cache is now different. Instead of having everything dumped in one place, we now have a subdirectory for each book, each containing two further directories: optimized (files downloaded from the optimization cache) and unoptimized (files downloaded from the source server). This means we no longer need to keep a list of files or scan many files in a long loop.
New code uses pathlib.Path wherever possible.
Book covers are uploaded to the cache separately and handled separately.
We now have an optimizer_version associated with each upload to the S3 bucket.
ETags for the files I mentioned that didn't follow the pattern are fixed. (It was actually my bad: they were URLs ending with .html.utf8, and I hadn't included them in the condition that gets the ETag and tries the S3 download.)
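For illustration, the per-book layout described above could be built roughly like this. This is only a sketch: the helper name book_dirs is hypothetical and is not taken from the PR; only the dl-cache/<book id>/optimized and dl-cache/<book id>/unoptimized layout comes from the discussion.

```python
from pathlib import Path


def book_dirs(download_cache, book_id):
    """Return (optimized_dir, unoptimized_dir) for one book, creating both.

    Hypothetical helper: the name and signature are assumptions made for
    this sketch, not the scraper's actual API.
    """
    book_dir = Path(download_cache) / str(book_id)
    optimized_dir = book_dir / "optimized"      # files fetched from the S3 cache
    unoptimized_dir = book_dir / "unoptimized"  # files fetched from the source server
    for directory in (optimized_dir, unoptimized_dir):
        directory.mkdir(parents=True, exist_ok=True)
    return optimized_dir, unoptimized_dir
```

With such a layout, finding the files of a book is a matter of listing two small directories rather than filtering one large one.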

rgaudin commented 4 years ago

Here's where I'm stuck at:

python gutenberg2zim -l fr -b 40248,28397,2650,30602 --download --export --zim --force --title-search --bookshelves --optimization-cache="https://s3.us-west-1.wasabisys.com/?keyId=AA&secretAccessKey=BB&bucketName=org-kiwix-dev-gutenberg"

[gutenbergtozim::2020-06-10 11:17:25,576] INFO:testing S3 Optimization Cache credentials
removing `AWS_PROFILE` variable from environment
[gutenbergtozim::2020-06-10 11:17:30,742] INFO:S3 Credentials OK. Continuing ...
[gutenbergtozim::2020-06-10 11:17:30,743] INFO:SETTING UP DATABASE
[gutenbergtozim::2020-06-10 11:17:30,743] INFO:Setting up the database
[gutenbergtozim::2020-06-10 11:17:30,744] DEBUG:license table already exists.
[gutenbergtozim::2020-06-10 11:17:30,745] DEBUG:format table already exists.
[gutenbergtozim::2020-06-10 11:17:30,745] DEBUG:author table already exists.
[gutenbergtozim::2020-06-10 11:17:30,746] DEBUG:book table already exists.
[gutenbergtozim::2020-06-10 11:17:30,746] DEBUG:bookformat table already exists.
[gutenbergtozim::2020-06-10 11:17:30,747] DEBUG:url table already exists.
[gutenbergtozim::2020-06-10 11:17:30,747] INFO:DOWNLOADING ebooks from mirror using filters
[gutenbergtozim::2020-06-10 11:17:30,766] INFO: Downloading content files for Book #2650
[gutenbergtozim::2020-06-10 11:17:30,767] INFO: Downloading content files for Book #28397
[gutenbergtozim::2020-06-10 11:17:30,767] INFO: Downloading content files for Book #30602
[gutenbergtozim::2020-06-10 11:17:30,767] INFO: Downloading content files for Book #40248
Traceback (most recent call last):
  File "gutenberg2zim", line 276, in <module>
    main(docopt(help, version=VERSION))
  File "gutenberg2zim", line 205, in main
    else None,
  File "/Users/reg/src/gutenberg/gutenbergtozim/download.py", line 372, in download_all_books
    Pool(concurrency).map(dlb, available_books)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/Users/reg/src/gutenberg/gutenbergtozim/download.py", line 369, in dlb
    b, download_cache, languages, formats, force, s3_storage, optimizer_version
  File "/Users/reg/src/gutenberg/gutenbergtozim/download.py", line 145, in download_book
    if not [fl for fl in dir_name.iterdir()]:
  File "/Users/reg/src/gutenberg/gutenbergtozim/download.py", line 145, in <listcomp>
    if not [fl for fl in dir_name.iterdir()]:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pathlib.py", line 1081, in iterdir
    for name in self._accessor.listdir(self):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pathlib.py", line 387, in wrapped
    return strfunc(str(pathobj), *args)
FileNotFoundError: [Errno 2] No such file or directory: 'dl-cache/2650/optimized'

satyamtg commented 4 years ago

Here's where I'm stuck at:

I think this should solve that (line 142) -

for dir_name in [d for d in (optimized_dir, unoptimized_dir) if d.exists()]:

Will change this and some other things that you pointed out.

rgaudin commented 4 years ago

Thanks, I added an if not dir_name.exists(): continue and that fixed it.
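A minimal sketch of that guard, assuming the loop only needs to collect the files found in each directory. The function name list_cached_files is hypothetical; the real loop in download.py does more than this.

```python
from pathlib import Path


def list_cached_files(dirs):
    """Collect files from each directory, skipping directories that do not exist."""
    files = []
    for dir_name in dirs:
        if not dir_name.exists():
            # Path.iterdir() raises FileNotFoundError on a missing directory,
            # which is the error shown in the traceback above.
            continue
        files.extend(sorted(p for p in dir_name.iterdir() if p.is_file()))
    return files
```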

I do have a couple of issues with the produced ZIM:

Many covers are missing. We were told that all books should have covers so either something's broken or we were fooled :)
Default (no cover) is to display the gutenberg favicon. It's not looking good and because it was supposed to receive a cover, the fixed size (50x70px) makes it look stretched. Obviously we should fix the cover issue but this fallback should be better (visibility: hidden maybe?)

I don't have any PDF… I think something's broken as I have missing logs for each book (1342 does have a PDF on the previous ZIM)

[gutenbergtozim::2020-06-10 12:30:46,900] INFO:downloaded dl-cache/1342/optimized/Pride and Prejudice.1342.epub from cache at 1342/epub
[gutenbergtozim::2020-06-10 12:30:46,902] DEBUG:b'[pdf] not avail. for #1342# Pride and Prejudice'
[gutenbergtozim::2020-06-10 12:30:46,905] DEBUG:b'[html] Requesting URLs for #1342# Pride and Prejudice'

it seems the dl-cache is not used on future runs. Whenever I call the scraper, it downloads from S3.
have some errors in the logs

[gutenbergtozim::2020-06-10 12:03:23,503] INFO:uploaded tmp/40248.zip to cache at 40248/html
[gutenbergtozim::2020-06-10 12:03:23,509] INFO:     Creating optimized EPUB file dl-cache/40248/unoptimized/40248.epub
[gutenbergtozim::2020-06-10 12:03:23,511] INFO:     Creating ePUB off dl-cache/40248/unoptimized/40248.epub at /Users/reg/src/gutenberg/tmp/tmp9sxo0wb_.epub
[gutenbergtozim::2020-06-10 12:03:23,511] ERROR:[Errno 2] No such file or directory: 'dl-cache/40248/unoptimized/40248.epub'
Traceback (most recent call last):
  File "/Users/reg/src/gutenberg/gutenbergtozim/export.py", line 831, in handle_unoptimized_files
    s3_storage=s3_storage,
  File "/Users/reg/src/gutenberg/gutenbergtozim/export.py", line 754, in handle_companion_file
    optimize_epub(src, tmp_epub.name)
  File "/Users/reg/src/gutenberg/gutenbergtozim/export.py", line 656, in optimize_epub
    with zipfile.ZipFile(src, "r") as zf:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/zipfile.py", line 1113, in __init__
    self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: 'dl-cache/40248/unoptimized/40248.epub'
[gutenbergtozim::2020-06-10 12:03:23,516] ERROR:        Exception while handling companion file: [Errno 2] No such file or directory: 'dl-cache/40248/unoptimized/40248.epub'
[gutenbergtozim::2020-06-10 12:03:23,520] INFO:     Exporting to static/Voyage autour de ma chambre_cover.40248.html
[gutenbergtozim::2020-06-10 12:03:23,558] INFO:uploaded static/28397_cover_image.jpg to cache at 28397/cover

the companion optimization seems to happen always

satyamtg commented 4 years ago

Many covers are missing. We were told that all books should have covers so either something's broken or we were fooled :)

This was due to two problems:

The name of the Book attribute used to store whether a cover is present was different in database.py and in rdf.py.
Some books do not have covers. See book 65 (it's a rare example, but it's there).

I don't have any PDF… I think something's broken as I have missing logs for each book (1342 does have a PDF on the previous ZIM)

This is due to an inconsistency between what's on the server and what's listed in the RDF. See the RDF of book 1342 (attached here as pg1342.zip): you won't see any PDF in it. However, we do have a PDF version on the server here: pg1342.zip

What we can do is simply look up links in the database for the formats missing for a book and add them to the list.

have some errors in the logs

This was due to a change in the EPUB file naming. We now have two versions, named xxx-images.epub and xxx-noimages.epub. This has been addressed locally and will be fixed in the next push.
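One way to cope with the two names is to try them in turn and fall back to the old plain name. The sketch below is an assumption-laden illustration: the function find_epub, the stem parameter, and the preference order (images first) are all hypothetical and may differ from what was actually pushed.

```python
from pathlib import Path


def find_epub(directory, stem):
    """Return the first existing EPUB candidate for `stem`, or None.

    Assumed order of preference: with images, without images, old naming.
    """
    candidates = (
        f"{stem}-images.epub",
        f"{stem}-noimages.epub",
        f"{stem}.epub",
    )
    for name in candidates:
        path = Path(directory) / name
        if path.exists():
            return path
    return None
```

Returning None rather than raising lets the caller log the missing companion file and move on to the next format.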

it seems the dl-cache is not used on future runs. Whenever I call the scraper, it downloads from S3.

It skips the download only if force is false, i.e. when --force is not passed.

rgaudin commented 4 years ago

Hum, S3 cache doesn't work…

Traceback (most recent call last):
  File "gutenberg2zim", line 276, in <module>
    main(docopt(help, version=VERSION))
  File "gutenberg2zim", line 205, in main
    else None,
  File "/Users/reg/src/gutenberg/gutenbergtozim/download.py", line 377, in download_all_books
    Pool(concurrency).map(dlb, available_books)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/Users/reg/src/gutenberg/gutenbergtozim/download.py", line 374, in dlb
    b, download_cache, languages, formats, force, s3_storage, optimizer_version
  File "/Users/reg/src/gutenberg/gutenbergtozim/download.py", line 286, in download_book
    optimizer_version=optimizer_version,
  File "/Users/reg/src/gutenberg/gutenbergtozim/s3.py", line 44, in download_from_cache
    meta.get("optimizer_version") != optimizer_version[book_format]
TypeError: string indices must be integers

This is due to OPTIMIZER_VERSION not being a dict…
Also, even when setting it for cover format, it doesn't seem to be used and I always get ERROR:etag doesn't match for 30602/cover. Expected None, got "5ebe5b77-45af"
I'm not sure I understand why we're passing down a constant to all those methods… probably because you change the value in entrypoint? How would meta.get("optimizer_version") != optimizer_version[book_format] work if optimizer_version is None? This will raise a TypeError, right?

Other issues I'll open tickets for as I want to see this merged in first:

I also still don't see covers on homepage (but those are present in article page)
I still don't have any PDF.
It seems we are optimizing covers at the very end, when exporting files. Is that relevant? Don't we already have an optimized cover at hand?

I also have error logs like

[gutenbergtozim::2020-06-16 09:22:49,099] ERROR:http://aleph.gutenberg.org/cache/epub/2650/pg2650.cover.medium.jpg > Problem while head request
('Connection aborted.', BadStatusLine('Date: Tue, 16 Jun 2020 09:06:38 GMT\r\n',))

[gutenbergtozim::2020-06-16 09:22:49,099] ERROR:http://aleph.gutenberg.org/cache/epub/28397/pg28397.cover.medium.jpg > Problem while head request
('Connection aborted.', BadStatusLine('Date: Tue, 16 Jun 2020 09:06:32 GMT\r\n',))

[gutenbergtozim::2020-06-16 09:22:49,100] ERROR:http://aleph.gutenberg.org/cache/epub/30602/pg30602.cover.medium.jpg > Problem while head request
('Connection aborted.', BadStatusLine('Date: Tue, 16 Jun 2020 09:06:38 GMT\r\n',))

It seems that this URL doesn't allow HEAD requests. Not sure why. Is this common?

satyamtg commented 4 years ago

This is due to OPTIMIZER_VERSION not being a dict…

Ah. My bad. I forgot to add gutenberg2zim when pushing. Fixed that.

I'm not sure I understand why we're passing down a constant to all those methods… probably because you change the value in entrypoint? How would meta.get("optimizer_version") != optimizer_version[book_format] work if optimizer_version is None? This will raise a TypeError, right?

Yup. You're right. Fixed that by adding a condition.
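A sketch of what such a condition might look like. The function name is_cache_entry_current is hypothetical, and treating a missing optimizer_version as "no version check" is an assumption; the actual behaviour in s3.py may differ.

```python
def is_cache_entry_current(meta, optimizer_version, book_format):
    """Tell whether a cached object matches the optimizer version for its format.

    `meta` is the metadata dict stored with the cached object and
    `optimizer_version` maps a format name to a version string, or is None.
    """
    if optimizer_version is None:
        # Assumption: without a configured version there is nothing to compare,
        # so the entry is accepted instead of raising a TypeError on None[...].
        return True
    return meta.get("optimizer_version") == optimizer_version.get(book_format)
```

For example, is_cache_entry_current({"optimizer_version": "v1"}, {"html": "v1"}, "html") is true, while the same call with {"html": "v2"} is false.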

It seems that this URL doesn't allow HEAD requests. Not sure why. Is this common?

Okay. This is strange. I never got these errors (as discussed over Slack). Also, my understanding is that the following error was caused by this as well: ERROR:etag doesn't match for 30602/cover. Expected None, got "5ebe5b77-45af"

I also still don't see covers on homepage (but those are present in article page)

This is fixed now. It happened because I changed the naming format for covers (to avoid clashes with other assets).

I still don't have any PDF.

I have 4 of them in the first 151 books + 1342. I mean I get PDFs for 1342 which you mentioned earlier.

eshellman commented 4 years ago

No. 65 has been reclassified as "data" at PG, so it should no longer be a problem. If there's still no cover, it usually means that the HTML is so bad that no EPUB could be made, and for Openzim purposes, it should be discarded.

eshellman commented 4 years ago

Books that have old-style file naming are gradually being reworked; they are likely to be completely gone by the end of the year.

rgaudin commented 4 years ago

OK, the S3 cache seems to be working fine now, but I still have a problem with PDF. I see in the log that the PDF is downloaded for 1342, I see that I have the file on disk, and the logs say it's being copied to the static folder:

[gutenbergtozim::2020-06-16 16:29:31,994] DEBUG:curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/1/3/4/1342/1342-pdf.pdf --output /Users/reg/src/gutenberg/dl-cache/1342/unoptimized/1342.pdf

.rw-r--r--  1.6M reg  staff 16 Jun 16:46  ./dl-cache/1342/unoptimized/1342.pdf

[gutenbergtozim::2020-06-16 16:29:44,243] INFO: Exporting Book #1342.
[gutenbergtozim::2020-06-16 16:29:44,243] WARNING:Missing HTML content for #1342 at dl-cache/1342/unoptimized/1342.html
[gutenbergtozim::2020-06-16 16:29:44,248] INFO:     Exporting to static/Du côté de chez Swann_cover.2650.html
[gutenbergtozim::2020-06-16 16:29:44,256] DEBUG:        Skipping existing companion Pride and Prejudice.1342.epub
[gutenbergtozim::2020-06-16 16:29:44,266] INFO:     Copying companion file to static/Pride and Prejudice.1342.pdf
[gutenbergtozim::2020-06-16 16:29:44,266] INFO:     Copying static/Pride and Prejudice.1342.pdf
[gutenbergtozim::2020-06-16 16:29:44,294] INFO:     Exporting to static/Pride and Prejudice_cover.1342.html

But it's not in the ZIM file (at /I/Pride and Prejudice.1342.pdf) and it's not listed in either the home page or the cover article.

eshellman commented 4 years ago

the 1342 pdf file is in an "old" directory, meaning it's not used anymore. I would remove any files from an "old" directory.

rgaudin commented 4 years ago

the 1342 pdf file is in an "old" directory, meaning it's not used anymore. I would remove any files from an "old" directory.

Thanks @eshellman, that we can do!
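A possible way to do that filtering, assuming the "old" directory shows up as a path component of the file URL. Both function names are hypothetical and the URLs in the usage note are made up for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def in_old_directory(url):
    """Tell whether any directory component of the URL path is named 'old'."""
    parts = PurePosixPath(urlsplit(url).path).parts
    # Ignore the last component: it is the file name, not a directory.
    return "old" in parts[:-1]


def drop_old_files(urls):
    """Keep only the URLs that are not located under an 'old' directory."""
    return [url for url in urls if not in_old_directory(url)]
```

For instance, drop_old_files(["http://example.org/1/2/old/12-pdf.pdf", "http://example.org/1/2/12.epub"]) keeps only the second URL.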