Closed satyamtg closed 4 years ago
Currently, this S3 implementation mostly works. I have successfully tested both S3 upload and download with Wasabi. However, a few small things remain to test and implement, one of them being covers. Currently, covers are zipped in with the HTML format and uploaded there. I don't think this should be a problem, as we add "html" to formats anyway if it's not there. What do you think, @rgaudin?
Also, I switched to the logger from zimscraperlib, because with the previous implementation all logs from boto3 were shown as well, which made debugging a nightmare.
Currently it works as follows -
Anything downloaded from the optimization cache is renamed as <optimized>_<book id>_<filename in static folder>, so we just copy these to the static folder while exporting, create cover articles for them and skip optimization. I did this as it seems to be a simple way to solve the problem of exploded procedures, since we also maintain a download cache. Also, we associate the ETag with the book and store it in the DB, to be used while uploading.
Most of the optimized files are uploaded to the cache and it works perfectly. However, there are some HTML files which do not follow the URL pattern on aleph.gutenberg.org (I didn't do the full rsync, so all combinations were tried) and hence do not have an ETag. Those are not uploaded to the cache and are always downloaded from the site. There are very few of them. So, should we upload them to the cache with some default ETag, or should we leave them as they are (which is what is currently done)?
Also, I've made the following changes -
What do you think @kelson42 , @rgaudin , @dattaz ?
@rgaudin thanks for that comprehensive review. I have made several changes to address it. The following has changed -
Here's where I'm stuck at:
python gutenberg2zim -l fr -b 40248,28397,2650,30602 --download --export --zim --force --title-search --bookshelves --optimization-cache="https://s3.us-west-1.wasabisys.com/?keyId=AA&secretAccessKey=BB&bucketName=org-kiwix-dev-gutenberg"
[gutenbergtozim::2020-06-10 11:17:25,576] INFO:testing S3 Optimization Cache credentials
removing `AWS_PROFILE` variable from environment
[gutenbergtozim::2020-06-10 11:17:30,742] INFO:S3 Credentials OK. Continuing ...
[gutenbergtozim::2020-06-10 11:17:30,743] INFO:SETTING UP DATABASE
[gutenbergtozim::2020-06-10 11:17:30,743] INFO:Setting up the database
[gutenbergtozim::2020-06-10 11:17:30,744] DEBUG:license table already exists.
[gutenbergtozim::2020-06-10 11:17:30,745] DEBUG:format table already exists.
[gutenbergtozim::2020-06-10 11:17:30,745] DEBUG:author table already exists.
[gutenbergtozim::2020-06-10 11:17:30,746] DEBUG:book table already exists.
[gutenbergtozim::2020-06-10 11:17:30,746] DEBUG:bookformat table already exists.
[gutenbergtozim::2020-06-10 11:17:30,747] DEBUG:url table already exists.
[gutenbergtozim::2020-06-10 11:17:30,747] INFO:DOWNLOADING ebooks from mirror using filters
[gutenbergtozim::2020-06-10 11:17:30,766] INFO: Downloading content files for Book #2650
[gutenbergtozim::2020-06-10 11:17:30,767] INFO: Downloading content files for Book #28397
[gutenbergtozim::2020-06-10 11:17:30,767] INFO: Downloading content files for Book #30602
[gutenbergtozim::2020-06-10 11:17:30,767] INFO: Downloading content files for Book #40248
Traceback (most recent call last):
File "gutenberg2zim", line 276, in <module>
main(docopt(help, version=VERSION))
File "gutenberg2zim", line 205, in main
else None,
File "/Users/reg/src/gutenberg/gutenbergtozim/download.py", line 372, in download_all_books
Pool(concurrency).map(dlb, available_books)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/Users/reg/src/gutenberg/gutenbergtozim/download.py", line 369, in dlb
b, download_cache, languages, formats, force, s3_storage, optimizer_version
File "/Users/reg/src/gutenberg/gutenbergtozim/download.py", line 145, in download_book
if not [fl for fl in dir_name.iterdir()]:
File "/Users/reg/src/gutenberg/gutenbergtozim/download.py", line 145, in <listcomp>
if not [fl for fl in dir_name.iterdir()]:
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pathlib.py", line 1081, in iterdir
for name in self._accessor.listdir(self):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pathlib.py", line 387, in wrapped
return strfunc(str(pathobj), *args)
FileNotFoundError: [Errno 2] No such file or directory: 'dl-cache/2650/optimized'
Here's where I'm stuck at:
I think this should solve that (line 142) -
for dir_name in [optimized_dir, unoptimized_dir] and dir_name.exists():
Will change this and some other things that you pointed out.
Thanks, I added an if not dir_name.exists(): continue and that fixed it.
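The guard mentioned above could look roughly like the sketch below. It is not the actual download_book code: the helper name has_cached_files and the way the directories are passed in are assumptions made for illustration. Only the idea of skipping directories that do not exist comes from the discussion.

```python
from pathlib import Path
from typing import Iterable


def has_cached_files(dirs: Iterable[Path]) -> bool:
    """Return True if any of the given directories exists and is non-empty.

    Hypothetical helper, named here for illustration only.
    """
    for dir_name in dirs:
        # iterdir() raises FileNotFoundError on a missing directory,
        # so skip directories that were never created.
        if not dir_name.exists():
            continue
        if any(dir_name.iterdir()):
            return True
    return False
```

With this shape, a book whose optimized directory was never created (as with dl-cache/2650/optimized in the traceback) is treated as having nothing cached there instead of aborting the whole pool.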
I do have a couple issues on the produced ZIM:
visibility: hidden maybe?)
[gutenbergtozim::2020-06-10 12:30:46,900] INFO:downloaded dl-cache/1342/optimized/Pride and Prejudice.1342.epub from cache at 1342/epub
[gutenbergtozim::2020-06-10 12:30:46,902] DEBUG:b'[pdf] not avail. for #1342# Pride and Prejudice'
[gutenbergtozim::2020-06-10 12:30:46,905] DEBUG:b'[html] Requesting URLs for #1342# Pride and Prejudice'
It seems the dl-cache is not used on future runs. Whenever I call the scraper, it downloads from S3.
[gutenbergtozim::2020-06-10 12:03:23,503] INFO:uploaded tmp/40248.zip to cache at 40248/html
[gutenbergtozim::2020-06-10 12:03:23,509] INFO: Creating optimized EPUB file dl-cache/40248/unoptimized/40248.epub
[gutenbergtozim::2020-06-10 12:03:23,511] INFO: Creating ePUB off dl-cache/40248/unoptimized/40248.epub at /Users/reg/src/gutenberg/tmp/tmp9sxo0wb_.epub
[gutenbergtozim::2020-06-10 12:03:23,511] ERROR:[Errno 2] No such file or directory: 'dl-cache/40248/unoptimized/40248.epub'
Traceback (most recent call last):
File "/Users/reg/src/gutenberg/gutenbergtozim/export.py", line 831, in handle_unoptimized_files
s3_storage=s3_storage,
File "/Users/reg/src/gutenberg/gutenbergtozim/export.py", line 754, in handle_companion_file
optimize_epub(src, tmp_epub.name)
File "/Users/reg/src/gutenberg/gutenbergtozim/export.py", line 656, in optimize_epub
with zipfile.ZipFile(src, "r") as zf:
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/zipfile.py", line 1113, in __init__
self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: 'dl-cache/40248/unoptimized/40248.epub'
[gutenbergtozim::2020-06-10 12:03:23,516] ERROR: Exception while handling companion file: [Errno 2] No such file or directory: 'dl-cache/40248/unoptimized/40248.epub'
[gutenbergtozim::2020-06-10 12:03:23,520] INFO: Exporting to static/Voyage autour de ma chambre_cover.40248.html
[gutenbergtozim::2020-06-10 12:03:23,558] INFO:uploaded static/28397_cover_image.jpg to cache at 28397/cover
- Many covers are missing. We were told that all books should have covers so either something's broken or we were fooled :)
This was precisely due to 2 problems -
- I don't have any PDF… I think something's broken as I have missing logs for each book (1342 does have a PDF on the previous ZIM)
This is due to an inconsistency between what's on the server and what's listed in the RDF. See the RDF of book 1342 (attached here as pg1342.zip): it doesn't list any PDF. However, a PDF version does exist on the server here pg1342.zip
What we can do is simply look up the links in the database for the formats missing for a book and add them to the list.
- have some errors in the logs
[gutenbergtozim::2020-06-10 12:03:23,503] INFO:uploaded tmp/40248.zip to cache at 40248/html
[gutenbergtozim::2020-06-10 12:03:23,509] INFO: Creating optimized EPUB file dl-cache/40248/unoptimized/40248.epub
[gutenbergtozim::2020-06-10 12:03:23,511] INFO: Creating ePUB off dl-cache/40248/unoptimized/40248.epub at /Users/reg/src/gutenberg/tmp/tmp9sxo0wb_.epub
[gutenbergtozim::2020-06-10 12:03:23,511] ERROR:[Errno 2] No such file or directory: 'dl-cache/40248/unoptimized/40248.epub'
Traceback (most recent call last):
File "/Users/reg/src/gutenberg/gutenbergtozim/export.py", line 831, in handle_unoptimized_files
s3_storage=s3_storage,
File "/Users/reg/src/gutenberg/gutenbergtozim/export.py", line 754, in handle_companion_file
optimize_epub(src, tmp_epub.name)
File "/Users/reg/src/gutenberg/gutenbergtozim/export.py", line 656, in optimize_epub
with zipfile.ZipFile(src, "r") as zf:
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/zipfile.py", line 1113, in __init__
self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: 'dl-cache/40248/unoptimized/40248.epub'
[gutenbergtozim::2020-06-10 12:03:23,516] ERROR: Exception while handling companion file: [Errno 2] No such file or directory: 'dl-cache/40248/unoptimized/40248.epub'
[gutenbergtozim::2020-06-10 12:03:23,520] INFO: Exporting to static/Voyage autour de ma chambre_cover.40248.html
[gutenbergtozim::2020-06-10 12:03:23,558] INFO:uploaded static/28397_cover_image.jpg to cache at 28397/cover
This was due to a change in EPUB file naming. There are now 2 versions, named xxx-images.epub and xxx-noimages.epub. This has been addressed (locally) and will be fixed in the next push.
- it seems the dl-cache is not used on future runs. Whenever I call the scraper, it downloads from S3.
It skips only if force is false.
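The behavior described in that reply can be sketched as follows. This is only an illustration: should_download is a hypothetical helper, not a function from the scraper, and the real check may look at more than whether the file exists.

```python
from pathlib import Path


def should_download(target: Path, force: bool) -> bool:
    """Decide whether a file must be fetched again (hypothetical helper)."""
    # With force set, the local copy is ignored and the file is re-fetched.
    if force:
        return True
    # Otherwise an existing local copy is reused and the download is skipped.
    return not target.exists()
```

Under this assumption, a run started with --force always goes back to S3, which would explain why the dl-cache looked unused.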
Hum, S3 cache doesn't work…
Traceback (most recent call last):
File "gutenberg2zim", line 276, in <module>
main(docopt(help, version=VERSION))
File "gutenberg2zim", line 205, in main
else None,
File "/Users/reg/src/gutenberg/gutenbergtozim/download.py", line 377, in download_all_books
Pool(concurrency).map(dlb, available_books)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/Users/reg/src/gutenberg/gutenbergtozim/download.py", line 374, in dlb
b, download_cache, languages, formats, force, s3_storage, optimizer_version
File "/Users/reg/src/gutenberg/gutenbergtozim/download.py", line 286, in download_book
optimizer_version=optimizer_version,
File "/Users/reg/src/gutenberg/gutenbergtozim/s3.py", line 44, in download_from_cache
meta.get("optimizer_version") != optimizer_version[book_format]
TypeError: string indices must be integers
This is due to OPTIMIZER_VERSION not being a dict…
cover format, it doesn't seem to be used and I always get ERROR:etag doesn't match for 30602/cover. Expected None, got "5ebe5b77-45af"
I'm not sure to understand why we're passing down a constant to all those methods… probably because you change the value in entrypoint? How would meta.get("optimizer_version") != optimizer_version[book_format] work if optimizer_version is None? This will raise a TypeError, right?
Other issues I'll open tickets for as I want to see this merged in first:
I also have error logs like
[gutenbergtozim::2020-06-16 09:22:49,099] ERROR:http://aleph.gutenberg.org/cache/epub/2650/pg2650.cover.medium.jpg > Problem while head request
('Connection aborted.', BadStatusLine('Date: Tue, 16 Jun 2020 09:06:38 GMT\r\n',))
[gutenbergtozim::2020-06-16 09:22:49,099] ERROR:http://aleph.gutenberg.org/cache/epub/28397/pg28397.cover.medium.jpg > Problem while head request
('Connection aborted.', BadStatusLine('Date: Tue, 16 Jun 2020 09:06:32 GMT\r\n',))
[gutenbergtozim::2020-06-16 09:22:49,100] ERROR:http://aleph.gutenberg.org/cache/epub/30602/pg30602.cover.medium.jpg > Problem while head request
('Connection aborted.', BadStatusLine('Date: Tue, 16 Jun 2020 09:06:38 GMT\r\n',))
It seems that this URL doesn't allow HEAD requests. Not sure why. Is this common?
- This is due to OPTIMIZER_VERSION not being a dict…
Ah. My bad. I forgot to add gutenberg2zim while pushing. Fixed that.
- I'm not sure to understand why we're passing down a constant to all those methods… probably because you change the value in entrypoint? How would meta.get("optimizer_version") != optimizer_version[book_format] work if optimizer_version is None? This will raise a TypeError, right?
Yup. You're right. Fixed that by adding a condition.
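A condition of that kind might look like the sketch below. The helper name is_outdated, the shape of meta, and the choice to treat a missing version mapping as "nothing to compare" are assumptions for illustration; only the comparison itself comes from the traceback.

```python
from typing import Mapping, Optional


def is_outdated(
    meta: Mapping[str, str],
    optimizer_version: Optional[Mapping[str, str]],
    book_format: str,
) -> bool:
    """Tell whether a cached object was made by a different optimizer version.

    Hypothetical helper, named here for illustration only.
    """
    # Without a version mapping there is nothing to compare against;
    # indexing None (or a plain string) is what raises the TypeError.
    if optimizer_version is None:
        return False
    return meta.get("optimizer_version") != optimizer_version[book_format]
```

Passing a per-format mapping such as {"html": "v1", "epub": "v1"} keeps the optimizer_version[book_format] lookup valid, while None simply disables the version check.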
It seems that this URL doesn't allow HEAD requests. Not sure why. Is this common?
Okay. This is strange. I never got these errors (as discussed over Slack).
Also, my understanding is that the following also happened due to this.
ERROR:etag doesn't match for 30602/cover. Expected None, got "5ebe5b77-45af"
I also still don't see covers on homepage (but those are present in article page)
This is fixed now. It was due to the fact that I changed the naming format for covers (to avoid clashes with other assets).
I still don't have any PDF.
I have 4 of them in the first 151 books + 1342. I mean I get PDFs for 1342 which you mentioned earlier.
No. 65 has been reclassified as "data" at PG, so it should no longer be a problem. If there's still no cover, it usually means that the HTML is so bad that no EPUB could be made, and for Openzim purposes, it should be discarded.
Books that have old-style file naming are gradually being reworked - they are likely to be completely gone by the end of the year.
OK, the S3 cache seems to be working fine now, but I still have a problem with PDFs. I see in the log that the PDF is downloaded for 1342, I see that I have the file on disk, and the logs say it's being copied to the static folder:
[gutenbergtozim::2020-06-16 16:29:31,994] DEBUG:curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/1/3/4/1342/1342-pdf.pdf --output /Users/reg/src/gutenberg/dl-cache/1342/unoptimized/1342.pdf
.rw-r--r-- 1.6M reg staff 16 Jun 16:46 ./dl-cache/1342/unoptimized/1342.pdf
[gutenbergtozim::2020-06-16 16:29:44,243] INFO: Exporting Book #1342.
[gutenbergtozim::2020-06-16 16:29:44,243] WARNING:Missing HTML content for #1342 at dl-cache/1342/unoptimized/1342.html
[gutenbergtozim::2020-06-16 16:29:44,248] INFO: Exporting to static/Du côté de chez Swann_cover.2650.html
[gutenbergtozim::2020-06-16 16:29:44,256] DEBUG: Skipping existing companion Pride and Prejudice.1342.epub
[gutenbergtozim::2020-06-16 16:29:44,266] INFO: Copying companion file to static/Pride and Prejudice.1342.pdf
[gutenbergtozim::2020-06-16 16:29:44,266] INFO: Copying static/Pride and Prejudice.1342.pdf
[gutenbergtozim::2020-06-16 16:29:44,294] INFO: Exporting to static/Pride and Prejudice_cover.1342.html
But it's not in the ZIM file (at /I/Pride and Prejudice.1342.pdf) and it's listed neither on the home page nor in the cover article.
the 1342 pdf file is in an "old" directory, meaning it's not used anymore. I would remove any files from an "old" directory.
Thanks @eshellman, that we can do !
This will fix #101 by introducing an S3-based optimization cache