openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0
126 stars 37 forks source link

Gutenberg scraping is running very slowly #205

Closed benoit74 closed 10 months ago

benoit74 commented 10 months ago

We just started the gutenberg_mul_all recipe on 2.1.0 few days ago: https://farm.openzim.org/pipeline/742bc05b-9ac3-4a03-8c84-5e8a00095a7b/debug

The recipe is running on michaelblob worker which gave pretty decent performance in the past.

Here, after more than two days we are still at the point of processing Rsync results (i.e. basically we have retrieved the big tar.gz with all RDFs, we have parsed all RDFs into the DB, we have retrieved the Rsync content and we are processing it).

On my machine reaching the end of the PARSING phase takes more or less 25 minutes. Here it took more or less 18 hours.

Could it be linked to some performance issue on michaelblob due to other recipes or some other things running on the machine ?

rgaudin commented 10 months ago

michaelblob has other tasks running (started before, still updating) so we can rule out docker on the worker to have generally failed (can happen)

I'd advise you run the very same command on your machine to ensure there's no regression and still get to this point under 30mn. The startup of gutenberg relies both on rsync download (from a new server now) and on a filesystem intensive parsing (which can be incredibly slower on a busy machine with mechanical HDD rather than a developer's SSD).

benoit74 commented 10 months ago

The very same command (just had to remove optimization cache URL because my machine has no access) ran mostly as I saw before.

But you are right about the fact that parsing requires lots of CPU + Disk IO, and I'm alone on my developer SSD.

docker run -d --rm ghcr.io/openzim/gutenberg:2.1.0 gutenberg2zim --one-language-one-zim=/output --bookshelves

The exact order of operations at scraper startup is:

I get to the rsync operations in 38 minutes (rsync has not yet started) while it took 18 hours on michaelblob.

--2023-08-21 13:06:46--  http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2 [following]
--2023-08-21 13:06:46--  https://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61737778 (59M) [application/x-bzip2]
Saving to: ‘/output/rdf-files.tar.bz2’

     0K .......... .......... .......... .......... ..........  0%  253K 3m58s
    50K .......... .......... .......... .......... ..........  0%  750K 2m39s
[gutenberg2zim.constants::2023-08-21 13:06:46,214] INFO:CHECKING for dependencies on the system
[gutenberg2zim.constants::2023-08-21 13:06:46,226] INFO:PREPARING rdf-files cache from http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2
[gutenberg2zim.constants::2023-08-21 13:06:46,226] INFO:        Downloading http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2 into /output/rdf-files.tar.bz2
...
 60150K .......... .......... .......... .......... .......... 99% 20.9M 0s
 60200K .......... .......... .......... .......... .......... 99% 21.5M 0s
 60250K .......... .......... .......... ..........           100%  300K=8.0s

2023-08-21 13:06:57 (7.33 MB/s) - ‘/output/rdf-files.tar.bz2’ saved [61737778/61737778]

[gutenberg2zim.constants::2023-08-21 13:06:57,420] INFO:SETTING UP DATABASE
[gutenberg2zim.constants::2023-08-21 13:06:57,420] INFO:Setting up the database
[gutenberg2zim.constants::2023-08-21 13:06:57,427] DEBUG:Created table for license
[gutenberg2zim.constants::2023-08-21 13:06:57,427] INFO:Loading fixtures for license
[gutenberg2zim.constants::2023-08-21 13:06:57,429] DEBUG:[fixtures] Created Public domain in the USA.
[gutenberg2zim.constants::2023-08-21 13:06:57,431] DEBUG:[fixtures] Created None
[gutenberg2zim.constants::2023-08-21 13:06:57,433] DEBUG:[fixtures] Created Copyrighted. Read the copyright notice inside this book for details.
[gutenberg2zim.constants::2023-08-21 13:06:57,434] DEBUG:Created table for author
[gutenberg2zim.constants::2023-08-21 13:06:57,434] INFO:Loading fixtures for author
[gutenberg2zim.constants::2023-08-21 13:06:57,436] DEBUG:[fixtures] Created Various
[gutenberg2zim.constants::2023-08-21 13:06:57,438] DEBUG:[fixtures] Created Anonymous
[gutenberg2zim.constants::2023-08-21 13:06:57,442] DEBUG:Created table for book
[gutenberg2zim.constants::2023-08-21 13:06:57,443] INFO:Loading fixtures for book
[gutenberg2zim.constants::2023-08-21 13:06:57,446] DEBUG:Created table for bookformat
[gutenberg2zim.constants::2023-08-21 13:06:57,446] INFO:Loading fixtures for bookformat
[gutenberg2zim.constants::2023-08-21 13:06:57,448] DEBUG:Created table for url
[gutenberg2zim.constants::2023-08-21 13:06:57,449] INFO:Loading fixtures for url
[gutenberg2zim.constants::2023-08-21 13:06:57,449] INFO:PARSING rdf-files in /output/rdf-files.tar.bz2
[gutenberg2zim.constants::2023-08-21 13:06:57,449] INFO:        Looping throught RDF files in /output/rdf-files.tar.bz2
[gutenberg2zim.constants::2023-08-21 13:06:57,494] INFO:        Parsing file cache/epub/10000/pg10000.rdf for book id 10000
[gutenberg2zim.constants::2023-08-21 13:06:57,539] INFO:        Parsing file cache/epub/10001/pg10001.rdf for book id 10001
[gutenberg2zim.constants::2023-08-21 13:06:57,582] INFO:        Parsing file cache/epub/10002/pg10002.rdf for book id 10002
[gutenberg2zim.constants::2023-08-21 13:06:57,625] INFO:        Parsing file cache/epub/10003/pg10003.rdf for book id 10003
[gutenberg2zim.constants::2023-08-21 13:06:57,667] INFO:        Parsing file cache/epub/10004/pg10004.rdf for book id 10004
[gutenberg2zim.constants::2023-08-21 13:06:57,710] INFO:        Parsing file cache/epub/10005/pg10005.rdf for book id 10005
[gutenberg2zim.constants::2023-08-21 13:06:57,751] INFO:        Parsing file cache/epub/10006/pg10006.rdf for book id 10006
[gutenberg2zim.constants::2023-08-21 13:06:57,798] INFO:        Parsing file cache/epub/10007/pg10007.rdf for book id 10007
[gutenberg2zim.constants::2023-08-21 13:06:57,839] INFO:        Parsing file cache/epub/10008/pg10008.rdf for book id 10008
[gutenberg2zim.constants::2023-08-21 13:06:57,890] INFO:        Parsing file cache/epub/10009/pg10009.rdf for book id 10009

... tens of thousands of lines ommitted ...

[gutenberg2zim.constants::2023-08-21 13:44:06,784] INFO:    Parsing file cache/epub/9998/pg9998.rdf for book id 9998
[gutenberg2zim.constants::2023-08-21 13:44:06,809] INFO:    Parsing file cache/epub/9999/pg9999.rdf for book id 9999
[gutenberg2zim.constants::2023-08-21 13:44:06,837] INFO:    Parsing file cache/epub/999/pg999.rdf for book id 999
[gutenberg2zim.constants::2023-08-21 13:44:06,866] INFO:    Parsing file cache/epub/99/pg99.rdf for book id 99
[gutenberg2zim.constants::2023-08-21 13:44:06,892] INFO:    Parsing file cache/epub/9/pg9.rdf for book id 9
[gutenberg2zim.constants::2023-08-21 13:44:06,972] INFO:Add possible url to db
[gutenberg2zim.constants::2023-08-21 13:44:06,972] DEBUG:bash -c rsync -a --list-only rsync://aleph.pglaf.org/gutenberg/ > tmp/file_on_aleph_pglaf_org

I ran it on another machine with more limited CPU and it was very significantly slower (I did not let the task finish, but progress was significantly slower).

From this point, I imagine this is most probably a disk and/or CPU issue on michaelblob.

Is there something we can do? Should we restart the task on athena or just wait?

rgaudin commented 10 months ago

There's no point in letting it run. You can restart and see which worker picks it up and observe.

benoit74 commented 10 months ago

I cancelled and restarted, but michaelblob pick it again and showed poor performance again. I cancelled again and restarted it on athena, we will have to wait a bit more but at least it will run an another worker :)

benoit74 commented 10 months ago

Run on athena is behaving much better. It reach the rsync operations in 53 minutes, which is pretty decent.

I suggest to close this issue, WDYT?

rgaudin commented 10 months ago

OK