openzim / gutenberg

Scraper for downloading the entire ebooks repository of Project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Enhance RDF parsing duration #164

Closed · benoit74 closed this 1 year ago

benoit74 commented 1 year ago

As discussed in #97, this PR is a proposal to process rdf-files.tar.bz2 entirely from memory. It also has the advantage of using a pure Python implementation for tar processing instead of the external tar utility (a sketch of the idea follows).
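For illustration, a minimal sketch of the in-memory approach using the standard-library tarfile module, assuming a local copy of the archive; the streaming mode and RDF filtering here are illustrative, not the exact code of this PR:

```python
import tarfile

def iter_rdf_members(archive_path="rdf-files.tar.bz2"):
    # "r|bz2" opens the archive as a sequential stream: members are
    # decompressed on the fly, nothing is written to the filesystem.
    with tarfile.open(archive_path, mode="r|bz2") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".rdf"):
                fileobj = tar.extractfile(member)
                if fileobj is not None:
                    yield member.name, fileobj.read()

# Example: peek at the first RDF document in the archive
for name, rdf_bytes in iter_rdf_members():
    print(name, len(rdf_bytes))
    break
```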

Previous behavior was:

New behavior is:

Some timings: WIP

Some notes:

benoit74 commented 1 year ago

Looks like I found a significant issue around RDF parsing / DB filling. All timings below are from my machine, of course.

A very significant portion of the parsing phase is spent in the save_rdf_in_database function, and performance gets worse as more books accumulate in the database (from 43 secs to process the first 1000 books up to 358 secs to process the last 1000). Fortunately, this is not linear in the number of books but closer to logarithmic: it degrades quickly at the beginning, then mostly stops degrading. Looking at cProfile data, most of the time is spent in peewee's get_or_create function. This function is obviously costly for tables with lots of records, like the BookFormat table, and it is even more costly with the current schema since that table has no primary key (a composite of book_id + format_id), hence no index.
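For anyone wanting to reproduce this, a hotspot like this can be surfaced with a minimal cProfile harness such as the sketch below; the commented-out call stands in for the actual parsing loop:

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
# ... run the parsing phase here, e.g. calling save_rdf_in_database()
#     for each book, as the scraper does ...
profiler.disable()

# Show the 20 most expensive calls, sorted by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```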

Adding a composite primary key to this table improved the situation significantly (40 secs instead of 1 min 10 secs on my machine to extract the first 1000 books from the tar, parse them, and save them in the database), but it is still not very fast.
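For reference, the schema change amounts to something like this minimal peewee sketch (model and field names are illustrative, not the exact repository schema):

```python
from peewee import CompositeKey, ForeignKeyField, Model, SqliteDatabase

db = SqliteDatabase("gutenberg.db")

class BaseModel(Model):
    class Meta:
        database = db

class Book(BaseModel):
    pass

class Format(BaseModel):
    pass

class BookFormat(BaseModel):
    book = ForeignKeyField(Book)
    format = ForeignKeyField(Format)

    class Meta:
        # The composite primary key doubles as a unique index, so the
        # SELECT inside get_or_create() no longer scans the whole table.
        primary_key = CompositeKey("book", "format")
```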

Looking a bit deeper at the schema, I realized that the Format table is not really needed, since we always process from Book to Format to BookFormat and never query from Format back to Book. It is hence probably better to denormalize the Format table directly into the BookFormat table. This means we will duplicate information (all the format columns) in the BookFormat table, but it removes the need for get_or_create entirely: we can create the row directly.
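In peewee terms, the denormalized table looks roughly like the sketch below (column names are illustrative); the point is that each (book, format) pair becomes a plain create(), i.e. a single INSERT with no lookup beforehand:

```python
from peewee import BooleanField, CharField, ForeignKeyField, Model, SqliteDatabase

db = SqliteDatabase("gutenberg.db")

class Book(Model):
    class Meta:
        database = db

class BookFormat(Model):
    book = ForeignKeyField(Book, backref="formats")
    # Former Format columns, duplicated on every row
    mime = CharField()
    images = BooleanField(default=False)

    class Meta:
        database = db

# One INSERT, no prior SELECT:
# BookFormat.create(book=book, mime="text/html", images=True)
```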

This is what the last commit is about. It still needs more testing, but it looks like it works well in my first tests. It is very, very fast, needing only 22 secs to process the first 1000 books (which is exceptional, since 21 secs are spent extracting + parsing the RDF, i.e. only 1 sec is spent storing info in the database).

I will continue to test this, mainly doing a full parse of all books.

rgaudin commented 1 year ago

OK, let me know when you're satisfied and then I'll test myself

benoit74 commented 1 year ago

@rgaudin I'm happy with the new code, looks good to me

Some timings for the prepare + parse phases, code running on an AWS EC2 t3.micro:

As expected / already mentioned, parsing the RDF from memory (instead of extracting the tar to the filesystem) only has a significant benefit for testing (no need to extract the whole tar to test only a few books), not for a whole scraper run.

The code with the DB schema changes is 7x faster ... I hope there are no other side effects I may have missed. I did not perform a full run with the new schema and all phases; I only performed a run with the new schema on all books for the prepare/parse phases, and a full run (including ZIM creation) with the new schema on a few books.

rgaudin commented 1 year ago

OK; let's merge and launch a test run on the zimfarm for the whole thing.