openzim / gutenberg

Scraper for downloading the entire ebooks repository of Project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Enhance RDF parsing duration #164

Closed · benoit74 closed this 1 year ago

benoit74 commented 1 year ago

As discussed in #97, this PR is a proposal to process rdf-files.tar.bz2 entirely from memory. It also has the advantage of using a pure Python implementation for tar processing instead of the external tar utility (a sketch of the idea follows).
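For illustration, a minimal sketch of the in-memory approach using the standard-library tarfile module, assuming a local copy of the archive; the streaming mode and RDF filtering here are illustrative, not the exact code of this PR:

```python
import tarfile

def iter_rdf_members(archive_path="rdf-files.tar.bz2"):
    # "r|bz2" opens the archive as a sequential stream: members are
    # decompressed on the fly, nothing is written to the filesystem.
    with tarfile.open(archive_path, mode="r|bz2") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".rdf"):
                fileobj = tar.extractfile(member)
                if fileobj is not None:
                    yield member.name, fileobj.read()

# Example: peek at the first RDF document in the archive
for name, rdf_bytes in iter_rdf_members():
    print(name, len(rdf_bytes))
    break
```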

Previous behavior was:

New behavior is:

Some timings: WIP

Some notes:

benoit74 commented 1 year ago

Looks like I found a significant issue around RDF parsing / DB filling. All timings below are from my machine, of course.

A very significant portion of the parsing phase is spent in the save_rdf_in_database function, and performance gets worse as more books accumulate in the database (from 43 secs to process the first 1000 books up to 358 secs to process the last 1000). Fortunately, this is not linear in the number of books but closer to logarithmic: it degrades quickly at the beginning, then mostly stops degrading. Looking at cProfile data, most of the time is spent in peewee's get_or_create function. This function is obviously costly for tables with lots of records, like the BookFormat table, and it is even more costly with the current schema since that table has no primary key (a composite of book_id + format_id), hence no index.
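For anyone wanting to reproduce this, a hotspot like this can be surfaced with a minimal cProfile harness such as the sketch below; the commented-out call stands in for the actual parsing loop:

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
# ... run the parsing phase here, e.g. calling save_rdf_in_database()
#     for each book, as the scraper does ...
profiler.disable()

# Show the 20 most expensive calls, sorted by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```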

Adding a composite primary key to this table improved the situation significantly (40 secs instead of 1 min 10 secs on my machine to extract the first 1000 books from the tar, parse them, and save them in the database), but it is still not very fast.
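For reference, the schema change amounts to something like this minimal peewee sketch (model and field names are illustrative, not the exact repository schema):

```python
from peewee import CompositeKey, ForeignKeyField, Model, SqliteDatabase

db = SqliteDatabase("gutenberg.db")

class BaseModel(Model):
    class Meta:
        database = db

class Book(BaseModel):
    pass

class Format(BaseModel):
    pass

class BookFormat(BaseModel):
    book = ForeignKeyField(Book)
    format = ForeignKeyField(Format)

    class Meta:
        # The composite primary key doubles as a unique index, so the
        # SELECT inside get_or_create() no longer scans the whole table.
        primary_key = CompositeKey("book", "format")
```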

Looking a bit deeper at the schema, I realized that the Format table is not really needed, since we always process from Book to Format to BookFormat and never query from Format back to Book. It is hence probably better to denormalize the Format table directly into the BookFormat table. This means we will duplicate information (all the format columns) in the BookFormat table, but it removes the need for get_or_create entirely: we can create the row directly.
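In peewee terms, the denormalized table looks roughly like the sketch below (column names are illustrative); the point is that each (book, format) pair becomes a plain create(), i.e. a single INSERT with no lookup beforehand:

```python
from peewee import BooleanField, CharField, ForeignKeyField, Model, SqliteDatabase

db = SqliteDatabase("gutenberg.db")

class Book(Model):
    class Meta:
        database = db

class BookFormat(Model):
    book = ForeignKeyField(Book, backref="formats")
    # Former Format columns, duplicated on every row
    mime = CharField()
    images = BooleanField(default=False)

    class Meta:
        database = db

# One INSERT, no prior SELECT:
# BookFormat.create(book=book, mime="text/html", images=True)
```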

This is what the last commit is about. It still needs more testing, but it looks like it works well in my first tests. It is very, very fast, needing only 22 secs to process the first 1000 books (which is exceptional, since 21 secs are spent extracting + parsing the RDF, i.e. only 1 sec is spent storing info in the database).

I will continue to test this, mainly doing a full parse of all books.

rgaudin commented 1 year ago

OK, let me know when you're satisfied and then I'll test myself

benoit74 commented 1 year ago

@rgaudin I'm happy with the new code, looks good to me

Some timings for the prepare + parse phases, code running on an AWS EC2 t3.micro:

As expected / already mentioned, parsing the RDF from memory (instead of extracting the tar to the filesystem) only has a significant benefit for testing (no need to extract the whole tar to test only a few books), not for a whole scraper run.

The code with the DB schema changes is 7x faster ... I hope there are no other side effects I may have missed. I did not perform a full run with the new schema and all phases; I only performed a run with the new schema on all books for the prepare/parse phases, and a full run (including ZIM creation) with the new schema on a few books.

rgaudin commented 1 year ago

OK; let's merge and launch a test run on the zimfarm for the whole thing.