Closed by benoit74, 1 year ago
Looks like I found a significant issue around RDF parsing / DB filling. All timings below are from my machine of course.
A very significant portion of the parsing phase is spent in the `save_rdf_in_database` function, and performance gets worse as more books accumulate in the database (from 43 secs to process the first 1000 books up to 358 secs to process the last 1000). Luckily this is not a linear function of the number of books but closer to a log: it degrades fast at the beginning, then mostly stops degrading. Looking at cProfile data, most of the time is spent in the peewee `get_or_create` functions. I realized this function is obviously costly for tables with lots of records, like the `BookFormat` table. It is even more costly with the current schema since there is no primary key (composite of book_id + format_id) on that table, hence no index.
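For reference, this kind of hotspot can be located with Python's stdlib `cProfile`/`pstats`. A minimal sketch (the `parse_and_save` function below is a placeholder, not the real entry point):

```python
import cProfile
import pstats

def parse_and_save():
    # Placeholder standing in for the real RDF parsing + DB writing code.
    total = 0
    for i in range(1000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
parse_and_save()
profiler.disable()

# Sort by cumulative time and show the top 10 entries; in the real run
# this is where the get_or_create calls would show up at the top.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)
```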
Adding a composite key on this table significantly improved the situation (40 secs instead of 1 min 10 secs on my machine to extract from the tar, parse, and save the first 1000 books to the database), but it is still not very fast.
Looking a bit deeper at the schema, I realized that the `Format` table is not really needed, since we always process from `Book`, then `Format`, then `BookFormat`. We never perform a query from `Format` to `Book`. It is hence probably better to denormalize this `Format` table directly into the `BookFormat` table. This means we will create a lot of duplicate information (all the format data) in the `BookFormat` table, but it allows us to drop the `get_or_create` entirely; we can `create` directly.
This is what the last commit is about. It still needs more testing, but it looks like it is working well in my first tests. It is very, very fast, needing only 22 secs to process the first 1000 books (which is remarkable, since 21 of those secs are spent extracting and parsing the RDF, i.e. only 1 sec is used to store info in the database).
I will continue to test this, mainly doing a full parse of all books.
OK, let me know when you're satisfied and then I'll test myself
@rgaudin I'm happy with the new code, looks good to me
Some timings for the prepare + parse phases, code running on an AWS EC2 t3.micro:
As expected / already mentioned, parsing the RDF from memory (instead of extracting the tar to the filesystem) only brings a significant benefit for testing (no need to extract the whole tar to test only a few books), not for a whole scraper run.
The code with the DB schema changes is 7x faster ... I hope there are no other side effects I may have missed. I did not perform a full run with the new schema and all phases; I only performed a run with the new schema on all books for the prepare/parse phase alone, plus a full run (including ZIM creation) with the new schema on a few books.
OK ; let's merge and launch a test run on the zimfarm for the whole thing
As discussed in #97, this PR is a proposal to process rdf-files.tar.bz2 entirely from memory. It also has the advantage of using a pure Python implementation for tar processing instead of the tar utility.
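A minimal sketch of the in-memory approach, assuming the stdlib `tarfile` module is the pure Python implementation in question (function and member names below are illustrative, not the PR's actual code):

```python
import io
import tarfile

def iter_rdf_members(raw_bytes: bytes):
    """Yield (member name, decompressed bytes) for each .rdf entry,
    without writing anything to the filesystem."""
    # tarfile can read a .tar.bz2 straight from an in-memory buffer.
    with tarfile.open(fileobj=io.BytesIO(raw_bytes), mode="r:bz2") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".rdf"):
                fh = tar.extractfile(member)
                if fh is not None:
                    yield member.name, fh.read()
```

Because members are streamed one at a time, only the archive bytes and the current member need to fit in memory, and no temporary directory has to be created or cleaned up.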
Previous behavior was:
New behavior is:
Some timings: WIP
Some notes: