hathitrust / hathifiles

Generation of Hathifiles

Profile full monthly generation #52

Open aelkiss opened 3 days ago

aelkiss commented 3 days ago

Generating a full hathifile takes many, many hours. It's probably worth profiling a run to see what's slow and whether we can speed it up. It could be the database queries, or it could be JSON parsing (we could try https://github.com/anilmaurya/fast_jsonparser).

aelkiss commented 3 days ago

It looks like it queries the rights database once for every item, but the result is only used for the access profile and the rights update timestamp. We could consider:

  1. Doing the queries in batches, ideally via the rights API.
  2. Doing a sort of offline join: get the Zephir file sorted by id, dump the rights database sorted by id as well, and then iterate through both in a single pass. This has often been the most performant approach I can come up with for this kind of problem. We could potentially add an 'export all' option to the rights API to support this.
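
For illustration, the offline join in option 2 could look roughly like the sketch below. It is written in Python only to show the shape of the algorithm, not as a proposal for the actual implementation. The function name `merge_join`, the `(id, record)` tuple layout, and the `access_profile` / `update_time` field names are all hypothetical, and it assumes both inputs are sorted ascending by id under the same comparison (for example, both produced with a bytewise sort).

```python
from typing import Any, Iterable, Iterator, Optional, Tuple

Record = Tuple[str, Any]  # (id, payload); this layout is an assumption for the sketch


def merge_join(
    items: Iterable[Record], rights: Iterable[Record]
) -> Iterator[Tuple[str, Any, Optional[Any]]]:
    """Yield (id, item, rights_or_None) for every item.

    Both inputs must already be sorted ascending by id using the same
    ordering. Each input is read exactly once, so no per-item query is needed.
    """
    rights_iter = iter(rights)
    current = next(rights_iter, None)
    for item_id, item in items:
        # Skip rights rows whose id sorts before the current item.
        while current is not None and current[0] < item_id:
            current = next(rights_iter, None)
        if current is not None and current[0] == item_id:
            yield item_id, item, current[1]
        else:
            # No rights row for this item.
            yield item_id, item, None


# Made-up sample data standing in for the sorted Zephir file and rights dump.
items = [
    ("mdp.001", {"title": "A"}),
    ("mdp.002", {"title": "B"}),
    ("uc1.b1", {"title": "C"}),
]
rights = [
    ("aaa.000", {"access_profile": "open", "update_time": "2020-01-01"}),
    ("mdp.001", {"access_profile": "open", "update_time": "2021-05-01"}),
    ("uc1.b1", {"access_profile": "google", "update_time": "2022-03-15"}),
]

joined = list(merge_join(items, rights))
```

The sort order matters more than it looks: if the Zephir file and the rights dump are sorted under different collations, the two streams can drift out of step and rows will silently fail to match.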