peterwebster / henson

Master data store for the Hensley Henson Journals project, and issue tracker. The application code is kept elsewhere.
1 stars 1 forks source link

Requests timing out if next_journal or prev_journal missing #251

Closed ojlyytinen closed 4 years ago

ojlyytinen commented 4 years ago

If a journal is missing next_journal or prev_journal then requests take so long that they will timeout.

journals_controller will attempt to find next and prev journal (in get_next_journal and get_prev_journal) if they are not set. This bit of code

Journal.find{|f| f.sortKey.to_i == curr_journal.sortKey.to_i + 1 })

Will essentially try to load each and every Journal sequentially and then check if it is the next one. This will take a really long time and the request will just timeout.

This could be done faster with something like

Journal.where("sortKey_isim: [#{curr_journal.sortKey.to_i + 1} TO *]").first

I believe this should even cover the case where the sortKey number is missing and then get whatever is the next existing one.

peterwebster commented 4 years ago

@nomoregrapes could you possibly have a think about this?

For info to @ojlyytinen : the sortKey value is given in the TSV file which is ingested with every new batch of data, and so (in theory) should never be missing. next_journal and prev_journal are (I think) generated by the application, and I don't have a sense of why they appear sometimes but all the time.

peterwebster commented 4 years ago

Also, the actual sorting in search results is (I think) driven by the format of the filename, not any of these fields (there's a long story to that @nomoregrapes )

ojlyytinen commented 4 years ago

There's a slight complication in that the sortKey gets put in Solr as a multi-valued field which means that you can't order search results by it. I can search for the exact next sortKey value and if it exists then things are fine. But if one doesn't exist then what's the correct behaviour in that case?

Looking at the data in Solr, it looks like some journals do indeed miss sortKey altogether, at least in Solr. For example 04-12-11 is missing it in Solr but does actually have it in Fedora. This appears to be because in Fedora, for some reason, it's stored as a string rather than an integer and then it gets put to a different string valued field in Solr which largely just gets ignored. I could run a simple fix for all these records, there's about 300 of them. Assuming that whatever caused them to go in as strings is fixed then that should take care of that.

But if there still isn't a consecutive sortKey number then should next/prevJournal just be left blank? Or should it try to find whatever is the next existing sortKey? Even if the gap is a very large one?

peterwebster commented 4 years ago

Hi @ojlyytinen : this seems to be working well now on test, so do please push it to production.

ojlyytinen commented 4 years ago

I've pushed the changes to production and all seems to be working fine. The journal entries with missing next/prev Journal fields are loading fast now.

For future reference, as some of this was dealt with in emails, string sort keys converted to integers with a small script. get_next_journal and get_prev_journal functions in journal_controller and henson_scan_xml task were changed to only utilise Solr. It will try to first find a journal entry with sortKey exactly one bigger or smaller. If one is not found then it will look for one up to +/- 5. If it still can't find one then it will return nil and there won't be a link to next or previous journal.

Missing links are not written and saved. This could be done with a small script to resolve all missing links and storing them. This would speed up the controller slightly more where the links are missing. Though it is already pretty fast and there's not a huge number of missing links.

peterwebster commented 4 years ago

Tested and looking good. Many thanks @ojlyytinen - closing ticket now.