sciencehistory / scihist_digicoll

Science History Institute Digital Collections

Investigate and optimize solr timeouts #2684

Open jrochkind opened 4 months ago

jrochkind commented 4 months ago

We intend to set a timeout on our app's connection to solr, to keep it from waiting indefinitely on a non-responsive solr. Because that happens sometimes, and we don't want to tie up all our web workers waiting on a non-responsive solr.

We intend to have it wait only 4 seconds for solr before giving up:

https://github.com/sciencehistory/scihist_digicoll/blob/53a1f0cd346f7456f547f37d9c5cf0852a419a1b/config/blacklight.yml#L10
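For context, Blacklight hands these YAML values to RSolr, which forwards connection options on to Faraday; the 4-second cap would look something like this (a sketch with assumed key names, not copied from the linked file):

```yaml
# Sketch only: key names assumed, not the repo's actual config.
production:
  url: <%= ENV['SOLR_URL'] %>
  timeout: 4        # seconds to wait for each Solr response
  open_timeout: 4   # seconds to wait to open the connection
```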

It is also intended to retry twice, waiting only a fraction of a second between retries:

https://github.com/sciencehistory/scihist_digicoll/blob/53a1f0cd346f7456f547f37d9c5cf0852a419a1b/app/models/scihist/blacklight_solr_repository.rb#L40-L45
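The linked code wires this up through Faraday retry middleware; stripped of the middleware specifics, the intended behavior is roughly the following (a hypothetical helper to illustrate the semantics, not the app's actual code; the 0.5-second default interval is an assumption):

```ruby
require "timeout"

# Hypothetical sketch of the intended retry behavior: attempt the Solr
# call, and on a timeout or connection error, sleep briefly and try
# again, up to `max_retries` additional attempts before re-raising.
def with_solr_retries(max_retries: 2, interval: 0.5)
  attempts = 0
  begin
    yield
  rescue Timeout::Error, IOError
    attempts += 1
    raise if attempts > max_retries
    sleep interval
    retry
  end
end
```

So a request that fails twice and succeeds on the third attempt returns normally; a request that fails all three times raises to the caller.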

This means that if Solr is consistently unresponsive (or taking 10 seconds, or just more than 4, to respond), the app ought to wait 4 seconds + ~1 second delay + 4 seconds + ~1 second delay + 4 seconds before giving up, or around 14 seconds.
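That worst case can be written out directly, using the 4-second timeout and two retries described above (the ~1-second per-retry delay is an estimate, and the helper is just illustrative arithmetic):

```ruby
# Worst-case wall-clock time spent waiting on an unresponsive Solr:
# each attempt runs to the full timeout, plus a delay before each retry.
def worst_case_wait(timeout:, retries:, retry_delay:)
  (retries + 1) * timeout + retries * retry_delay
end

worst_case_wait(timeout: 4, retries: 2, retry_delay: 1)  # => 14 seconds
```

The same arithmetic shows that dropping to a 2-second timeout (keeping two retries) would cap the worst case at 8 seconds, and keeping 4 seconds but only one retry would cap it at 9.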

That may already be far too long to tie up a worker on an unresponsive solr. Maybe we want only ONE retry? (We have a retry in because in some cases SearchStax solr was being weird and a retry worked.) Maybe less than a 4-second timeout? (Although solr can be unpredictably slow.)

But additionally, it's not clear to me that it really is working like this.

In times of Solr unresponsiveness, heroku tells us that lots of responses are timing out at 30 seconds, heroku's own timeout, suggesting they were held up waiting for solr, in sum, much longer than the 14 seconds we expect.

So it's possible the configuration isn't working like we want anyway -- it was a bit convoluted to try to set it up.

Investigate how the solr timeout/retry is actually working, including the actual maximum time spent waiting on solr, then consider whether we want to change it to keep the app reliable even when solr is not.

eddierubeiz commented 4 days ago

Relevant lit

Tools

Findings

Suggestions

Let's consider leaving the retry code as is, but changing the timeout from 4 seconds to just 2.

jrochkind commented 4 days ago

Thanks for looking into this!

I could SWEAR I saw timeouts related to solr... but I'm again wondering whether it was writing to solr. At any rate, if we can't find it now, so be it. I guess if I find one again, I'll re-open, and document it better!

Solr can be slow sometimes... I'm not sure how much more often we'd be giving people errors if we reduced it to 2 seconds? (We DO have one retry configured for solr errors, though, I think...)

jrochkind commented 4 days ago

the FAST subject-heading lookup service (which is never fast and often down).

OK, separate ticket, but we've got to get a timeout configured there, which might require a PR or patch to QA, but we should do it! It should give up and assume no response long before 30 seconds!
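Whatever form the QA patch ends up taking, the underlying fix is just setting explicit HTTP timeouts on the lookup client; a stdlib sketch (the URL and the specific limits are illustrative assumptions, not the actual FAST endpoint or agreed-on values):

```ruby
require "net/http"
require "uri"

# Hypothetical lookup client with explicit timeouts, so a slow or down
# service fails fast instead of holding a web worker for 30 seconds.
uri = URI("https://fast.example.org/search") # placeholder, not the real endpoint
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.open_timeout = 2 # seconds to establish the connection
http.read_timeout = 3 # seconds to wait for a response
```

With those set, a hung FAST service surfaces as a `Net::OpenTimeout`/`Net::ReadTimeout` within a few seconds, which the autocomplete code can treat as "no suggestions."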

Make another ticket at least? And maybe one for doing something about the fixity report too? (Also, we should review whether any staff other than me are actually looking at it, and whether we want to either encourage staff to look at it... or just remove it instead of sinking more time into it!)

eddierubeiz commented 4 days ago

We have a ticket to add a timeout to Fast already.

eddierubeiz commented 4 days ago

And here's a new ticket for the fixity report.

eddierubeiz commented 4 days ago

I just created an alert in HB which will put a note in Slack whenever we get an H12 for any reason. Maximum one email a day; if it becomes annoying we can turn it off.