Open jrochkind opened 4 months ago
Relevant lit

Tools

Each request has a request_id, which you can then search for in the Heroku logs. Take request 835e5b15-9fb1-449c-aa45-bb26d3d8e5bf, which I believe to be typical, as an example. Note the two four-second timeouts (12:01:46 - 12:01:50, then 12:01:50 - 12:01:54), with the server returning an error to the user immediately afterwards.

Searching the logs for code=H12 gives a list of recent genuine 30-second Heroku timeouts.

Findings

Our H12 errors are coming not from SearchStax, but from two other sources: the FAST subject-heading lookup service (which is never fast and often down), and the fixity report.

Suggestions

Let's consider leaving the retry code as is, but changing the timeout from 4 to just 2 seconds.
Thanks for looking into this!
I could SWEAR I saw timeouts related to solr... but I'm again wondering if it was writing to solr. At any rate, if we can't find one now, so be it. I guess if I find one again, I'll re-open and document better!
Solr can be slow sometimes... I'm not sure how much more often we'd be giving people errors if we reduced to 2 seconds? (We DO have one retry configured for solr errors though, I think...)
the FAST subject-heading lookup service (which is never fast and often down).
OK, separate ticket, but we've got to get a timeout configured there, which might require a PR or patch to QA, but we should do it! It should give up and assume no response long before 30 seconds!
Make another ticket at least? And maybe one for doing something about the fixity report too? (Also we should review whether any staff other than me is actually looking at it, and whether we want to either encourage staff to look at it... or just remove it instead of sinking more time into it!)
We have a ticket to add a timeout to Fast already.
And here's a new ticket for the fixity report.
I just created an alert in HB which will put a note in Slack when we get an H12 for any reason. Maximum one email a day; if it becomes annoying we can turn it off.
We intend to set a timeout on our app's connection to solr, to keep it from waiting indefinitely on a non-responsive solr. Because that happens sometimes, and we don't want to tie up all our web workers waiting on non-responsive solr.
We intend to have it wait only 4 seconds for solr before giving up:
https://github.com/sciencehistory/scihist_digicoll/blob/53a1f0cd346f7456f547f37d9c5cf0852a419a1b/config/blacklight.yml#L10
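For reference, the effect we're after at the HTTP level is roughly the following. This is a minimal sketch, not our actual config loading: the URL is made up, and the options shown are plain Faraday settings (Faraday being the HTTP library RSolr/Blacklight sit on top of).

```ruby
require "faraday"

# Sketch only: a client that gives up on Solr after ~4 seconds per attempt
# instead of waiting indefinitely. In the real app these values come from
# config/blacklight.yml.
solr = Faraday.new(url: ENV.fetch("SOLR_URL", "http://127.0.0.1:8983/solr/collection1")) do |f|
  f.options.open_timeout = 4 # seconds to wait to establish the connection
  f.options.timeout      = 4 # seconds to wait for a response once connected
  f.adapter Faraday.default_adapter
end
```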
Also, it is intended to retry twice, waiting only a fraction of a second between retries.
https://github.com/sciencehistory/scihist_digicoll/blob/53a1f0cd346f7456f547f37d9c5cf0852a419a1b/app/models/scihist/blacklight_solr_repository.rb#L40-L45
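The retry wiring amounts to something like Faraday's retry middleware, configured roughly as below. This is a hedged sketch, not a copy of the linked code -- the interval/backoff numbers are guesses chosen to line up with the arithmetic that follows, and on Faraday 2 the middleware comes from the separate faraday-retry gem.

```ruby
require "faraday"
require "faraday/retry" # faraday-retry gem

solr = Faraday.new(url: ENV.fetch("SOLR_URL", "http://127.0.0.1:8983/solr/collection1")) do |f|
  f.options.open_timeout = 4
  f.options.timeout      = 4

  # Two retries after the initial attempt, with a short (growing) pause
  # between attempts -- which is what produces the "4s + ~1s + 4s + ~1s + 4s"
  # worst case estimated below.
  f.request :retry,
            max:            2,
            interval:       0.5,
            backoff_factor: 2,
            exceptions:     [Faraday::TimeoutError, Faraday::ConnectionFailed]

  f.adapter Faraday.default_adapter
end
```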
This means that if Solr is being consistently unresponsive (or taking 10 seconds, or just more than 4, to respond), the app ought to wait: 4 seconds + ~1 second delay + 4 seconds + ~1 second delay + 4 seconds before giving up. Or around 14 seconds.
Which may already be far too long to tie up a worker on unresponsive solr? Maybe we want only ONE retry. (We have the retry in because in some cases SearchStax solr was being weird and a retry worked.) Maybe a timeout of less than 4 seconds? (Although solr can be unpredictably slow.)
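To make that tradeoff concrete, here are back-of-envelope worst cases for a few combinations. Pure arithmetic; the ~1 second pause between attempts is the estimate from above, not a measured value.

```ruby
# Rough worst-case wall time tied up on an unresponsive Solr:
#   attempts = 1 initial try + retries
#   total    = attempts * per_attempt_timeout + (pauses between attempts)
def worst_case(timeout:, retries:, pause: 1.0)
  (retries + 1) * timeout + retries * pause
end

worst_case(timeout: 4, retries: 2) # => 14.0  (current intent)
worst_case(timeout: 4, retries: 1) # =>  9.0  (drop to one retry)
worst_case(timeout: 2, retries: 2) # =>  8.0  (2-second timeout, keep both retries)
worst_case(timeout: 2, retries: 1) # =>  5.0  (2-second timeout, one retry)
```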
But additionally, it's not clear to me that it really is working like this.
In times of Solr unresponsiveness, we are having Heroku tell us lots of responses are timing out at 30 seconds, Heroku's own timeout. Suggesting they were being held up waiting for solr, in sum, much longer than the 14 seconds we expect.
So it's possible the configuration isn't working like we want anyway -- it was a bit convoluted to try to set it up.
Investigate how the solr timeout/retry is actually working, with the actual max time spent waiting for solr, then consider if we want to change it to keep the app more reliable even when solr is not.
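One cheap way to answer the "how long do we actually wait" question: build a client with the intended settings, point it at an address that will never answer, and time how long it takes to give up. Sketch only -- the blackhole IP is illustrative, this exercises the connect path, and checking the read-timeout path would need a server that accepts the connection but never responds.

```ruby
require "benchmark"
require "faraday"

# A client with the intended per-attempt timeouts, pointed at a private IP
# that (on most networks) will never answer, simulating an unreachable Solr.
conn = Faraday.new(url: "http://10.255.255.1:8983/solr") do |f|
  f.options.open_timeout = 4
  f.options.timeout      = 4
  f.adapter Faraday.default_adapter
end

elapsed = Benchmark.realtime do
  begin
    conn.get("select", q: "*:*")
  rescue Faraday::Error => e
    puts "gave up with #{e.class}"
  end
end
puts format("total wall time waiting: %.1fs", elapsed)
```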