Open jrochkind opened 4 months ago
Relevant lit

Tools

Each request has a request_id, which you can then search for in the Heroku logs. Take request 835e5b15-9fb1-449c-aa45-bb26d3d8e5bf, which I believe to be typical, as an example. Note the two four-second timeouts (12:01:46 - 12:01:50, then 12:01:50 - 12:01:54), with the server returning an error to the user immediately afterwards.

Searching the logs for code=H12 gives a list of recent genuine 30-second Heroku timeouts.

Findings

Our H12 errors are coming not from SearchStax, but from two other sources: the FAST subject-heading lookup service (which is never fast and often down), and the fixity report.

Suggestions

Let's consider leaving the retry code as is, but changing the timeout from 4 to just 2 seconds.
Thanks for looking into this!
I could SWEAR I saw timeouts related to solr... but I'm again wondering if it was writing to solr. At any rate, if we can't find one now, so be it. I guess if I find one again, I'll re-open and document better!
Solr can be slow sometimes... I'm not sure how much more often we'd be giving people errors if we reduced to 2 seconds? (We DO have one retry configured for solr errors though, I think...)
the FAST subject-heading lookup service (which is never fast and often down).
OK, separate ticket, but we've got to get a timeout configured there, which might require a PR or patch to QA, but we should do it! It should give up and assume no response long before 30 seconds!
Make another ticket at least? And maybe one for doing something about the fixity report too? (Also we should review whether any staff other than me is actually looking at it, and whether we want to either encourage staff to look at it... or just remove it instead of sinking more time into it!)
We have a ticket to add a timeout to Fast already.
And here's a new ticket for the fixity report.
I just created an alert in HB which will put a note in Slack when we get an H12 for any reason. Maximum one email a day; if it becomes annoying we can turn it off.
We intend to set a timeout on our app's connection to solr, to keep it from waiting indefinitely on a non-responsive solr. Because that happens sometimes, and we don't want to tie up all our web workers waiting on non-responsive solr.
We intend to have it wait only 4 seconds for solr before giving up:
https://github.com/sciencehistory/scihist_digicoll/blob/53a1f0cd346f7456f547f37d9c5cf0852a419a1b/config/blacklight.yml#L10
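For reference, the effect we're after at the HTTP level is roughly the following. This is a minimal sketch, not our actual config loading: the URL is made up, and the options shown are plain Faraday settings (Faraday being the HTTP library RSolr/Blacklight sit on top of).

```ruby
require "faraday"

# Sketch only: a client that gives up on Solr after ~4 seconds per attempt
# instead of waiting indefinitely. In the real app these values come from
# config/blacklight.yml.
solr = Faraday.new(url: ENV.fetch("SOLR_URL", "http://127.0.0.1:8983/solr/collection1")) do |f|
  f.options.open_timeout = 4 # seconds to wait to establish the connection
  f.options.timeout      = 4 # seconds to wait for a response once connected
  f.adapter Faraday.default_adapter
end
```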
Also, it is intended to retry twice, waiting only a fraction of a second between retries.
https://github.com/sciencehistory/scihist_digicoll/blob/53a1f0cd346f7456f547f37d9c5cf0852a419a1b/app/models/scihist/blacklight_solr_repository.rb#L40-L45
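The retry wiring amounts to something like Faraday's retry middleware, configured roughly as below. This is a hedged sketch, not a copy of the linked code -- the interval/backoff numbers are guesses chosen to line up with the arithmetic that follows, and on Faraday 2 the middleware comes from the separate faraday-retry gem.

```ruby
require "faraday"
require "faraday/retry" # faraday-retry gem

solr = Faraday.new(url: ENV.fetch("SOLR_URL", "http://127.0.0.1:8983/solr/collection1")) do |f|
  f.options.open_timeout = 4
  f.options.timeout      = 4

  # Two retries after the initial attempt, with a short (growing) pause
  # between attempts -- which is what produces the "4s + ~1s + 4s + ~1s + 4s"
  # worst case estimated below.
  f.request :retry,
            max:            2,
            interval:       0.5,
            backoff_factor: 2,
            exceptions:     [Faraday::TimeoutError, Faraday::ConnectionFailed]

  f.adapter Faraday.default_adapter
end
```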
This means that if Solr is being consistently unresponsive (or taking 10 seconds, or just more than 4, to respond), the app ought to wait: 4 seconds + ~1 second delay + 4 seconds + ~1 second delay + 4 seconds before giving up. Or around 14 seconds.
Which may already be far too long to tie up a worker on unresponsive solr? Maybe we want only ONE retry. (We have the retry in because in some cases SearchStax solr was being weird and a retry worked.) Maybe a timeout of less than 4 seconds? (Although solr can be unpredictably slow.)
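To make that tradeoff concrete, here are back-of-envelope worst cases for a few combinations. Pure arithmetic; the ~1 second pause between attempts is the estimate from above, not a measured value.

```ruby
# Rough worst-case wall time tied up on an unresponsive Solr:
#   attempts = 1 initial try + retries
#   total    = attempts * per_attempt_timeout + (pauses between attempts)
def worst_case(timeout:, retries:, pause: 1.0)
  (retries + 1) * timeout + retries * pause
end

worst_case(timeout: 4, retries: 2) # => 14.0  (current intent)
worst_case(timeout: 4, retries: 1) # =>  9.0  (drop to one retry)
worst_case(timeout: 2, retries: 2) # =>  8.0  (2-second timeout, keep both retries)
worst_case(timeout: 2, retries: 1) # =>  5.0  (2-second timeout, one retry)
```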
But additionally, it's not clear to me that it really is working like this.
In times of Solr unresponsiveness, we are having Heroku tell us lots of responses are timing out at 30 seconds, Heroku's own timeout. Suggesting they were being held up waiting for solr, in sum, much longer than the 14 seconds we expect.
So it's possible the configuration isn't working like we want anyway -- it was a bit convoluted to try to set it up.
Investigate how the solr timeout/retry is actually working, with the actual max time spent waiting for solr, then consider if we want to change it to keep the app more reliable even when solr is not.
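One cheap way to answer the "how long do we actually wait" question: build a client with the intended settings, point it at an address that will never answer, and time how long it takes to give up. Sketch only -- the blackhole IP is illustrative, this exercises the connect path, and checking the read-timeout path would need a server that accepts the connection but never responds.

```ruby
require "benchmark"
require "faraday"

# A client with the intended per-attempt timeouts, pointed at a private IP
# that (on most networks) will never answer, simulating an unreachable Solr.
conn = Faraday.new(url: "http://10.255.255.1:8983/solr") do |f|
  f.options.open_timeout = 4
  f.options.timeout      = 4
  f.adapter Faraday.default_adapter
end

elapsed = Benchmark.realtime do
  begin
    conn.get("select", q: "*:*")
  rescue Faraday::Error => e
    puts "gave up with #{e.class}"
  end
end
puts format("total wall time waiting: %.1fs", elapsed)
```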