- prod_machine.locks: 95/min
- prod_machine_jobs.hangfire.lock: 40/min
- prod_serval_jobs.hangfire.lock: 40/min
So, it appears that a requested word graph is pegging the CPU indefinitely...
So there is a stale lock that has been held for over a day, but it is not on the engine of interest. There was a 499 (operation canceled; a timeout?) on GetWordGraph right before it all started.
After restarting the engines, it all went back to normal.
The locks are the ones behaving strangely - 600 commands per second on the production locks collection...
What is happening?
From my investigation, it looks like a lot of commands are being run, but I can't determine what those commands are. The translation engine whose call was canceled doesn't seem to exist in the database. I can't find any way that the current lock implementation would fire off so many commands. The recent fix I made to the lock (PR #486) could have something to do with this. Without any more information, I am out of ideas.
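For reference, one way to see what those commands actually are is to briefly turn on the MongoDB profiler and group what it captures by namespace and operation type. A rough sketch (hypothetical diagnostic script, not part of Machine or Serval; the connection string and database name are assumptions):

```python
"""Sketch: see what commands are hitting the locks collections.

Hypothetical diagnostic script (not part of Machine/Serval). The
connection string and database name are assumptions.
"""
import time
from collections import Counter

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumption: prod connection string
db = client["machine"]  # assumption: database holding the locks collection

# Profiling level 2 records every operation in db.system.profile.
db.command("profile", 2, slowms=0)
try:
    time.sleep(10)  # let traffic accumulate for a short sampling window

    # Group the captured operations by namespace and operation type.
    counts = Counter()
    for op in db["system.profile"].find({}, {"ns": 1, "op": 1}):
        counts[(op.get("ns"), op.get("op"))] += 1

    for (ns, op_type), n in counts.most_common(10):
        print(f"{ns} {op_type}: {n} ops in sample window")
finally:
    db.command("profile", 0)  # always restore the default profiling level
```

Level 2 profiling captures everything and adds overhead, so it should only be left on for a short sampling window.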
I think I found one way that a lock could get in a state where it keeps hammering the database with attempts to acquire the lock. If there is:
And another call tries to acquire a reader or writer lock, then it will hammer the database in a loop.
I'm not sure how condition 1 could happen after our recent changes, and PR #486 should make it so that condition 2 can't happen.
I submitted a PR (#491) that might reduce the chances of this happening.
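To illustrate the failure mode and the mitigation (a sketch only - the lock schema, field names, and collection here are made up, and this is not the actual lock code from Machine): an acquire loop with no wait between attempts turns one blocked caller into a tight query loop, while a bounded, jittered backoff keeps a stuck lock down to a few queries per second.

```python
"""Sketch of the hammering failure mode and a backoff mitigation.

Hypothetical lock schema and field names; not the actual Machine lock
implementation. Connection string and collection are assumptions.
"""
import random
import time
from datetime import datetime, timezone

from pymongo import MongoClient, ReturnDocument

client = MongoClient("mongodb://localhost:27017")  # assumption
locks = client["machine"]["locks"]                 # assumption


def try_acquire_writer(lock_id: str, holder: str) -> bool:
    """Claim the writer slot only if nobody currently holds the lock."""
    now = datetime.now(timezone.utc)
    doc = locks.find_one_and_update(
        {"_id": lock_id, "writer": None, "readers": {"$size": 0}},
        {"$set": {"writer": {"holder": holder, "acquiredAt": now}}},
        return_document=ReturnDocument.AFTER,
    )
    return doc is not None


def acquire_writer_hammering(lock_id: str, holder: str) -> None:
    # Failure mode: no wait between attempts, so a lock that is never
    # released (e.g. held by a canceled call) produces an unbounded
    # stream of find_one_and_update commands.
    while not try_acquire_writer(lock_id, holder):
        pass


def acquire_writer_with_backoff(lock_id: str, holder: str, timeout_s: float = 30.0) -> bool:
    # Mitigation: sleep with jitter between attempts and give up after a
    # timeout, so a stuck lock costs only a handful of queries per second.
    deadline = time.monotonic() + timeout_s
    delay = 0.05
    while time.monotonic() < deadline:
        if try_acquire_writer(lock_id, holder):
            return True
        time.sleep(delay + random.uniform(0, delay))
        delay = min(delay * 2, 2.0)
    return False
```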
Let's say this is resolved unless it comes back.
What caused it?
Lots of DB queries. They could be coming from the locks, from Hangfire monitoring, or from something else.
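If it comes back, one way to tell whether the queries are from the locks or from Hangfire monitoring is to look at which application the in-flight operations belong to via $currentOp. A rough sketch (hypothetical; the connection string and namespace filter are assumptions):

```python
"""Sketch: attribute in-flight operations to a client application.

Hypothetical diagnostic; assumes admin access to the production MongoDB
and pymongo >= 3.9 for database-level aggregation.
"""
from collections import Counter

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumption

# $currentOp lists operations that are in flight right now, including
# which application (appName) issued them.
pipeline = [
    {"$currentOp": {"allUsers": True, "idleConnections": False}},
    {"$match": {"ns": {"$regex": r"locks"}}},  # only lock-related namespaces
]

by_source = Counter()
for op in client.admin.aggregate(pipeline):
    by_source[(op.get("appName", "unknown"), op.get("ns"), op.get("op"))] += 1

for (app, ns, op_type), n in by_source.most_common():
    print(f"{app}: {n} in-flight {op_type} ops on {ns}")
```

Since $currentOp is only a snapshot of in-flight operations, it may need to be run in a loop for a few seconds to get a representative sample.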