sillsdev / serval

A REST API for natural language processing services
MIT License

100% CPU for week #484

Closed: johnml1135 closed this issue 5 days ago

johnml1135 commented 1 month ago

What caused it?

Lots of DB queries. Could be locks, Hangfire monitoring, or something else.

[Screenshots attached: 2024-09-10 9:35 AM]

johnml1135 commented 1 month ago

prod_machine.locks: 95/min
prod_machine_jobs.hangfire.lock: 40/min
prod_serval_jobs.hangfire.lock: 40/min

johnml1135 commented 1 month ago

So, it appears that a word graph request is making the CPU go crazy indefinitely...


johnml1135 commented 1 month ago

So there is a stale lock that has been held for over a day, but not on the engine of interest. There was a 499 (operation canceled; a timeout?) for GetWordGraph right before it all started.

After restarting the engines, it all went back to normal.

The locks are the ones doing something weird: 600 commands per second on the production locks collection...

What is happening?

ddaspit commented 1 month ago

From my investigation, it looks like a lot of commands are being run, but I can't determine what the commands are. The translation engine whose call was canceled doesn't seem to exist in the database. I can't find any way that the current lock implementation would fire off so many commands. The recent issue I fixed in the lock (PR #486) could have something to do with this. Without any more information, I am out of ideas.

ddaspit commented 1 month ago

I think I found one way that a lock could get into a state where it keeps hammering the database with attempts to acquire the lock. If there are both:

  1. an expired reader or writer lock that hasn't been cleaned up
  2. a queued writer lock that hasn't been cleaned up

and another call tries to acquire a reader or writer lock, then that call will hammer the database in a loop (see the sketch below).

I'm not sure how 1 could happen after our recent changes. PR #486 should make it so that 2 can't happen.
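
For illustration, a minimal sketch of that loop, assuming a polling-style acquire against a lock store; the names (`ILockStore`, `TryAcquireAsync`, `NaiveDistributedLock`) are hypothetical and this is not Serval's actual lock code. The point is that if a stale lock entry is never cleaned up, a loop like this fails and retries with no delay, which could show up as hundreds of lock commands per second from a single blocked caller.

```csharp
// Minimal sketch (hypothetical names, not Serval's implementation) of how an
// acquire loop with no wait between attempts hammers the database when a
// stale lock entry blocks acquisition.
using System.Threading;
using System.Threading.Tasks;

public interface ILockStore
{
    // Atomically attempts to take the lock; returns false if an existing entry
    // (possibly expired but never cleaned up) still blocks it.
    Task<bool> TryAcquireAsync(string lockId, CancellationToken ct);
}

public class NaiveDistributedLock
{
    private readonly ILockStore _store;

    public NaiveDistributedLock(ILockStore store) => _store = store;

    public async Task AcquireAsync(string lockId, CancellationToken ct)
    {
        // If the blocking entry is stale and nothing ever removes it, this loop
        // never succeeds and never waits, so every iteration is another round
        // trip to the database.
        while (!await _store.TryAcquireAsync(lockId, ct))
        {
            ct.ThrowIfCancellationRequested();
        }
    }
}
```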

ddaspit commented 1 month ago

I submitted a PR (#491) that might reduce the chances of this happening.
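
One common mitigation for this kind of failure mode is to wait between acquisition attempts with a capped backoff, so a stuck acquire degrades into slow polling rather than a tight query loop. The sketch below is only an assumed illustration of that pattern, not necessarily what PR #491 does; it reuses the hypothetical `ILockStore` interface from the sketch above.

```csharp
// Assumed mitigation sketch: retry with capped exponential backoff so a
// blocked caller issues a few commands per second instead of hundreds.
using System;
using System.Threading;
using System.Threading.Tasks;

public class BackoffDistributedLock
{
    private readonly ILockStore _store;

    public BackoffDistributedLock(ILockStore store) => _store = store;

    public async Task AcquireAsync(string lockId, CancellationToken ct)
    {
        TimeSpan delay = TimeSpan.FromMilliseconds(50);
        while (!await _store.TryAcquireAsync(lockId, ct))
        {
            // Wait before retrying; double the delay each time, capped at 2 s,
            // so the lock is still picked up reasonably quickly once freed.
            await Task.Delay(delay, ct);
            delay = TimeSpan.FromMilliseconds(Math.Min(delay.TotalMilliseconds * 2, 2000));
        }
    }
}
```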

johnml1135 commented 5 days ago

Let's say this is resolved unless it comes back.