Follow-up: I re-ran the calls to getWordGraph, and they returned their (very large) results quickly.
Is there some warm up time required for SMT?
Yes, there is some warm up time (loading the model) when first calling the endpoint. The Cloudflare timeouts are probably too short for these endpoints.
As per Damien's insight, this is likely an engine lock not being released. We need to determine the best way to fix this. Options include:
One fix could be to use a standard timeout for every call to DistributedReaderWriterLock.WriterLockAsync: if the timeout expires, we log it as an error and throw an exception. This lock is used in 20+ places all over the code. We expect the lock to never time out, but if it does, this keeps the engine from coming to a standstill and gives us some breadcrumbs as to what may have failed.
The HTTP timeout is 60 seconds, and I believe each call waits on at most one lock, so we could set the lock timeout to 55 seconds.
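A minimal sketch of what this could look like, assuming `WriterLockAsync` accepts a `CancellationToken`. The `IDistributedReaderWriterLock` interface shape, the method signature, and the extension-method wrapper here are all assumptions for illustration, not the actual codebase API:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

// Assumed shape of the existing lock abstraction (for illustration only).
public interface IDistributedReaderWriterLock
{
    Task<IAsyncDisposable> WriterLockAsync(
        TimeSpan? lifetime = null, CancellationToken cancellationToken = default);
}

public static class DistributedReaderWriterLockExtensions
{
    // 55 s keeps us just under the 60 s HTTP timeout, so a stuck lock
    // surfaces as a logged error instead of an opaque 504 from the proxy.
    private static readonly TimeSpan AcquireTimeout = TimeSpan.FromSeconds(55);

    public static async Task<IAsyncDisposable> WriterLockWithTimeoutAsync(
        this IDistributedReaderWriterLock rwLock,
        ILogger logger,
        CancellationToken cancellationToken = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(AcquireTimeout);
        try
        {
            // Assumption: WriterLockAsync observes the token and throws
            // OperationCanceledException when it is triggered.
            return await rwLock.WriterLockAsync(cancellationToken: cts.Token);
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // The caller didn't cancel, so this was our timeout: leave a
            // breadcrumb and fail this call rather than stalling the engine.
            logger.LogError("Timed out after {Timeout} waiting for the writer lock.", AcquireTimeout);
            throw new TimeoutException($"Writer lock was not acquired within {AcquireTimeout}.");
        }
    }
}
```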
The lifetime is just for trying to acquire the lock, not for cancelling a lock that is already held. If a writer grabs the lock and holds onto it, we must assume either (1) the process has exited in some unusual way without releasing the lock, or (2) the process is hanging. Case (2) does not apply to locks scoped to an HTTP call, but some locks aren't scoped that way. Moreover, if resetting the servers fixed it, the `finally` code that clears the lock likely ran, which points to (2): the process is hanging forever. Here is a proposed way to address the surface issue (processes hang and are never terminated) and to figure out which operation is actually hanging.
The lifetime is the max duration of the acquired lock. Once the lock expires, other callers can acquire a lock even if it has never been released.
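If that is the semantics, a hedged sketch of how a lifetime could bound a hung holder. The lifetime parameter shape, the two-minute value, and `DoEngineWorkAsync` are placeholders, not the real API or call:

```csharp
// Assumption: WriterLockAsync accepts an optional lifetime, after which the
// lock expires server-side even if the holder never releases it.
await using (await rwLock.WriterLockAsync(lifetime: TimeSpan.FromMinutes(2)))
{
    // If this section hangs past the lifetime, other callers can still
    // acquire the lock, so one stuck process no longer stalls the engine.
    // Logging entry/exit here would also show which operation is hanging.
    await DoEngineWorkAsync(); // placeholder for the real guarded work
}
```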
Confirmed: it's a lock that doesn't die.
Clearing the lock didn't help - but it may be the ClearML monitor.
I got the issue again. There were no writer locks being held, yet it was still failing with the timeouts (just on the cancel/delete/add endpoints), and resetting everything fixed it.
All of those endpoints try to acquire a writer lock, so it could be a reader lock that hasn't been released.
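To illustrate the suspicion, a sketch of the failure mode. The `ReaderLockAsync` name is an assumption mirroring the writer side, not a confirmed API:

```csharp
// A reader lock that is acquired but never disposed (and has no lifetime)
// would block every later writer acquisition indefinitely (assumed API).
var leakedReader = await rwLock.ReaderLockAsync(); // never disposed

// ...later, the cancel/delete/add endpoints all do something like:
await using (await rwLock.WriterLockAsync()) // waits forever on the leaked reader
{
    // never reached; the endpoint times out with a 504 instead
}
```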
On the latest QA, for some (admittedly large) projects I am receiving 504 Gateway Timeout errors. These have occurred both today and late last week.
For example, calling:
Returns: