Follow-up: I re-ran the calls to getWordGraph, and they returned their (very large) results quickly.
Is there some warm up time required for SMT?
Yes, there is some warm up time (loading the model) when first calling the endpoint. The Cloudflare timeouts are probably too short for these endpoints.
As per Damien's insight, this is likely an engine lock not being released. We need to determine the best way to fix this. Options include:
One fix could be to use a standard timeout for every call to DistributedReaderWriterLock.WriterLockAsync: if the timeout expires, we log it as an error and throw an exception. This lock is used in 20+ places all over the code. We expect the lock to never time out, but if it does, this keeps the engine from coming to a standstill and gives us some breadcrumbs as to what may have failed.
The HTTP timeout is 60 seconds, and I believe each call waits on at most one lock, so we could set the lock timeout to 55 seconds.
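A minimal sketch of what this could look like, assuming `WriterLockAsync` accepts a `CancellationToken`. The `IDistributedReaderWriterLock` interface shape, the method signature, and the extension-method wrapper here are all assumptions for illustration, not the actual codebase API:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

// Assumed shape of the existing lock abstraction (for illustration only).
public interface IDistributedReaderWriterLock
{
    Task<IAsyncDisposable> WriterLockAsync(
        TimeSpan? lifetime = null, CancellationToken cancellationToken = default);
}

public static class DistributedReaderWriterLockExtensions
{
    // 55 s keeps us just under the 60 s HTTP timeout, so a stuck lock
    // surfaces as a logged error instead of an opaque 504 from the proxy.
    private static readonly TimeSpan AcquireTimeout = TimeSpan.FromSeconds(55);

    public static async Task<IAsyncDisposable> WriterLockWithTimeoutAsync(
        this IDistributedReaderWriterLock rwLock,
        ILogger logger,
        CancellationToken cancellationToken = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(AcquireTimeout);
        try
        {
            // Assumption: WriterLockAsync observes the token and throws
            // OperationCanceledException when it is triggered.
            return await rwLock.WriterLockAsync(cancellationToken: cts.Token);
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // The caller didn't cancel, so this was our timeout: leave a
            // breadcrumb and fail this call rather than stalling the engine.
            logger.LogError("Timed out after {Timeout} waiting for the writer lock.", AcquireTimeout);
            throw new TimeoutException($"Writer lock was not acquired within {AcquireTimeout}.");
        }
    }
}
```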
The lifetime is just for trying to acquire the lock, not for cancelling a lock that is already held. If a writer grabs the lock and holds onto it, we must assume either (1) the process has exited in some unusual way without releasing the lock, or (2) the process is hanging. Case (2) does not apply to locks scoped to an HTTP call, but some locks aren't scoped that way. Moreover, if resetting the servers fixed it, the `finally` code that clears the lock likely ran, which points to (2): the process is hanging forever. Here is a proposed way to address the surface issue (processes hang and are never terminated) and to figure out which operation is actually hanging.
The lifetime is the max duration of the acquired lock. Once the lock expires, other callers can acquire a lock even if it has never been released.
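If that is the semantics, a hedged sketch of how a lifetime could bound a hung holder. The lifetime parameter shape, the two-minute value, and `DoEngineWorkAsync` are placeholders, not the real API or call:

```csharp
// Assumption: WriterLockAsync accepts an optional lifetime, after which the
// lock expires server-side even if the holder never releases it.
await using (await rwLock.WriterLockAsync(lifetime: TimeSpan.FromMinutes(2)))
{
    // If this section hangs past the lifetime, other callers can still
    // acquire the lock, so one stuck process no longer stalls the engine.
    // Logging entry/exit here would also show which operation is hanging.
    await DoEngineWorkAsync(); // placeholder for the real guarded work
}
```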
Confirmed: it's a lock that doesn't die.
Clearing the lock didn't help - but it may be the ClearML monitor.
I got the issue again. There were no writer locks being held, yet it was still failing with the timeouts (just on the cancel/delete/add endpoints), and resetting everything fixed it.
All of those endpoints try to acquire a writer lock, so it could be a reader lock that hasn't been released.
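To illustrate the suspicion, a sketch of the failure mode. The `ReaderLockAsync` name is an assumption mirroring the writer side, not a confirmed API:

```csharp
// A reader lock that is acquired but never disposed (and has no lifetime)
// would block every later writer acquisition indefinitely (assumed API).
var leakedReader = await rwLock.ReaderLockAsync(); // never disposed

// ...later, the cancel/delete/add endpoints all do something like:
await using (await rwLock.WriterLockAsync()) // waits forever on the leaked reader
{
    // never reached; the endpoint times out with a 504 instead
}
```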
On the latest QA, for some (admittedly large) projects I am receiving 504 Gateway Timeout errors. These have occurred both today and late last week.
For example, calling:
Returns: