usnistgov / ACVP-Server

A repository tracking releases of NIST's ACVP server. See www.github.com/usnistgov/ACVP for the protocol.

Health Route Giving 503 on Production #275

Closed AlexThurston closed 1 year ago

AlexThurston commented 1 year ago

It looks like the health route on production is returning a 503. It does return a partial body:

{
  "serverVersion": "v1.1.0.29-1",
  "details": [
    {
      "key": "testSessionProcessing",
      "description": "The TestSession internal processing load status.",
      "data": {
        "healthStatusDefinitions": {
          "Healthy": "Oldest pending TestSession is < 1 hours old.",
          "Degraded": "Oldest pending TestSession is > 1 hours old.",
          "Unhealthy": "Oldest pending TestSession is > 4 hours old."
        }
      }
    }
  ]
}

But the status key is missing, both at the top level and inside the testSessionProcessing detail.
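A missing status key like this can be detected programmatically. The following is a minimal sketch, not part of the ACVP tooling: the function name `missing_status_paths` is hypothetical, and the sample body is an abbreviated copy of the production response above (the `data` object is emptied for brevity).

```python
import json


def missing_status_paths(payload):
    """Return the places in a health payload where "status" is absent.

    Hypothetical helper for illustration; checks the top level and each
    entry of the "details" list.
    """
    missing = []
    if "status" not in payload:
        missing.append("status")
    for index, detail in enumerate(payload.get("details", [])):
        if "status" not in detail:
            # Prefer the detail's own key for readability, else its index.
            label = detail.get("key", index)
            missing.append(f"details[{label}].status")
    return missing


# Abbreviated copy of the production body quoted in this issue.
production_body = json.loads("""
{
  "serverVersion": "v1.1.0.29-1",
  "details": [
    {
      "key": "testSessionProcessing",
      "description": "The TestSession internal processing load status.",
      "data": {}
    }
  ]
}
""")

print(missing_status_paths(production_body))
```

For the production body this reports both the top-level key and the one inside the testSessionProcessing detail.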

Demo seems to be OK:

{
  "status": "Healthy",
  "serverVersion": "v1.1.0.29-1",
  "details": [
    {
      "key": "testSessionProcessing",
      "status": "Healthy",
      "description": "The TestSession internal processing load status.",
      "data": {
        "healthStatusDefinitions": {
          "Healthy": "Oldest pending TestSession is < 1 hours old.",
          "Degraded": "Oldest pending TestSession is > 1 hours old.",
          "Unhealthy": "Oldest pending TestSession is > 4 hours old."
        }
      }
    }
  ]
}

livebe01 commented 1 year ago

Hi @AlexThurston, this seems to be an artifact of something going wonky on our backend (see https://github.com/usnistgov/ACVP-Server/issues/270). Thanks for reporting this. We should have this resolved shortly.

AlexThurston commented 1 year ago

Great. Thanks for the update. It had been working for the past couple of days and just started failing again for me yesterday afternoon. Production was also reporting degraded at the time, so I wondered if they were related.

livebe01 commented 1 year ago

Everything should be back online and fully functioning now. Thanks again.

AlexThurston commented 1 year ago

Seems like it's still giving a 503 from production:

unexpected status code 503 != 200: {
  "serverVersion": "v1.1.0.29-1",
  "details": [
    {
      "key": "testSessionProcessing",
      "description": "The TestSession internal processing load status.",
      "data": {
        "healthStatusDefinitions": {
          "Healthy": "Oldest pending TestSession is < 1 hours old.",
          "Degraded": "Oldest pending TestSession is > 1 hours old.",
          "Unhealthy": "Oldest pending TestSession is > 4 hours old."
        }
      }
    }
  ]
}

AlexThurston commented 1 year ago

Bahahah! Never mind. I just tried it again and it's working.

AlexThurston commented 1 year ago

It appears as though this is happening again. 503 on the health route on production.

Hmm. Commenting on this doesn't re-open.

jbrock24 commented 1 year ago

Prod was just restarted, about 30 minutes ago. Please let me know if it's still not working for you.

livebe01 commented 1 year ago

Thanks @AlexThurston. Prod should be back and running again now.

AlexThurston commented 1 year ago

Still the same behaviour. 503s. The response does still have the body, but it's missing the status key.

Not sure if it's related, or a different thing, but demo is reporting degraded as well. However, the call is succeeding with a 200 in that case.
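To tell these two cases apart in a monitoring script, the HTTP status code and the body can be checked together. This is a rough sketch under assumptions: `classify_health` and its return strings are hypothetical, and it assumes the body's status value is one of the names listed in healthStatusDefinitions (Healthy, Degraded, Unhealthy).

```python
import json


def classify_health(status_code, body_text):
    """Summarize a health response from its HTTP code and JSON body.

    Hypothetical helper: it does no network I/O, so the caller passes in
    whatever code and body its HTTP client received.
    """
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        payload = None
    if not isinstance(payload, dict):
        payload = {}

    reported = payload.get("status")  # None when the key is missing
    if status_code == 200 and reported == "Healthy":
        return "ok"
    if status_code == 200 and reported is not None:
        return f"reachable, reporting {reported}"
    if reported is None:
        return f"HTTP {status_code}, no status reported"
    return f"HTTP {status_code}, reporting {reported}"


# Production case from this thread: 503 with a body that lacks "status".
print(classify_health(503, '{"serverVersion": "v1.1.0.29-1", "details": []}'))
# Demo case from this thread: 200 while reporting degraded.
print(classify_health(200, '{"status": "Degraded"}'))
```

Keeping the two signals separate lets a client treat "reachable but degraded" differently from "503 with no status", which is the distinction described above.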

jbrock24 commented 1 year ago

Sorry about the issues! Demo is currently under load from a bunch of LMS submissions; hopefully that will be cleared up soon. The issue with Prod is being looked into.

jbrock24 commented 1 year ago

So, it appears we don't have enough LMS Pool values stored. We're currently looking into ways to handle this better. Thanks for the feedback!

jarnold01 commented 1 year ago

The processing issues have been resolved in both the Demo and Prod environments. The root causes were different, and the timing was unfortunate. Everything should be operating normally again. Appreciate you commenting with your observations as well, @AlexThurston; thanks.

AlexThurston commented 9 months ago

This appears to be happening again on Production. 503s from the health route. Prod still seems to be responding to other actions. This seems to happen each time the service deployment is updated.

livebe01 commented 9 months ago

Thanks @AlexThurston. We saw this on our end as well and are working on it.

livebe01 commented 9 months ago

We think this specific instance of this issue is tied to some older hardware we're running on... should be mitigated by our pending/upcoming Prod migration.