golemcloud / golem

Golem is an open source durable computing platform that makes it easy to build and deploy highly reliable distributed systems.
https://learn.golem.cloud/
Apache License 2.0

Investigate execution hanging in certain situations for worker invocation after component updates #1017

Open afsalthaj opened 1 month ago

afsalthaj commented 1 month ago

Here is what I found (I am still investigating what is going on):

  1. I had a shopping-cart.wasm, which I added using golem-cli.
  2. Added a worker x.
  3. Tried to invoke a checkout function with that worker to see if it works, and it works.
  4. Tried to get the metadata, and it works.
  5. Updated the component again and bumped up the version a couple of times.
  6. Then I tried to update the worker with various component versions.

Here is the life cycle:

golem-cli worker get --worker-name version3 --component-name shopping-cart                                         Mon 21 Oct 10:56:52 2024
{
  "workerUrn": "urn:worker:1c57afb2-41eb-4cc7-b6c1-034590a158bb/version3",
  "args": [],
  "env": {},
  "status": "Idle",
  "componentVersion": 4,
  "retryCount": 0,
  "pendingInvocationCount": 1,
  "updates": [
    {
      "type": "successfulUpdate",
      "timestamp": "2024-10-20T23:42:04.333Z",
      "targetVersion": 4
    },
    {
      "type": "pendingUpdate",
      "timestamp": "2024-10-20T23:42:11.421Z",
      "targetVersion": 5
    },
    {
      "type": "pendingUpdate",
      "timestamp": "2024-10-20T23:44:44.454Z",
      "targetVersion": 5
    },
    {
      "type": "pendingUpdate",
      "timestamp": "2024-10-20T23:45:18.786Z",
      "targetVersion": 5
    },
    {
      "type": "pendingUpdate",
      "timestamp": "2024-10-20T23:45:22.046Z",
      "targetVersion": 5
    },
    {
      "type": "pendingUpdate",
      "timestamp": "2024-10-20T23:45:26.145Z",
      "targetVersion": 3
    },
    {
      "type": "pendingUpdate",
      "timestamp": "2024-10-20T23:45:37.467Z",
      "targetVersion": 3
    },
    {
      "type": "pendingUpdate",
      "timestamp": "2024-10-20T23:45:52.119Z",
      "targetVersion": 5
    }
  ],
  "createdAt": "2024-10-20T23:33:16.008Z",
  "lastError": null,
  "componentSize": 115312,
  "totalLinearMemorySize": 1114112,
  "ownedResources": {}
}
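
To make the state above easier to read at a glance, here is a minimal sketch that summarizes the `updates` array of a worker metadata document. It only assumes the JSON shape shown above; the function name `summarize_updates`, the `looksStuck` field, and the trimmed `SAMPLE` are hypothetical and not part of golem-cli. The "looks stuck" rule (an `Idle` worker that still has pending updates) is a heuristic, not a definition used by Golem.

```python
import json
from collections import Counter

# Trimmed, illustrative copy of the metadata above (only three update records).
SAMPLE = """
{
  "status": "Idle",
  "componentVersion": 4,
  "pendingInvocationCount": 1,
  "updates": [
    {"type": "successfulUpdate", "timestamp": "2024-10-20T23:42:04.333Z", "targetVersion": 4},
    {"type": "pendingUpdate", "timestamp": "2024-10-20T23:42:11.421Z", "targetVersion": 5},
    {"type": "pendingUpdate", "timestamp": "2024-10-20T23:45:26.145Z", "targetVersion": 3}
  ]
}
"""


def summarize_updates(metadata_json: str) -> dict:
    """Count update records by type and list the targets still pending."""
    meta = json.loads(metadata_json)
    updates = meta.get("updates", [])
    by_type = Counter(u["type"] for u in updates)
    pending = by_type.get("pendingUpdate", 0)
    return {
        "status": meta["status"],
        "componentVersion": meta["componentVersion"],
        "pendingUpdates": pending,
        "successfulUpdates": by_type.get("successfulUpdate", 0),
        "pendingTargets": [
            u["targetVersion"] for u in updates if u["type"] == "pendingUpdate"
        ],
        # Heuristic (assumption): an Idle worker should not keep updates pending.
        "looksStuck": meta["status"] == "Idle" and pending > 0,
    }


summary = summarize_updates(SAMPLE)
print(summary)
```

Applied to the full output above, the same function would count seven `pendingUpdate` records (targets 5, 5, 5, 5, 3, 3, 5) alongside one `successfulUpdate`. A single snapshot cannot prove that the worker is stuck; comparing two snapshots taken some time apart would be more telling.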

Then I tried to invoke the checkout function again, and it started producing gateway errors (probably timeouts):

golem-cli worker invoke-and-await --worker-name version3 --component-name shopping-cart --function golem:it/api.{checkout}

Unexpected http error. Code: 504, content: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.27.2</center>
</body>
</html>

afsalthaj commented 1 month ago

I was investigating whether this is something I could fix myself (and thereby learn more about what's going on), but it looks like it needs dedicated time, so I am raising this ticket.

  1. Could anyone remind me what happens (or should happen) if I update the worker with a component that doesn't exist?
  2. Likewise, what happens if I lower the version of the component for an existing worker?
  3. Is it ok, that we have pending state forever for component updates as shown in the example in the PR description?

I ask because all three of these happened before subsequent invocation requests started hanging.

afsalthaj commented 1 month ago

I can confirm this happened again after component updates. But I think further investigation of exactly when it hangs should be part of solving the ticket.

vigoo commented 1 month ago

  1. Could anyone remind me what happens (or should happen) if I update the worker with a component that doesn't exist?

This should be a failed update attempt and the worker should continue running on the previous version, but we have no tests for this. I suspect this is the primary problem here.
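
As a starting point for such a test, here is a minimal sketch of the expected behavior as a plain Python model. It is not Golem code: the `Worker` class, the `attempt_update` function, and the `failedUpdate` record type are assumptions made for illustration, and the set of available versions is passed in by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Toy stand-in for a worker: its current version plus an update history."""
    component_version: int
    updates: list = field(default_factory=list)


def attempt_update(worker: Worker, target_version: int, available_versions: set) -> bool:
    """Apply an update if the target exists; otherwise record a failure.

    On failure the worker keeps running on its previous version.
    """
    if target_version not in available_versions:
        # "failedUpdate" is an assumed record type name for this sketch.
        worker.updates.append({"type": "failedUpdate", "targetVersion": target_version})
        return False
    worker.component_version = target_version
    worker.updates.append({"type": "successfulUpdate", "targetVersion": target_version})
    return True


# Version 99 is a hypothetical version that was never uploaded.
worker = Worker(component_version=4)
updated = attempt_update(worker, 99, available_versions={0, 1, 2, 3, 4})
print(updated, worker.component_version, worker.updates)
```

The property a real test would need to check is the same one asserted by this model: after the failed attempt the worker still reports its previous version, and the attempt is recorded as failed rather than left pending.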

  2. Likewise, what happens if I lower the version of the component for an existing worker?

That should work, but we could have a dedicated test for it.

  3. Is it ok, that we have pending state forever for component updates as shown in the example in the PR description?

No. If the worker's invocation loop works properly, it consumes the pending invocations and attempts to perform them; each one then becomes either successful or failed.
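
The intended behavior of that loop can be sketched as a small model: every queued update is taken off the queue and ends up as either successful or failed, so nothing stays pending. This is an illustration only, not the actual implementation; `drain_pending`, the `failedUpdate` label, and the choice of which versions exist are all assumptions.

```python
from collections import deque


def drain_pending(current_version: int, pending_targets: list, available_versions: set):
    """Process queued update targets in order; return the final version and history."""
    queue = deque(pending_targets)
    history = []
    while queue:
        target = queue.popleft()
        if target in available_versions:
            # Covers upgrades and downgrades alike.
            current_version = target
            history.append(("successfulUpdate", target))
        else:
            # The worker stays on its current version.
            history.append(("failedUpdate", target))
    return current_version, history


# Pending targets in the order reported in this issue. That version 5 does not
# exist is a hypothetical choice for this example, not a finding.
final_version, history = drain_pending(
    current_version=4,
    pending_targets=[5, 5, 5, 5, 3, 3, 5],
    available_versions={0, 1, 2, 3, 4},
)
print(final_version)
for record in history:
    print(record)
```

Under those assumptions the queue ends up empty with seven resolved records, and the worker finishes on version 3 after the downgrade. The state reported in this issue differs from the model in exactly that respect: the records remain pending.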