Closed: sentry-io[bot] closed this issue 3 days ago.
Yesterday, we experienced one case of this error. Case: `29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e`.
It got triggered three times when trying to update the runner's FS.
```log
2024-08-15T13:46:55.711405 level=debug msg="Handle Allocation Event" ClientStatus=pending DesiredStatus=run NextAllocation= PrevAllocation= alloc_id=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 package=nomad
2024-08-15T13:46:56.850970 level=debug msg="Handle Allocation Event" ClientStatus=running DesiredStatus=run NextAllocation= PrevAllocation= alloc_id=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 package=nomad
2024-08-15T13:46:56.851075 level=debug msg="Runner started" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e startupDuration=1.139349333s
2024-08-15T13:57:56.884337 level=debug msg="Ignoring duplicate event" allocID=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 package=nomad
2024-08-15T13:57:56.967535 level=debug code=204 duration=140.750802ms method=PATCH path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/files user_agent="Faraday v2.10.1"
2024-08-15T13:57:56.997576 level=debug code=200 duration="109.708µs" method=POST path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/execute user_agent="Faraday v2.10.1"
2024-08-15T13:57:57.014629 level=info msg="Running execution" environment_id=29 executionID=e5447508-fa80-46ca-af8e-beb8435aa858 package=api runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T13:57:57.311864 level=info msg="Execution returned" environment_id=29 package=api runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T13:57:57.312007 level=debug code=200 duration=297.366015ms method=GET path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/websocket user_agent=
2024-08-15T13:57:57.470215 level=debug code=200 duration=154.329755ms method=GET path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/files user_agent="Faraday v2.10.1"
2024-08-15T14:00:23.730018 level=debug msg="Ignoring duplicate event" allocID=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 package=nomad
2024-08-15T14:00:23.784230 level=debug msg="Handle Allocation Event" ClientStatus=pending DesiredStatus=run NextAllocation= PrevAllocation=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 alloc_id=f6be268b-a5be-5080-79be-d401f6578e94 package=nomad
2024-08-15T14:00:23.784344 level=debug msg="Runner stopped" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T14:00:23.784395 level=debug msg="Destroying Runner" destroy_reason="the allocation was rescheduled: the destruction should not cause external changes" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T14:00:23.784446 level=debug msg="Runner destroyed locally" destroy_reason="the allocation was rescheduled: the destruction should not cause external changes" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T14:00:23.784736 level=debug msg="Ignoring unknown allocation" allocID=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 package=nomad
2024-08-15T14:00:24.379434 level=debug msg="Ignoring unknown allocation" allocID=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 package=nomad
2024-08-15T14:00:24.861028 level=debug msg="Ignoring duplicate event" allocID=f6be268b-a5be-5080-79be-d401f6578e94 package=nomad
2024-08-15T14:00:25.130850 level=debug msg="Handle Allocation Event" ClientStatus=running DesiredStatus=run NextAllocation= PrevAllocation=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 alloc_id=f6be268b-a5be-5080-79be-d401f6578e94 package=nomad
2024-08-15T14:00:25.130931 level=debug msg="Runner started" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e startupDuration=1.345555751s
2024-08-15T14:00:33.514315 level=debug code=410 duration="127.303µs" method=PATCH path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/files user_agent="Faraday v2.10.1"
2024-08-15T14:00:45.762161 level=debug msg="Ignoring duplicate event" allocID=f6be268b-a5be-5080-79be-d401f6578e94 package=nomad
2024-08-15T14:00:45.934808 level=debug msg="Ignoring unknown allocation" allocID=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 package=nomad
2024-08-15T14:00:45.935184 level=debug msg="Handle Allocation Event" ClientStatus=pending DesiredStatus=run NextAllocation= PrevAllocation=f6be268b-a5be-5080-79be-d401f6578e94 alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 package=nomad
2024-08-15T14:00:45.935234 level=debug msg="Runner stopped" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T14:00:45.935267 level=debug msg="Destroying Runner" destroy_reason="the allocation was rescheduled: the destruction should not cause external changes" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T14:00:45.935298 level=debug msg="Runner destroyed locally" destroy_reason="the allocation was rescheduled: the destruction should not cause external changes" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T14:00:45.935968 level=debug msg="Ignoring unknown allocation" allocID=f6be268b-a5be-5080-79be-d401f6578e94 package=nomad
2024-08-15T14:00:46.164767 level=debug msg="Ignoring unknown allocation" allocID=f6be268b-a5be-5080-79be-d401f6578e94 package=nomad
2024-08-15T14:00:47.220762 level=debug msg="Ignoring unknown allocation" allocID=f6be268b-a5be-5080-79be-d401f6578e94 package=nomad
2024-08-15T14:00:47.831100 level=debug msg="Ignoring duplicate event" allocID=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 package=nomad
2024-08-15T14:00:48.395508 level=debug msg="Handle Allocation Event" ClientStatus=running DesiredStatus=run NextAllocation= PrevAllocation=f6be268b-a5be-5080-79be-d401f6578e94 alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 package=nomad
2024-08-15T14:00:48.395559 level=debug msg="Runner started" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e startupDuration=2.460234947s
2024-08-15T15:54:31.193740 level=warning msg="No allocation found while updateFileSystem" environment_id=29 error="communication with executor failed: nomad error during file copy: error executing command in job 29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e: no allocation found" package=api runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T15:54:31.194002 level=debug code=500 duration=2.270078ms method=PATCH path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/files user_agent="Faraday v2.10.1"
2024-08-15T15:55:18.038129 level=warning msg="No allocation found while updateFileSystem" environment_id=29 error="communication with executor failed: nomad error during file copy: error executing command in job 29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e: no allocation found" package=api runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T15:55:18.038251 level=debug code=500 duration=2.710646ms method=PATCH path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/files user_agent="Faraday v2.10.1"
2024-08-15T15:55:26.068262 level=warning msg="No allocation found while updateFileSystem" environment_id=29 error="communication with executor failed: nomad error during file copy: error executing command in job 29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e: no allocation found" package=api runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T15:55:26.068704 level=debug code=500 duration=2.745975ms method=PATCH path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/files user_agent="Faraday v2.10.1"
2024-08-15T15:58:26.066865 level=debug msg="Destroying Runner" destroy_reason="runner inactivity timeout exceeded" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T15:58:26.068416 level=info msg="Returning runner due to inactivity timeout" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T16:38:53.335848 level=debug code=410 duration="164.613µs" method=PATCH path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/files user_agent="Faraday v2.10.1"
```
What is interesting about the Poseidon logs is that the runner got rescheduled twice. The log entries appear normal until the `No allocation found while updateFileSystem` errors at the end.
The event is preceded by two deployments, one at 12:37:18 UTC and the other at 14:43:55. Poseidon was restarted as part of the first deployment, at 12:38:33.
Unfortunately, the deployment caused an outage of our InfluxDB. From 12:48:21 to 14:01:57, Poseidon logged:

```
When writing to [http://our-domain/api/v2/write]: Post "http://our-domain/api/v2/write?bucket=telegraf&org=codeocean": dial tcp [ipv6]:port: connect: no route to host
```
This is exactly the timeframe we would need to check whether Poseidon missed some Nomad events. Now, InfluxDB contained only the information that Nomad emitted a `JobDeregistered` event at 14:04:13.
We have not captured the corresponding `Allocation` event. This could either mean that the event occurred but was not recorded by our monitoring, or that it was never sent and Poseidon therefore never received it. The second case is more plausible because it explains Poseidon's behavior: Poseidon does not destroy the Runner but thinks it still exists.
When checking the Nomad agent logs, it seems that a deployment was running at 14:00:00. Is that right?
```logs
2024-08-15T14:00:12.812378+00:00 nomad-agent-terraform-1 systemd[1]: Starting nomad.service - Nomad...
2024-08-15T14:04:12.654964+00:00 nomad-agent-terraform-1 nomad[485343]: 2024-08-15T14:04:12.654Z [INFO] client.gc: marking allocation for GC: alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8
2024-08-15T14:04:12.655364+00:00 nomad-agent-terraform-1 nomad[485343]: 2024-08-15T14:04:12.654Z [INFO] client.gc: garbage collecting allocation: alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 reason="forced collection"
2024-08-15T14:04:12.656932+00:00 nomad-agent-terraform-1 nomad[485343]: 2024-08-15T14:04:12.656Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 task=default-task type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
2024-08-15T14:04:12.730014+00:00 nomad-agent-terraform-1 nomad[485343]: 2024-08-15T14:04:12.729Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 task=default-task type=Terminated msg="Exit Code: 137, Exit Message: \"Docker container exited with non-zero exit code: 137\"" failed=false
2024-08-15T14:04:12.744249+00:00 nomad-agent-terraform-1 nomad[485343]: 2024-08-15T14:04:12.744Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 task=default-task type=Killed msg="Task successfully killed" failed=false
2024-08-15T14:04:12.753292+00:00 nomad-agent-terraform-1 nomad[485343]: 2024-08-15T14:04:12.753Z [INFO] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 task=default-task plugin=/usr/bin/nomad id=486517
```
For me, it is not evident why Nomad killed this Allocation and Job. This case might support the considerations of #597 to use the `topic: Job` events, and of #612 to start another runner when we are informed about an unexpectedly stopped runner.
Thanks for digging into this issue and sorry for the inconvenience with the monitoring data. Let me explain what happened regarding the monitoring:
- At 12:37:18 UTC, I deployed the first time to include the changes from #647.
- At 12:48:02 UTC, I stopped and deleted the monitoring instance, extended the volume, and started a new one.
- `cloud-init status --wait` returned `status: done` with exit code 2 (which is unusual and never happened before).
- … `cloud-init` command. This deployment was started around 13:54 UTC and subsequently restored operation on the monitoring system. This deployment included all servers, i.e., also Poseidon. Since no code changes were present, Poseidon wasn't restarted.
- Another deployment followed at 14:43:55.

> When checking the Nomad agent logs, it seems that a deployment was running at 14:00:00. Is that right?

Yes, this was the local deployment I triggered (and forgot 🙈).
While I could provide some further information on the timeline, I don't have any clue about the `Allocation` event you're looking for. Do you think this error could simply be related to our deployment?
Oh, wow, thanks for handling all the operations work here!
> Do you think this error could simply be related to our deployment?

Yes. However, our aspiration is to have error-free deployments 🤷. Maybe we just skip this occurrence and handle the next one, where we might have more monitoring data?
> Maybe we just skip this occurrence and handle the next one where we might have more monitoring data?

Okay, let's skip this occurrence for now. Let's make sure to redeploy more often (during daytime) when merging the following PRs, so that we increase the likelihood of seeing this issue again.
We didn't notice any new occurrence, closing.
Just a few seconds ago, the issue reoccurred. Most likely, it was triggered by me, since I synchronized all environments in CodeOcean after rebuilding the environments for openHPI/dockerfiles#37.
Hence, I am wondering: Is this behavior "expected" or can we improve the situation a little?
Great that we've got another occurrence to observe.
We have three users/runners causing the 19 errors when trying to update the FS:
- `14-9d697998-6b73-11ef-beaf-fa163efe023e`
  - From: Sep 5, 2024, 12:38:30 PM UTC
  - Till: Sep 5, 2024, 12:40:54 PM UTC
- `29-b5ae0b9e-6b81-11ef-beaf-fa163efe023e`
  - From: Sep 5, 2024, 12:38:08 PM UTC
  - Till: Sep 5, 2024, 12:38:33 PM UTC
- `29-b68e51b8-6b81-11ef-beaf-fa163efe023e`
  - From: Sep 5, 2024, 12:37:27 PM UTC
  - Till: Sep 5, 2024, 12:43:46 PM UTC
First, we see that the users tried for multiple minutes to run their execution, always failing. It would be better if CodeOcean requested a fresh runner when it (repeatedly) receives an Internal Server Error while copying files.
Regarding the Nomad Events, the three runners behave the same.
```
,,0,2024-09-05T12:30:00Z,2024-09-05T13:00:00Z,2024-09-05T12:33:23.485839047Z,map[Job:map[Affinities:
```
Only a `Job` event of type `JobDeregistered` is sent, telling Poseidon that the Job stopped. But currently, Poseidon does not handle `Job` events (it just dumps them to InfluxDB).
The `Allocation` event with `DesiredStatus: stop` that Poseidon would handle was not sent. Therefore, Poseidon continued to assume that the Job still exists.
We might start listening to the `Job`/`JobDeregistered` events and also stop runners based on them (note: we should not remove the same runner twice).
Thanks for looking into this issue already; this really helps to track down potential issues. To add more context, according to CodeOcean logs:

- The `ExecutionEnvironmentsController#sync_all_to_runner_management` action was triggered at 2024-09-05T12:37:16.955336.
- … at 2024-09-05T12:37:24.877251.

I've also verified that CodeOcean and Poseidon use the same time base (at least up to, and including, the seconds). Therefore, just to clarify: the timings you provided for the three affected runners are timestamps when the issue occurred (i.e., learners posting files to a non-existent runner), right?
> First, we see that the users tried for multiple minutes to run their execution, always failing.

Ah, yes (see my previous comment).
> Better if CodeOcean uses a fresh runner when it receives (multiple times) an Internal Server Error when copying files.

We handle the case where the runner is non-existent and properly reported by Poseidon through a 410 error:

https://github.com/openHPI/codeocean/blob/6a0c4976baf24b02e659145e912c494fe05b6557/lib/runner/strategy/poseidon.rb#L291-L292 https://github.com/openHPI/codeocean/blob/c4e819df46a220dd6f59158cfde86a8779deffb9/app/models/runner.rb#L46-L55

Other status codes (or an Internal Server Error) currently do not cause a request for a new runner. Do you feel we should catch more errors in CodeOcean and/or handle the error better in Poseidon to return a proper 410 response? Both might make sense, I'd say 🙂. My proposal is presented at https://github.com/openHPI/codeocean/pull/2511, but I would still suggest fixing the Poseidon error, too.
> The timings you provided for the three runners affected are timestamps when the issue occurred (i.e. learners posting files to a non-existent runner), right?

Yes.
> Both might make sense, I'd say 🙂

I agree. Thanks already for your proposal. In #682 you can find an approach for `JobDeregistered` handling.
Sentry Issue: POSEIDON-G