openHPI / poseidon

Scalable task execution orchestrator for CodeOcean

No allocation found while updateFileSystem #649

Closed · sentry-io[bot] closed this issue 3 days ago

sentry-io[bot] commented 3 weeks ago

Sentry Issue: POSEIDON-G

No allocation found while updateFileSystem
mpass99 commented 3 weeks ago

Yesterday, we experienced one case of this error (case 29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e). The error was triggered three times while trying to update the runner's file system.

Poseidon Logs

```log
2024-08-15T13:46:55.711405 level=debug msg="Handle Allocation Event" ClientStatus=pending DesiredStatus=run NextAllocation= PrevAllocation= alloc_id=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 package=nomad
2024-08-15T13:46:56.850970 level=debug msg="Handle Allocation Event" ClientStatus=running DesiredStatus=run NextAllocation= PrevAllocation= alloc_id=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 package=nomad
2024-08-15T13:46:56.851075 level=debug msg="Runner started" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e startupDuration=1.139349333s
2024-08-15T13:57:56.884337 level=debug msg="Ignoring duplicate event" allocID=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 package=nomad
2024-08-15T13:57:56.967535 level=debug code=204 duration=140.750802ms method=PATCH path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/files user_agent="Faraday v2.10.1"
2024-08-15T13:57:56.997576 level=debug code=200 duration="109.708µs" method=POST path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/execute user_agent="Faraday v2.10.1"
2024-08-15T13:57:57.014629 level=info msg="Running execution" environment_id=29 executionID=e5447508-fa80-46ca-af8e-beb8435aa858 package=api runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T13:57:57.311864 level=info msg="Execution returned" environment_id=29 package=api runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T13:57:57.312007 level=debug code=200 duration=297.366015ms method=GET path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/websocket user_agent=
2024-08-15T13:57:57.470215 level=debug code=200 duration=154.329755ms method=GET path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/files user_agent="Faraday v2.10.1"
2024-08-15T14:00:23.730018 level=debug msg="Ignoring duplicate event" allocID=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 package=nomad
2024-08-15T14:00:23.784230 level=debug msg="Handle Allocation Event" ClientStatus=pending DesiredStatus=run NextAllocation= PrevAllocation=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 alloc_id=f6be268b-a5be-5080-79be-d401f6578e94 package=nomad
2024-08-15T14:00:23.784344 level=debug msg="Runner stopped" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T14:00:23.784395 level=debug msg="Destroying Runner" destroy_reason="the allocation was rescheduled: the destruction should not cause external changes" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T14:00:23.784446 level=debug msg="Runner destroyed locally" destroy_reason="the allocation was rescheduled: the destruction should not cause external changes" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T14:00:23.784736 level=debug msg="Ignoring unknown allocation" allocID=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 package=nomad
2024-08-15T14:00:24.379434 level=debug msg="Ignoring unknown allocation" allocID=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 package=nomad
2024-08-15T14:00:24.861028 level=debug msg="Ignoring duplicate event" allocID=f6be268b-a5be-5080-79be-d401f6578e94 package=nomad
2024-08-15T14:00:25.130850 level=debug msg="Handle Allocation Event" ClientStatus=running DesiredStatus=run NextAllocation= PrevAllocation=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 alloc_id=f6be268b-a5be-5080-79be-d401f6578e94 package=nomad
2024-08-15T14:00:25.130931 level=debug msg="Runner started" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e startupDuration=1.345555751s
2024-08-15T14:00:33.514315 level=debug code=410 duration="127.303µs" method=PATCH path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/files user_agent="Faraday v2.10.1"
2024-08-15T14:00:45.762161 level=debug msg="Ignoring duplicate event" allocID=f6be268b-a5be-5080-79be-d401f6578e94 package=nomad
2024-08-15T14:00:45.934808 level=debug msg="Ignoring unknown allocation" allocID=5b7486d0-88ba-df2b-5d2b-3b62711f13c4 package=nomad
2024-08-15T14:00:45.935184 level=debug msg="Handle Allocation Event" ClientStatus=pending DesiredStatus=run NextAllocation= PrevAllocation=f6be268b-a5be-5080-79be-d401f6578e94 alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 package=nomad
2024-08-15T14:00:45.935234 level=debug msg="Runner stopped" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T14:00:45.935267 level=debug msg="Destroying Runner" destroy_reason="the allocation was rescheduled: the destruction should not cause external changes" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T14:00:45.935298 level=debug msg="Runner destroyed locally" destroy_reason="the allocation was rescheduled: the destruction should not cause external changes" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T14:00:45.935968 level=debug msg="Ignoring unknown allocation" allocID=f6be268b-a5be-5080-79be-d401f6578e94 package=nomad
2024-08-15T14:00:46.164767 level=debug msg="Ignoring unknown allocation" allocID=f6be268b-a5be-5080-79be-d401f6578e94 package=nomad
2024-08-15T14:00:47.220762 level=debug msg="Ignoring unknown allocation" allocID=f6be268b-a5be-5080-79be-d401f6578e94 package=nomad
2024-08-15T14:00:47.831100 level=debug msg="Ignoring duplicate event" allocID=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 package=nomad
2024-08-15T14:00:48.395508 level=debug msg="Handle Allocation Event" ClientStatus=running DesiredStatus=run NextAllocation= PrevAllocation=f6be268b-a5be-5080-79be-d401f6578e94 alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 package=nomad
2024-08-15T14:00:48.395559 level=debug msg="Runner started" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e startupDuration=2.460234947s
2024-08-15T15:54:31.193740 level=warning msg="No allocation found while updateFileSystem" environment_id=29 error="communication with executor failed: nomad error during file copy: error executing command in job 29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e: no allocation found" package=api runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T15:54:31.194002 level=debug code=500 duration=2.270078ms method=PATCH path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/files user_agent="Faraday v2.10.1"
2024-08-15T15:55:18.038129 level=warning msg="No allocation found while updateFileSystem" environment_id=29 error="communication with executor failed: nomad error during file copy: error executing command in job 29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e: no allocation found" package=api runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T15:55:18.038251 level=debug code=500 duration=2.710646ms method=PATCH path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/files user_agent="Faraday v2.10.1"
2024-08-15T15:55:26.068262 level=warning msg="No allocation found while updateFileSystem" environment_id=29 error="communication with executor failed: nomad error during file copy: error executing command in job 29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e: no allocation found" package=api runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T15:55:26.068704 level=debug code=500 duration=2.745975ms method=PATCH path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/files user_agent="Faraday v2.10.1"
2024-08-15T15:58:26.066865 level=debug msg="Destroying Runner" destroy_reason="runner inactivity timeout exceeded" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T15:58:26.068416 level=info msg="Returning runner due to inactivity timeout" package=runner runner_id=29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e
2024-08-15T16:38:53.335848 level=debug code=410 duration="164.613µs" method=PATCH path=/api/v1/runners/29-d5c6c8f5-5b0c-11ef-863e-fa163efe023e/files user_agent="Faraday v2.10.1"
```

What is interesting about the Poseidon logs is that the runner was rescheduled twice. The log entries appear normal until the "No allocation found while updateFileSystem" errors at the end.

The event is preceded by two deployments, one at 12:37:18 UTC and the other at 14:43:55 UTC. Poseidon was restarted only during the first deployment, at 12:38:33.

Unfortunately, the deployment caused an outage of our InfluxDB. From 12:48:21 to 14:01:57 Poseidon logged:

When writing to [http://our-domain/api/v2/write]: Post "http://our-domain/api/v2/write?bucket=telegraf&org=codeocean": dial tcp [ipv6]:port: connect: no route to host

This is exactly the timeframe in which we would need to check whether Poseidon missed some Nomad events. Now, InfluxDB contained only the information that Nomad emitted a JobDeregistered event at 14:04:13. We have not captured the corresponding Allocation event. This could either mean that

  1. Nomad sent the Allocation event, but our monitoring missed it because of the InfluxDB outage, or
  2. Nomad never sent the corresponding Allocation event.

The second case is more plausible because it explains Poseidon's behavior: without the Allocation event, Poseidon does not destroy the runner and keeps thinking that it still exists.

When checking the Nomad Agent Logs, it seems that a deployment was running at 14:00:00. Is that right?

Nomad Agent 1 Logs

```logs
2024-08-15T14:00:12.812378+00:00 nomad-agent-terraform-1 systemd[1]: Starting nomad.service - Nomad...
2024-08-15T14:04:12.654964+00:00 nomad-agent-terraform-1 nomad[485343]: 2024-08-15T14:04:12.654Z [INFO] client.gc: marking allocation for GC: alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8
2024-08-15T14:04:12.655364+00:00 nomad-agent-terraform-1 nomad[485343]: 2024-08-15T14:04:12.654Z [INFO] client.gc: garbage collecting allocation: alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 reason="forced collection"
2024-08-15T14:04:12.656932+00:00 nomad-agent-terraform-1 nomad[485343]: 2024-08-15T14:04:12.656Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 task=default-task type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
2024-08-15T14:04:12.730014+00:00 nomad-agent-terraform-1 nomad[485343]: 2024-08-15T14:04:12.729Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 task=default-task type=Terminated msg="Exit Code: 137, Exit Message: \"Docker container exited with non-zero exit code: 137\"" failed=false
2024-08-15T14:04:12.744249+00:00 nomad-agent-terraform-1 nomad[485343]: 2024-08-15T14:04:12.744Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 task=default-task type=Killed msg="Task successfully killed" failed=false
2024-08-15T14:04:12.753292+00:00 nomad-agent-terraform-1 nomad[485343]: 2024-08-15T14:04:12.753Z [INFO] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=9901b9ec-b4ef-3f00-16ab-c2e7ceecc3d8 task=default-task plugin=/usr/bin/nomad id=486517
```

It is not evident to me why Nomad killed this Allocation and Job. This case might support the considerations of #597 (subscribing to the Job event topic) and of #612 (starting another runner when we are informed about an unexpectedly stopped runner).
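For illustration, subscribing to the Job topic in addition to the Allocation topic, as #597 considers, could look roughly like the following sketch using the official Nomad Go API client. The event loop, namespace, and names here are assumptions for the sketch, not Poseidon's actual code.

```go
// Sketch: consuming Allocation *and* Job events from the Nomad event stream.
// Assumes github.com/hashicorp/nomad/api; names are illustrative only.
package main

import (
	"context"
	"log"

	nomadApi "github.com/hashicorp/nomad/api"
)

func watchEvents(ctx context.Context, client *nomadApi.Client) error {
	topics := map[nomadApi.Topic][]string{
		nomadApi.TopicAllocation: {"*"},
		nomadApi.TopicJob:        {"*"}, // the additional topic discussed in #597
	}
	stream, err := client.EventStream().Stream(ctx, topics, 0,
		&nomadApi.QueryOptions{Namespace: "poseidon"})
	if err != nil {
		return err
	}
	for events := range stream {
		if events.Err != nil {
			return events.Err
		}
		for _, event := range events.Events {
			switch {
			case event.Topic == nomadApi.TopicAllocation:
				// Handled today: the runner lifecycle is derived from Allocation events.
			case event.Topic == nomadApi.TopicJob && event.Type == "JobDeregistered":
				// Missing today: jobs can vanish (e.g. via forced GC) without a
				// usable Allocation event, as observed in this issue.
				log.Printf("job deregistered: %s", event.Key)
			}
		}
	}
	return nil
}

func main() {
	client, err := nomadApi.NewClient(nomadApi.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	if err := watchEvents(context.Background(), client); err != nil {
		log.Fatal(err)
	}
}
```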

MrSerth commented 3 weeks ago

Thanks for digging into this issue and sorry for the inconvenience with the monitoring data. Let me explain what happened regarding the monitoring:

  1. At 12:37:18 UTC, I deployed the first time to include the changes from #647.
  2. Shortly after that, I noticed a warning in Icinga about the disk size (80% used) of our monitoring host. Hence, I increased the volume from 150 GB to 200 GB.
  3. The deployment triggered at 12:48:02 UTC stopped and deleted the monitoring instance, extended the volume, and started a new one.
  4. However, that deployment later failed, since `cloud-init status --wait` returned `status: done` with exit code 2 (which is unusual and had never happened before).
  5. After investigating the issue, I started a local deployment and skipped the respective cloud-init command. This deployment was started around 13:54 UTC and subsequently restored operation of the monitoring system. It included all servers, i.e., also Poseidon. Since no code changes were present, Poseidon wasn't restarted.
  6. To get the GitLab CI status green again, I actually triggered a third deployment 🤓. This is the one triggered at 14:43:55.

> When checking the Nomad Agent Logs, it seems that a deployment was running at 14:00:00. Is that right?

Yes, this was the local deployment I triggered (and forgot 🙈).

While I might have been able to provide some further information on the timeline, I don't have any clue about the Allocation event you're looking for. Do you think this error could simply be related to our deployment?

mpass99 commented 3 weeks ago

Oh wow, thanks for handling all the operations work here!

> Do you think this error could simply be related to our deployment?

Yes. However, our aspiration is to have error-free deployments 🤷. Maybe we just skip this occurrence and handle the next one, where we might have more monitoring data?

MrSerth commented 3 weeks ago

> Maybe we just skip this occurrence and handle the next one, where we might have more monitoring data?

Okay, let's skip this occurrence for now. Let's make sure to redeploy more often (during the daytime) when merging the following PRs, so that we increase the likelihood of seeing this issue again.

MrSerth commented 1 week ago

We didn't notice any new occurrence; closing.

MrSerth commented 1 week ago

Just a few seconds ago, the issue reoccurred. Most likely, it was triggered by me, since I synchronized all environments in CodeOcean after rebuilding the environments for openHPI/dockerfiles#37.

Hence, I am wondering: Is this behavior "expected" or can we improve the situation a little?

mpass99 commented 1 week ago

Great that we've got another occurrence to observe.

We have three users/runners causing the 19 errors when trying to update the file system:

14-9d697998-6b73-11ef-beaf-fa163efe023e
- From: Sep 5, 2024 12:38:30 PM UTC
- Till: Sep 5, 2024 12:40:54 PM UTC
29-b5ae0b9e-6b81-11ef-beaf-fa163efe023e
- From: Sep 5, 2024 12:38:08 PM UTC
- Till: Sep 5, 2024 12:38:33 PM UTC
29-b68e51b8-6b81-11ef-beaf-fa163efe023e
- From: Sep 5, 2024 12:37:27 PM UTC
- Till: Sep 5, 2024 12:43:46 PM UTC

First, we see that the users tried for multiple minutes to run their execution, always failing. It would be better if CodeOcean requested a fresh runner when it receives an Internal Server Error (multiple times) while copying files.

Regarding the Nomad events, all three runners behaved the same.

Nomad Events

``` ,,0,2024-09-05T12:30:00Z,2024-09-05T13:00:00Z,2024-09-05T12:33:23.485839047Z,map[Job:map[Affinities: AllAtOnce:false Constraints: ConsulNamespace: ConsulToken: CreateIndex:1.495363e+06 Datacenters:[dc1] DispatchIdempotencyToken: Dispatched:false ID:29-b68e51b8-6b81-11ef-beaf-fa163efe023e JobModifyIndex:1.495469e+06 Meta: ModifyIndex:1.495469e+06 Multiregion: Name:29-b68e51b8-6b81-11ef-beaf-fa163efe023e Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob: ParentID: Payload: Periodic: Priority:50 Region:global Spreads: Stable:false Status:running StatusDescription: Stop:false SubmitTime:1.7255396025520184e+18 TaskGroups:[map[Affinities: Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 Disconnect: EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect: Meta: Migrate: Name:default-group Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:3 Delay:6e+10 DelayFunction:exponential Interval:2.16e+13 MaxDelay:2.4e+11 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:3.6e+12 Mode:fail RenderTemplates:false] Scaling: Services: ShutdownDelay: Spreads:[map[Attribute:${node.unique.name} SpreadTarget: Weight:100]] StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_python:3.8 network_mode:none] Constraints: Consul: DispatchPayload: Driver:docker Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:default-task Resources:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA: Networks: SecretsMB:0] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:3.6e+12 Mode:fail RenderTemplates:false] ScalingPolicies: Schedule: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:] map[Affinities: Constraints: Consul:map[Cluster:default Namespace: Partition:] Count:0 Disconnect: EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect: Meta:map[timeout:180 used:true] Migrate: Name:config Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling: Services: ShutdownDelay: Spreads: StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[command:true] Constraints: Consul: DispatchPayload: Driver:exec Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:config Resources:map[CPU:1 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA: Networks: SecretsMB:0] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Schedule: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:]] Type:batch UI: Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 MaxParallel:0 MinHealthyTime:0 
ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:1]],payload,poseidon_nomad_events,29-b68e51b8-6b81-11ef-beaf-fa163efe023e,production,12:33:22.561071557,Job,JobRegistered ,,0,2024-09-05T12:30:00Z,2024-09-05T13:00:00Z,2024-09-05T12:33:23.485839047Z,map[Allocation:map[AllocModifyIndex:1.495471e+06 AllocatedResources:map[Shared:map[DiskMB:10 Networks: Ports:] TaskLifecycles:map[default-task:] Tasks:map[default-task:map[Cpu:map[CpuShares:20 ReservedCores:] Devices: Memory:map[MemoryMB:30 MemoryMaxMB:256] Networks:]]] ClientDescription:Tasks are running ClientStatus:running CreateIndex:1.495365e+06 CreateTime:1.7255390330405724e+18 DesiredStatus:run EvalID:71046a52-5114-1c0d-b12c-f6aaf63308bc ID:2dfb35e1-3ae7-d68e-af35-18a2c05ebdf8 JobID:29-b68e51b8-6b81-11ef-beaf-fa163efe023e Metrics:map[AllocationTime:712275 ClassExhausted: ClassFiltered: CoalescedFailures:0 ConstraintFiltered: DimensionExhausted: NodesAvailable:map[dc1:4] NodesEvaluated:4 NodesExhausted:0 NodesFiltered:0 NodesInPool:4 QuotaExhausted: ResourcesExhausted: ScoreMetaData:[map[NodeID:47501b8a-44c5-4980-b2b9-aee0719764e9 NormScore:0.9608244562182952 Scores:map[binpack:0.9608244562182952 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:15fa2016-f691-b972-4469-6ede26812a64 NormScore:0.9608244562182952 Scores:map[binpack:0.9608244562182952 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:cb04341c-ea7d-5300-1a40-356801c6c1e8 NormScore:0.9608244562182952 Scores:map[binpack:0.9608244562182952 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:919ea902-e1f4-de7f-0743-2deeabb93628 NormScore:0.9608244562182952 Scores:map[binpack:0.9608244562182952 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]]] Scores:] ModifyIndex:1.495471e+06 ModifyTime:1.7255396027274028e+18 Name:29-b68e51b8-6b81-11ef-beaf-fa163efe023e.default-group[0] Namespace:poseidon NetworkStatus:map[Address: DNS: InterfaceName:] NodeID:919ea902-e1f4-de7f-0743-2deeabb93628 NodeName:nomad-agent-terraform-4 Resources:map[CPU:20 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA: Networks: SecretsMB:0] SharedResources:map[CPU:0 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:0 MemoryMaxMB:0 NUMA: Networks: SecretsMB:0] SignedIdentities:map[default-task:eyJhbGciOiJSUzI1NiIsImtpZCI6IjgzZTcyNThlLWFkNDktNTY0Ny0xNDljLWQzZWUxYzA3NzlmOCIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJub21hZHByb2plY3QuaW8iLCJpYXQiOjE3MjU1MzkwMzMsImp0aSI6IjJmNTQxYmJhLWFmNzctMzk1NS03NjZjLTE3MTAxZTI4NzY0YiIsIm5iZiI6MTcyNTUzOTAzMywibm9tYWRfYWxsb2NhdGlvbl9pZCI6IjJkZmIzNWUxLTNhZTctZDY4ZS1hZjM1LTE4YTJjMDVlYmRmOCIsIm5vbWFkX2pvYl9pZCI6IjI5LWI2OGU1MWI4LTZiODEtMTFlZi1iZWFmLWZhMTYzZWZlMDIzZSIsIm5vbWFkX25hbWVzcGFjZSI6InBvc2VpZG9uIiwibm9tYWRfdGFzayI6ImRlZmF1bHQtdGFzayIsInN1YiI6Imdsb2JhbDpwb3NlaWRvbjoyOS1iNjhlNTFiOC02YjgxLTExZWYtYmVhZi1mYTE2M2VmZTAyM2U6ZGVmYXVsdC1ncm91cDpkZWZhdWx0LXRhc2s6ZGVmYXVsdCJ9.CPBdpE-hNT6O-D7aybhsf7XaQGLcUISIrntZPlZbXQE20ed_9gFPOfiOYmCfrtlI_Kvjklk5Wm0lkVLnHxjHO5LqL22qw2YOioWtTyM_Uf1V1_Zqh78m7YlXn0YERdkMPUjt7HKTy9LcuBaggGz55tlTilah5cfE_OS6lmy5xbS9XHb-9ambIw-OcomUD8szPEH607bxi2tDhiOHEKEgZWOGMzF4CRRKt8hzxIWlCkAe_i8dRjjrH4epJczhrhi-7BUFcs4r_pDdbe20RMGqTbY0k9wtRMRwTfrdsQK7WLgREvUEY82nckODQHYCS23m9ogU0Yhy3k_61wKiOijpyg] SigningKeyID:83e7258e-ad49-5647-149c-d3ee1c0779f8 TaskGroup:default-group TaskResources:map[default-task:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA: Networks: SecretsMB:0]] TaskStates:map[default-task:map[Events:[map[Details:map[] DiskLimit:0 
DisplayMessage:Task received by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7255390330523681e+18 Type:Received ValidationError: VaultError:] map[Details:map[message:Building Task Directory] DiskLimit:0 DisplayMessage:Building Task Directory DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message:Building Task Directory RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7255390330567805e+18 Type:Task Setup ValidationError: VaultError:] map[Details:map[] DiskLimit:0 DisplayMessage:Task started by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7255390333553462e+18 Type:Started ValidationError: VaultError:]] Failed:false FinishedAt: LastRestart: Paused: Restarts:0 StartedAt:2024-09-05T12:23:53.35538723Z State:running TaskHandle:]]]],payload,poseidon_nomad_events,2dfb35e1-3ae7-d68e-af35-18a2c05ebdf8,production,12:33:22.750332479,Allocation,PlanResult ,,0,2024-09-05T12:30:00Z,2024-09-05T13:00:00Z,2024-09-05T12:37:23.488286265Z,map[Job:map[Affinities: AllAtOnce:false Constraints: ConsulNamespace: ConsulToken: CreateIndex:1.495363e+06 Datacenters:[dc1] DispatchIdempotencyToken: Dispatched:false ID:29-b68e51b8-6b81-11ef-beaf-fa163efe023e JobModifyIndex:1.495469e+06 Meta: ModifyIndex:1.495469e+06 Multiregion: Name:29-b68e51b8-6b81-11ef-beaf-fa163efe023e Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob: ParentID: Payload: Periodic: Priority:50 Region:global Spreads: Stable:false Status:running StatusDescription: Stop:false SubmitTime:1.7255396025520184e+18 TaskGroups:[map[Affinities: Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 Disconnect: EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect: Meta: Migrate: Name:default-group Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:3 Delay:6e+10 DelayFunction:exponential Interval:2.16e+13 MaxDelay:2.4e+11 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:3.6e+12 Mode:fail RenderTemplates:false] Scaling: Services: ShutdownDelay: Spreads:[map[Attribute:${node.unique.name} SpreadTarget: Weight:100]] StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_python:3.8 network_mode:none] Constraints: Consul: DispatchPayload: Driver:docker Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:default-task Resources:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA: Networks: SecretsMB:0] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:3.6e+12 Mode:fail RenderTemplates:false] ScalingPolicies: Schedule: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:] map[Affinities: Constraints: 
Consul:map[Cluster:default Namespace: Partition:] Count:0 Disconnect: EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect: Meta:map[timeout:180 used:true] Migrate: Name:config Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling: Services: ShutdownDelay: Spreads: StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[command:true] Constraints: Consul: DispatchPayload: Driver:exec Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:config Resources:map[CPU:1 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA: Networks: SecretsMB:0] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Schedule: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:]] Type:batch UI: Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 MaxParallel:0 MinHealthyTime:0 ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:1]],payload,poseidon_nomad_events,29-b68e51b8-6b81-11ef-beaf-fa163efe023e,production,12:37:23.196523085,Job,JobDeregistered ```

Only a JobDeregistered event (topic: Job) is sent, telling Poseidon that the Job stopped. But currently, Poseidon does not handle Job events (it just dumps them to InfluxDB). The Allocation event with DesiredStatus: stop that Poseidon would handle was never sent. Therefore, Poseidon continued to assume that the Job still exists.

We might start listening to the JobDeregistered events and also stop runners based on them (note: we should not remove the same runner twice).
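A minimal sketch of that idea, with a simplified runner store standing in for Poseidon's actual types; deleting the map entry before destroying the runner is what guards against removing the same runner twice:

```go
// Sketch: stopping runners on JobDeregistered with a "remove once" guard.
// Runner and RunnerManager are simplified stand-ins, not Poseidon's types.
package main

import (
	"fmt"
	"sync"
)

type Runner struct{ ID string }

func (r *Runner) Destroy(reason string) {
	fmt.Printf("destroying runner %s: %s\n", r.ID, reason)
}

type RunnerManager struct {
	mu      sync.Mutex
	runners map[string]*Runner // used runners, keyed by job ID
}

// OnJobDeregistered stops the runner belonging to a deregistered job.
// Removing the map entry under the lock before destroying the runner
// ensures that a later Allocation event (or a duplicate Job event)
// cannot remove the same runner twice.
func (m *RunnerManager) OnJobDeregistered(jobID string) {
	m.mu.Lock()
	runner, ok := m.runners[jobID]
	if ok {
		delete(m.runners, jobID)
	}
	m.mu.Unlock()
	if !ok {
		return // already removed, nothing to do
	}
	runner.Destroy("the job was deregistered")
}

func main() {
	const jobID = "29-b68e51b8-6b81-11ef-beaf-fa163efe023e"
	m := &RunnerManager{runners: map[string]*Runner{jobID: {ID: jobID}}}
	m.OnJobDeregistered(jobID)
	m.OnJobDeregistered(jobID) // no-op: the runner was already removed
}
```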

MrSerth commented 1 week ago

Thanks for looking into this issue already; this really helps to track down potential issues.

To add more context: According to CodeOcean logs:

I've also verified that CodeOcean and Poseidon use the same time base (at least up to, and including, the seconds). Therefore, just to clarify: the timings you provided for the three affected runners are the timestamps when the issue occurred (i.e., learners posting files to a non-existent runner), right?

> First, we see that the users tried for multiple minutes to run their execution, always failing.

Ah, yes (see my previous comment).

> It would be better if CodeOcean requested a fresh runner when it receives an Internal Server Error (multiple times) while copying files.

We handle the case where the runner is non-existent and properly reported by Poseidon through a 410 error:

https://github.com/openHPI/codeocean/blob/6a0c4976baf24b02e659145e912c494fe05b6557/lib/runner/strategy/poseidon.rb#L291-L292
https://github.com/openHPI/codeocean/blob/c4e819df46a220dd6f59158cfde86a8779deffb9/app/models/runner.rb#L46-L55

Other status codes (or an Internal Server Error) currently do not cause a request for a new runner. Do you feel we should catch more errors in CodeOcean and/or handle the error better in Poseidon and return a proper 410 response? Both might make sense, I'd say 🙂. My proposal is presented in https://github.com/openHPI/codeocean/pull/2511, but I would still suggest fixing the Poseidon error, too.
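For illustration, a rough sketch of the Poseidon-side fix, mapping the "no allocation found" executor error to a 410 Gone response instead of 500. The error variable, route, and handler wiring are assumptions for the sketch, not Poseidon's actual handler:

```go
// Sketch: returning 410 Gone when the runner's allocation no longer exists.
package main

import (
	"errors"
	"log"
	"net/http"
)

// ErrNoAllocationFound is an illustrative sentinel for the executor error
// seen in the logs ("... no allocation found").
var ErrNoAllocationFound = errors.New("no allocation found")

func updateFileSystemHandler(updateFS func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := updateFS(); err != nil {
			if errors.Is(err, ErrNoAllocationFound) {
				// The allocation is gone for good: a 410 tells CodeOcean
				// to request a fresh runner instead of retrying.
				http.Error(w, err.Error(), http.StatusGone)
				return
			}
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	}
}

func main() {
	// Demo wiring with a stub that always fails like the logs above.
	http.HandleFunc("/files", updateFileSystemHandler(func() error {
		return ErrNoAllocationFound
	}))
	log.Fatal(http.ListenAndServe(":7200", nil))
}
```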

mpass99 commented 4 days ago

> The timings you provided for the three affected runners are the timestamps when the issue occurred (i.e., learners posting files to a non-existent runner), right?

Yes

> Both might make sense, I'd say 🙂

I agree. Thanks already for your proposal. In #682 you can find an approach for JobDeregistered handling.