openHPI / poseidon

Scalable task execution orchestrator for CodeOcean
MIT License

Prewarming Pool Alert #587

Open · sentry-io[bot] opened this issue 2 months ago

sentry-io[bot] commented 2 months ago

Sentry Issue: POSEIDON-5H

Prewarming Pool Alert. Reloading environment

In this event, environment 10 (java-8) was reloaded. The event happened in the context of a deployment.


Assumed issues:

  1. Why were runners lost after the Nomad agent restarts?
  2. What happened only after 10 minutes that led to the prewarming pool being full again?
  3. Why did Poseidon reload the environment even though, at this point, the idle runner count matched the prewarming pool size?
MrSerth commented 2 months ago

The prewarming pool size for environment 10 was increased from 5 to 15 at 2024-05-02T08:13:18.776145Z.

mpass99 commented 1 month ago

Why were runners lost after the Nomad agent restarts?

Looking at the `poseidon_nomad_idle_runners` measurement:

Here, 3 runners get lost.

```log
,,0,2024-04-26T11:35:25.279073568Z,4,deletion,10-064a2c21-03c1-11ef-b832-fa163e7afdf8 -> Nomad agent restart. Never replaced
,,0,2024-04-26T11:35:25.279073568Z,3,deletion,10-c1b06151-03c0-11ef-b832-fa163e7afdf8 -> Nomad agent restart. Never replaced
,,0,2024-04-26T11:35:26.285209424Z,2,deletion,10-b4a430fa-03c0-11ef-b832-fa163e7afdf8 -> being used #1
,,0,2024-04-26T11:35:26.285209424Z,3,creation,10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 -> RACE CONDITION. Creation before deletion. Why is the element count not 2?
,,0,2024-04-26T11:35:26.285209424Z,1,deletion,10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 -> Deletion falsely after the creation.
```

We see that one runner is lost due to bug #602. The two other runners are lost during the Nomad restart/deployment.

Looking at the `poseidon_nomad_events` measurement:

We see that the Job is started one time correctly. After the Nomad restart it tries two further times, but fails due to an unknown reason. Then, our configured limit of 3 attempts is reached. `topic: Job` ```log # Starting ,,0,map[Job:map[Affinities: AllAtOnce:false Constraints: ConsulNamespace: ConsulToken: CreateIndex:649000 Datacenters:[dc1] DispatchIdempotencyToken: Dispatched:false ID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 JobModifyIndex:649000 Meta: ModifyIndex:649000 Multiregion: Name:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob: ParentID: Payload: Periodic: Priority:50 Region:global Spreads: Stable:false Status:pending StatusDescription: Stop:false SubmitTime:1.7141313040896648e+18 TaskGroups:[map[Affinities: Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect: Meta: Migrate: Name:default-group Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:0 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:true] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:map[CreateIndex:649000 Enabled:true ID:f4720d9e-4215-7cd2-1927-c9406aba162c Max:300 Min:0 ModifyIndex:649000 Policy: Target:map[Group:default-group Job:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Namespace:poseidon] Type:horizontal] Services: ShutdownDelay: Spreads:[map[Attribute:${node.unique.name} SpreadTarget: Weight:100]] StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_java:8-antlr network_mode:none] Constraints: Consul: DispatchPayload: Driver:docker Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:default-task Resources:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:] map[Affinities: Constraints: Consul:map[Cluster:default Namespace: Partition:] Count:0 EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect: Meta:map[used:false] Migrate: Name:config Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling: Services: ShutdownDelay: Spreads: StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[command:true] Constraints: Consul: DispatchPayload: Driver:exec Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:config Resources:map[CPU:1 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail 
RenderTemplates:false] ScalingPolicies: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:]] Type:batch Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 MaxParallel:0 MinHealthyTime:0 ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:0]],payload,10-064a2c21-03c1-11ef-b832-fa163e7afdf8,11:35:04.188051118,Job,JobRegistered ,,0,map[Job:map[Affinities: AllAtOnce:false Constraints: ConsulNamespace: ConsulToken: CreateIndex:649000 Datacenters:[dc1] DispatchIdempotencyToken: Dispatched:false ID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 JobModifyIndex:649000 Meta: ModifyIndex:649002 Multiregion: Name:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob: ParentID: Payload: Periodic: Priority:50 Region:global Spreads: Stable:false Status:running StatusDescription: Stop:false SubmitTime:1.7141313040896648e+18 TaskGroups:[map[Affinities: Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect: Meta: Migrate: Name:default-group Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:0 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:true] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:map[CreateIndex:649000 Enabled:true ID:f4720d9e-4215-7cd2-1927-c9406aba162c Max:300 Min:0 ModifyIndex:649000 Policy: Target:map[Group:default-group Job:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Namespace:poseidon] Type:horizontal] Services: ShutdownDelay: Spreads:[map[Attribute:${node.unique.name} SpreadTarget: Weight:100]] StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_java:8-antlr network_mode:none] Constraints: Consul: DispatchPayload: Driver:docker Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:default-task Resources:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:] map[Affinities: Constraints: Consul:map[Cluster:default Namespace: Partition:] Count:0 EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect: Meta:map[used:false] Migrate: Name:config Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling: Services: ShutdownDelay: Spreads: StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[command:true] Constraints: Consul: DispatchPayload: Driver:exec Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: 
Name:config Resources:map[CPU:1 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:]] Type:batch Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 MaxParallel:0 MinHealthyTime:0 ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:0]],payload,10-064a2c21-03c1-11ef-b832-fa163e7afdf8,11:35:04.285613118,Job,PlanResult # Trying ,,0,map[Job:map[Affinities: AllAtOnce:false Constraints: ConsulNamespace: ConsulToken: CreateIndex:649000 Datacenters:[dc1] DispatchIdempotencyToken: Dispatched:false ID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 JobModifyIndex:649000 Meta: ModifyIndex:649117 Multiregion: Name:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob: ParentID: Payload: Periodic: Priority:50 Region:global Spreads: Stable:false Status:pending StatusDescription: Stop:false SubmitTime:1.7141313040896648e+18 TaskGroups:[map[Affinities: Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect: Meta: Migrate: Name:default-group Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:0 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:true] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:map[CreateIndex:649000 Enabled:true ID:f4720d9e-4215-7cd2-1927-c9406aba162c Max:300 Min:0 ModifyIndex:649000 Policy: Target:map[Group:default-group Job:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Namespace:poseidon] Type:horizontal] Services: ShutdownDelay: Spreads:[map[Attribute:${node.unique.name} SpreadTarget: Weight:100]] StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_java:8-antlr network_mode:none] Constraints: Consul: DispatchPayload: Driver:docker Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:default-task Resources:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:] map[Affinities: Constraints: Consul:map[Cluster:default Namespace: Partition:] Count:0 EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect: Meta:map[used:false] Migrate: Name:config Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling: Services: ShutdownDelay: Spreads: StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[command:true] Constraints: Consul: DispatchPayload: Driver:exec Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: 
Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:config Resources:map[CPU:1 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:]] Type:batch Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 MaxParallel:0 MinHealthyTime:0 ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:0]],payload,10-064a2c21-03c1-11ef-b832-fa163e7afdf8,11:35:24.840341484,Job,PlanResult # Dead ,,0,map[Job:map[Affinities: AllAtOnce:false Constraints: ConsulNamespace: ConsulToken: CreateIndex:649000 Datacenters:[dc1] DispatchIdempotencyToken: Dispatched:false ID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 JobModifyIndex:649000 Meta: ModifyIndex:649325 Multiregion: Name:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob: ParentID: Payload: Periodic: Priority:50 Region:global Spreads: Stable:false Status:dead StatusDescription: Stop:false SubmitTime:1.7141313040896648e+18 TaskGroups:[map[Affinities: Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect: Meta: Migrate: Name:default-group Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:0 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:true] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:map[CreateIndex:649000 Enabled:true ID:f4720d9e-4215-7cd2-1927-c9406aba162c Max:300 Min:0 ModifyIndex:649000 Policy: Target:map[Group:default-group Job:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Namespace:poseidon] Type:horizontal] Services: ShutdownDelay: Spreads:[map[Attribute:${node.unique.name} SpreadTarget: Weight:100]] StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_java:8-antlr network_mode:none] Constraints: Consul: DispatchPayload: Driver:docker Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:default-task Resources:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:] map[Affinities: Constraints: Consul:map[Cluster:default Namespace: Partition:] Count:0 EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect: Meta:map[used:false] Migrate: Name:config Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling: Services: ShutdownDelay: Spreads: StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: 
CSIPluginConfig: Config:map[command:true] Constraints: Consul: DispatchPayload: Driver:exec Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:config Resources:map[CPU:1 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:]] Type:batch Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 MaxParallel:0 MinHealthyTime:0 ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:0]],payload,10-064a2c21-03c1-11ef-b832-fa163e7afdf8,11:35:25.146943614,Job,EvaluationUpdated # Trying ,,0,map[Job:map[Affinities: AllAtOnce:false Constraints: ConsulNamespace: ConsulToken: CreateIndex:649000 Datacenters:[dc1] DispatchIdempotencyToken: Dispatched:false ID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 JobModifyIndex:649000 Meta: ModifyIndex:649353 Multiregion: Name:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob: ParentID: Payload: Periodic: Priority:50 Region:global Spreads: Stable:false Status:pending StatusDescription: Stop:false SubmitTime:1.7141313040896648e+18 TaskGroups:[map[Affinities: Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect: Meta: Migrate: Name:default-group Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:0 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:true] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:map[CreateIndex:649000 Enabled:true ID:f4720d9e-4215-7cd2-1927-c9406aba162c Max:300 Min:0 ModifyIndex:649000 Policy: Target:map[Group:default-group Job:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Namespace:poseidon] Type:horizontal] Services: ShutdownDelay: Spreads:[map[Attribute:${node.unique.name} SpreadTarget: Weight:100]] StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_java:8-antlr network_mode:none] Constraints: Consul: DispatchPayload: Driver:docker Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:default-task Resources:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:] map[Affinities: Constraints: Consul:map[Cluster:default Namespace: Partition:] Count:0 EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect: Meta:map[used:false] Migrate: Name:config Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] 
RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling: Services: ShutdownDelay: Spreads: StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[command:true] Constraints: Consul: DispatchPayload: Driver:exec Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:config Resources:map[CPU:1 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:]] Type:batch Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 MaxParallel:0 MinHealthyTime:0 ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:0]],payload,10-064a2c21-03c1-11ef-b832-fa163e7afdf8,11:35:25.396971201,Job,EvaluationUpdated # Dead ,,0,map[Job:map[Affinities: AllAtOnce:false Constraints: ConsulNamespace: ConsulToken: CreateIndex:649000 Datacenters:[dc1] DispatchIdempotencyToken: Dispatched:false ID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 JobModifyIndex:649000 Meta: ModifyIndex:649522 Multiregion: Name:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob: ParentID: Payload: Periodic: Priority:50 Region:global Spreads: Stable:false Status:dead StatusDescription: Stop:false SubmitTime:1.7141313040896648e+18 TaskGroups:[map[Affinities: Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect: Meta: Migrate: Name:default-group Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:0 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:true] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:map[CreateIndex:649000 Enabled:true ID:f4720d9e-4215-7cd2-1927-c9406aba162c Max:300 Min:0 ModifyIndex:649000 Policy: Target:map[Group:default-group Job:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Namespace:poseidon] Type:horizontal] Services: ShutdownDelay: Spreads:[map[Attribute:${node.unique.name} SpreadTarget: Weight:100]] StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_java:8-antlr network_mode:none] Constraints: Consul: DispatchPayload: Driver:docker Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:default-task Resources:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:] map[Affinities: Constraints: Consul:map[Cluster:default Namespace: Partition:] Count:0 EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect: 
Meta:map[used:false] Migrate: Name:config Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling: Services: ShutdownDelay: Spreads: StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[command:true] Constraints: Consul: DispatchPayload: Driver:exec Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:config Resources:map[CPU:1 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:]] Type:batch Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 MaxParallel:0 MinHealthyTime:0 ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:0]],payload,10-064a2c21-03c1-11ef-b832-fa163e7afdf8,11:35:25.878160325,Job,EvaluationUpdated ``` We see that only one Allocation is created (and stopped). Therefore, the issue mus lie in a higher level. `topic: Allocation` ```log # Started at 11:35:04 ,,0,map[Allocation:map[AllocModifyIndex:649002 AllocatedResources:map[Shared:map[DiskMB:10 Networks: Ports:] TaskLifecycles:map[default-task:] Tasks:map[default-task:map[Cpu:map[CpuShares:20 ReservedCores:] Devices: Memory:map[MemoryMB:30 MemoryMaxMB:512] Networks:]]] ClientStatus:pending CreateIndex:649002 CreateTime:1.7141313042584087e+18 DesiredStatus:run EvalID:df86ab28-d17b-44b9-eefd-242e7d269468 ID:822b87a3-e1f5-2f45-f751-9453277b12ef JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Metrics:map[AllocationTime:433166 ClassExhausted: ClassFiltered: CoalescedFailures:0 ConstraintFiltered: DimensionExhausted: NodesAvailable:map[dc1:4] NodesEvaluated:4 NodesExhausted:0 NodesFiltered:0 NodesInPool:4 QuotaExhausted: ResourcesExhausted: ScoreMetaData:[map[NodeID:ecb75941-e320-1893-4fec-0cd84d19944a NormScore:0.9646746326438289 Scores:map[binpack:0.9646746326438289 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NormScore:0.9646746326438289 Scores:map[binpack:0.9646746326438289 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:d67666ba-268a-d44e-af22-23d0a4e5963e NormScore:0.9627466772918187 Scores:map[binpack:0.9627466772918187 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:c8b3d968-fdd6-d09d-b6de-77d2c6fe6b9c NormScore:0.9627466772918187 Scores:map[binpack:0.9627466772918187 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]]] Scores:] ModifyIndex:649002 ModifyTime:1.7141313042584087e+18 Name:10-064a2c21-03c1-11ef-b832-fa163e7afdf8.default-group[0] Namespace:poseidon NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NodeName:nomad-agent-terraform-4 Resources:map[CPU:20 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:] SharedResources:map[CPU:0 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:0 MemoryMaxMB:0 NUMA: Networks:] 
SignedIdentities:map[default-task:eyJhbGciOiJSUzI1NiIsImtpZCI6IjNmMzA5OWFlLTZlN2QtZTc2Ni1lOGZiLWI2ZWViNWMxYTFiNSIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJub21hZHByb2plY3QuaW8iLCJpYXQiOjE3MTQxMzEzMDQsImp0aSI6ImJlOTc4MjVlLTVjMGQtMTNkNi01ZGUzLThjMzkyYzg3MDdhZiIsIm5iZiI6MTcxNDEzMTMwNCwibm9tYWRfYWxsb2NhdGlvbl9pZCI6IjgyMmI4N2EzLWUxZjUtMmY0NS1mNzUxLTk0NTMyNzdiMTJlZiIsIm5vbWFkX2pvYl9pZCI6IjEwLTA2NGEyYzIxLTAzYzEtMTFlZi1iODMyLWZhMTYzZTdhZmRmOCIsIm5vbWFkX25hbWVzcGFjZSI6InBvc2VpZG9uIiwibm9tYWRfdGFzayI6ImRlZmF1bHQtdGFzayIsInN1YiI6Imdsb2JhbDpwb3NlaWRvbjoxMC0wNjRhMmMyMS0wM2MxLTExZWYtYjgzMi1mYTE2M2U3YWZkZjg6ZGVmYXVsdC1ncm91cDpkZWZhdWx0LXRhc2s6ZGVmYXVsdCJ9.2meiZCVDIcINngbcIBUcvGrkk_ltpMWk1GqSQF1AMW46tqveBV4rSN4iYzEjZaaM34rxzbzyNYZ4WDPhQSLXRzZwWpUrk8niNeasXL2AWR5o7nJqJwNYip_yVcaoh90SwSEerdKqI8lQzR452Gvzi773aRlAO1IL8t9ZfC4pfHty-qxREhdKFukLe3ugMDs84u1-j3fLWTHwmv-1el_Wki363Uv0SeBoCDefef5Nac58DP30hQ_tgd7XtF-ea6ElufuzJmOUsqgh6ZiU9fbiMBy4Z7sqMgR-2NgcTIPdKbK7Dp-YiUYyu-ugVhpnRvJ9tOem5xkux4wmcLVwN5BCEQ] SigningKeyID:3f3099ae-6e7d-e766-e8fb-b6eeb5c1a1b5 TaskGroup:default-group TaskResources:map[default-task:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:]]]],payload,822b87a3-e1f5-2f45-f751-9453277b12ef,11:35:04.284796261,Allocation,PlanResult ,,0,map[Allocation:map[AllocModifyIndex:649002 AllocatedResources:map[Shared:map[DiskMB:10 Networks: Ports:] TaskLifecycles:map[default-task:] Tasks:map[default-task:map[Cpu:map[CpuShares:20 ReservedCores:] Devices: Memory:map[MemoryMB:30 MemoryMaxMB:512] Networks:]]] ClientDescription:No tasks have started ClientStatus:pending CreateIndex:649002 CreateTime:1.7141313042584087e+18 DesiredStatus:run EvalID:df86ab28-d17b-44b9-eefd-242e7d269468 ID:822b87a3-e1f5-2f45-f751-9453277b12ef JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Metrics:map[AllocationTime:433166 ClassExhausted: ClassFiltered: CoalescedFailures:0 ConstraintFiltered: DimensionExhausted: NodesAvailable:map[dc1:4] NodesEvaluated:4 NodesExhausted:0 NodesFiltered:0 NodesInPool:4 QuotaExhausted: ResourcesExhausted: ScoreMetaData:[map[NodeID:ecb75941-e320-1893-4fec-0cd84d19944a NormScore:0.9646746326438289 Scores:map[binpack:0.9646746326438289 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NormScore:0.9646746326438289 Scores:map[binpack:0.9646746326438289 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:d67666ba-268a-d44e-af22-23d0a4e5963e NormScore:0.9627466772918187 Scores:map[binpack:0.9627466772918187 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:c8b3d968-fdd6-d09d-b6de-77d2c6fe6b9c NormScore:0.9627466772918187 Scores:map[binpack:0.9627466772918187 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]]] Scores:] ModifyIndex:649007 ModifyTime:1.7141313045156447e+18 Name:10-064a2c21-03c1-11ef-b832-fa163e7afdf8.default-group[0] Namespace:poseidon NetworkStatus:map[Address: DNS: InterfaceName:] NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NodeName:nomad-agent-terraform-4 Resources:map[CPU:20 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:] SharedResources:map[CPU:0 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:0 MemoryMaxMB:0 NUMA: Networks:] 
SignedIdentities:map[default-task:eyJhbGciOiJSUzI1NiIsImtpZCI6IjNmMzA5OWFlLTZlN2QtZTc2Ni1lOGZiLWI2ZWViNWMxYTFiNSIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJub21hZHByb2plY3QuaW8iLCJpYXQiOjE3MTQxMzEzMDQsImp0aSI6ImJlOTc4MjVlLTVjMGQtMTNkNi01ZGUzLThjMzkyYzg3MDdhZiIsIm5iZiI6MTcxNDEzMTMwNCwibm9tYWRfYWxsb2NhdGlvbl9pZCI6IjgyMmI4N2EzLWUxZjUtMmY0NS1mNzUxLTk0NTMyNzdiMTJlZiIsIm5vbWFkX2pvYl9pZCI6IjEwLTA2NGEyYzIxLTAzYzEtMTFlZi1iODMyLWZhMTYzZTdhZmRmOCIsIm5vbWFkX25hbWVzcGFjZSI6InBvc2VpZG9uIiwibm9tYWRfdGFzayI6ImRlZmF1bHQtdGFzayIsInN1YiI6Imdsb2JhbDpwb3NlaWRvbjoxMC0wNjRhMmMyMS0wM2MxLTExZWYtYjgzMi1mYTE2M2U3YWZkZjg6ZGVmYXVsdC1ncm91cDpkZWZhdWx0LXRhc2s6ZGVmYXVsdCJ9.2meiZCVDIcINngbcIBUcvGrkk_ltpMWk1GqSQF1AMW46tqveBV4rSN4iYzEjZaaM34rxzbzyNYZ4WDPhQSLXRzZwWpUrk8niNeasXL2AWR5o7nJqJwNYip_yVcaoh90SwSEerdKqI8lQzR452Gvzi773aRlAO1IL8t9ZfC4pfHty-qxREhdKFukLe3ugMDs84u1-j3fLWTHwmv-1el_Wki363Uv0SeBoCDefef5Nac58DP30hQ_tgd7XtF-ea6ElufuzJmOUsqgh6ZiU9fbiMBy4Z7sqMgR-2NgcTIPdKbK7Dp-YiUYyu-ugVhpnRvJ9tOem5xkux4wmcLVwN5BCEQ] SigningKeyID:3f3099ae-6e7d-e766-e8fb-b6eeb5c1a1b5 TaskGroup:default-group TaskResources:map[default-task:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:]] TaskStates:map[default-task:map[Events:[map[Details:map[] DiskLimit:0 DisplayMessage:Task received by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313042685432e+18 Type:Received ValidationError: VaultError:] map[Details:map[message:Building Task Directory] DiskLimit:0 DisplayMessage:Building Task Directory DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message:Building Task Directory RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313042715858e+18 Type:Task Setup ValidationError: VaultError:]] Failed:false FinishedAt: LastRestart: Restarts:0 StartedAt: State:pending TaskHandle:]]]],payload,822b87a3-e1f5-2f45-f751-9453277b12ef,11:35:04.651383748,Allocation,AllocationUpdated ,,0,map[Allocation:map[AllocModifyIndex:649002 AllocatedResources:map[Shared:map[DiskMB:10 Networks: Ports:] TaskLifecycles:map[default-task:] Tasks:map[default-task:map[Cpu:map[CpuShares:20 ReservedCores:] Devices: Memory:map[MemoryMB:30 MemoryMaxMB:512] Networks:]]] ClientDescription:Tasks are running ClientStatus:running CreateIndex:649002 CreateTime:1.7141313042584087e+18 DesiredStatus:run EvalID:df86ab28-d17b-44b9-eefd-242e7d269468 ID:822b87a3-e1f5-2f45-f751-9453277b12ef JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Metrics:map[AllocationTime:433166 ClassExhausted: ClassFiltered: CoalescedFailures:0 ConstraintFiltered: DimensionExhausted: NodesAvailable:map[dc1:4] NodesEvaluated:4 NodesExhausted:0 NodesFiltered:0 NodesInPool:4 QuotaExhausted: ResourcesExhausted: ScoreMetaData:[map[NodeID:ecb75941-e320-1893-4fec-0cd84d19944a NormScore:0.9646746326438289 Scores:map[binpack:0.9646746326438289 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NormScore:0.9646746326438289 Scores:map[binpack:0.9646746326438289 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:d67666ba-268a-d44e-af22-23d0a4e5963e NormScore:0.9627466772918187 Scores:map[binpack:0.9627466772918187 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] 
map[NodeID:c8b3d968-fdd6-d09d-b6de-77d2c6fe6b9c NormScore:0.9627466772918187 Scores:map[binpack:0.9627466772918187 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]]] Scores:] ModifyIndex:649008 ModifyTime:1.7141313047728558e+18 Name:10-064a2c21-03c1-11ef-b832-fa163e7afdf8.default-group[0] Namespace:poseidon NetworkStatus:map[Address: DNS: InterfaceName:] NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NodeName:nomad-agent-terraform-4 Resources:map[CPU:20 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:] SharedResources:map[CPU:0 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:0 MemoryMaxMB:0 NUMA: Networks:] SignedIdentities:map[default-task:eyJhbGciOiJSUzI1NiIsImtpZCI6IjNmMzA5OWFlLTZlN2QtZTc2Ni1lOGZiLWI2ZWViNWMxYTFiNSIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJub21hZHByb2plY3QuaW8iLCJpYXQiOjE3MTQxMzEzMDQsImp0aSI6ImJlOTc4MjVlLTVjMGQtMTNkNi01ZGUzLThjMzkyYzg3MDdhZiIsIm5iZiI6MTcxNDEzMTMwNCwibm9tYWRfYWxsb2NhdGlvbl9pZCI6IjgyMmI4N2EzLWUxZjUtMmY0NS1mNzUxLTk0NTMyNzdiMTJlZiIsIm5vbWFkX2pvYl9pZCI6IjEwLTA2NGEyYzIxLTAzYzEtMTFlZi1iODMyLWZhMTYzZTdhZmRmOCIsIm5vbWFkX25hbWVzcGFjZSI6InBvc2VpZG9uIiwibm9tYWRfdGFzayI6ImRlZmF1bHQtdGFzayIsInN1YiI6Imdsb2JhbDpwb3NlaWRvbjoxMC0wNjRhMmMyMS0wM2MxLTExZWYtYjgzMi1mYTE2M2U3YWZkZjg6ZGVmYXVsdC1ncm91cDpkZWZhdWx0LXRhc2s6ZGVmYXVsdCJ9.2meiZCVDIcINngbcIBUcvGrkk_ltpMWk1GqSQF1AMW46tqveBV4rSN4iYzEjZaaM34rxzbzyNYZ4WDPhQSLXRzZwWpUrk8niNeasXL2AWR5o7nJqJwNYip_yVcaoh90SwSEerdKqI8lQzR452Gvzi773aRlAO1IL8t9ZfC4pfHty-qxREhdKFukLe3ugMDs84u1-j3fLWTHwmv-1el_Wki363Uv0SeBoCDefef5Nac58DP30hQ_tgd7XtF-ea6ElufuzJmOUsqgh6ZiU9fbiMBy4Z7sqMgR-2NgcTIPdKbK7Dp-YiUYyu-ugVhpnRvJ9tOem5xkux4wmcLVwN5BCEQ] SigningKeyID:3f3099ae-6e7d-e766-e8fb-b6eeb5c1a1b5 TaskGroup:default-group TaskResources:map[default-task:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:]] TaskStates:map[default-task:map[Events:[map[Details:map[] DiskLimit:0 DisplayMessage:Task received by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313042685432e+18 Type:Received ValidationError: VaultError:] map[Details:map[message:Building Task Directory] DiskLimit:0 DisplayMessage:Building Task Directory DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message:Building Task Directory RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313042715858e+18 Type:Task Setup ValidationError: VaultError:] map[Details:map[] DiskLimit:0 DisplayMessage:Task started by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313046385912e+18 Type:Started ValidationError: VaultError:]] Failed:false FinishedAt: LastRestart: Restarts:0 StartedAt:2024-04-26T11:35:04.638637612Z State:running TaskHandle:]]]],payload,822b87a3-e1f5-2f45-f751-9453277b12ef,11:35:04.901619641,Allocation,AllocationUpdated # Stopped at 11:35:24 ,,0,map[Allocation:map[AllocModifyIndex:649114 AllocatedResources:map[Shared:map[DiskMB:10 Networks: Ports:] TaskLifecycles:map[default-task:] Tasks:map[default-task:map[Cpu:map[CpuShares:20 ReservedCores:] Devices: Memory:map[MemoryMB:30 MemoryMaxMB:512] Networks:]]] ClientDescription:Tasks are 
running ClientStatus:running CreateIndex:649002 CreateTime:1.7141313042584087e+18 DesiredStatus:run DesiredTransition:map[ForceReschedule: Migrate:true NoShutdownDelay: Reschedule:] EvalID:df86ab28-d17b-44b9-eefd-242e7d269468 ID:822b87a3-e1f5-2f45-f751-9453277b12ef JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Metrics:map[AllocationTime:433166 ClassExhausted: ClassFiltered: CoalescedFailures:0 ConstraintFiltered: DimensionExhausted: NodesAvailable:map[dc1:4] NodesEvaluated:4 NodesExhausted:0 NodesFiltered:0 NodesInPool:4 QuotaExhausted: ResourcesExhausted: ScoreMetaData:[map[NodeID:ecb75941-e320-1893-4fec-0cd84d19944a NormScore:0.9646746326438289 Scores:map[binpack:0.9646746326438289 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NormScore:0.9646746326438289 Scores:map[binpack:0.9646746326438289 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:d67666ba-268a-d44e-af22-23d0a4e5963e NormScore:0.9627466772918187 Scores:map[binpack:0.9627466772918187 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:c8b3d968-fdd6-d09d-b6de-77d2c6fe6b9c NormScore:0.9627466772918187 Scores:map[binpack:0.9627466772918187 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]]] Scores:] ModifyIndex:649114 ModifyTime:1.7141313047728558e+18 Name:10-064a2c21-03c1-11ef-b832-fa163e7afdf8.default-group[0] Namespace:poseidon NetworkStatus:map[Address: DNS: InterfaceName:] NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NodeName:nomad-agent-terraform-4 Resources:map[CPU:20 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:] SharedResources:map[CPU:0 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:0 MemoryMaxMB:0 NUMA: Networks:] SignedIdentities:map[default-task:eyJhbGciOiJSUzI1NiIsImtpZCI6IjNmMzA5OWFlLTZlN2QtZTc2Ni1lOGZiLWI2ZWViNWMxYTFiNSIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJub21hZHByb2plY3QuaW8iLCJpYXQiOjE3MTQxMzEzMDQsImp0aSI6ImJlOTc4MjVlLTVjMGQtMTNkNi01ZGUzLThjMzkyYzg3MDdhZiIsIm5iZiI6MTcxNDEzMTMwNCwibm9tYWRfYWxsb2NhdGlvbl9pZCI6IjgyMmI4N2EzLWUxZjUtMmY0NS1mNzUxLTk0NTMyNzdiMTJlZiIsIm5vbWFkX2pvYl9pZCI6IjEwLTA2NGEyYzIxLTAzYzEtMTFlZi1iODMyLWZhMTYzZTdhZmRmOCIsIm5vbWFkX25hbWVzcGFjZSI6InBvc2VpZG9uIiwibm9tYWRfdGFzayI6ImRlZmF1bHQtdGFzayIsInN1YiI6Imdsb2JhbDpwb3NlaWRvbjoxMC0wNjRhMmMyMS0wM2MxLTExZWYtYjgzMi1mYTE2M2U3YWZkZjg6ZGVmYXVsdC1ncm91cDpkZWZhdWx0LXRhc2s6ZGVmYXVsdCJ9.2meiZCVDIcINngbcIBUcvGrkk_ltpMWk1GqSQF1AMW46tqveBV4rSN4iYzEjZaaM34rxzbzyNYZ4WDPhQSLXRzZwWpUrk8niNeasXL2AWR5o7nJqJwNYip_yVcaoh90SwSEerdKqI8lQzR452Gvzi773aRlAO1IL8t9ZfC4pfHty-qxREhdKFukLe3ugMDs84u1-j3fLWTHwmv-1el_Wki363Uv0SeBoCDefef5Nac58DP30hQ_tgd7XtF-ea6ElufuzJmOUsqgh6ZiU9fbiMBy4Z7sqMgR-2NgcTIPdKbK7Dp-YiUYyu-ugVhpnRvJ9tOem5xkux4wmcLVwN5BCEQ] SigningKeyID:3f3099ae-6e7d-e766-e8fb-b6eeb5c1a1b5 TaskGroup:default-group TaskResources:map[default-task:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:]] TaskStates:map[default-task:map[Events:[map[Details:map[] DiskLimit:0 DisplayMessage:Task received by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313042685432e+18 Type:Received ValidationError: VaultError:] map[Details:map[message:Building Task Directory] DiskLimit:0 DisplayMessage:Building Task Directory DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: 
KillError: KillReason: KillTimeout:0 Message:Building Task Directory RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313042715858e+18 Type:Task Setup ValidationError: VaultError:] map[Details:map[] DiskLimit:0 DisplayMessage:Task started by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313046385912e+18 Type:Started ValidationError: VaultError:]] Failed:false FinishedAt: LastRestart: Restarts:0 StartedAt:2024-04-26T11:35:04.638637612Z State:running TaskHandle:]]]],payload,822b87a3-e1f5-2f45-f751-9453277b12ef,11:35:24.776103251,Allocation,AllocationUpdateDesiredStatus ,,0,map[Allocation:map[AllocModifyIndex:649117 AllocatedResources:map[Shared:map[DiskMB:10 Networks: Ports:] TaskLifecycles:map[default-task:] Tasks:map[default-task:map[Cpu:map[CpuShares:20 ReservedCores:] Devices: Memory:map[MemoryMB:30 MemoryMaxMB:512] Networks:]]] ClientDescription:Tasks are running ClientStatus:running CreateIndex:649002 CreateTime:1.7141313042584087e+18 DesiredDescription:alloc is being migrated DesiredStatus:stop DesiredTransition:map[ForceReschedule: Migrate:true NoShutdownDelay: Reschedule:] EvalID:df86ab28-d17b-44b9-eefd-242e7d269468 ID:822b87a3-e1f5-2f45-f751-9453277b12ef JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Metrics:map[AllocationTime:433166 ClassExhausted: ClassFiltered: CoalescedFailures:0 ConstraintFiltered: DimensionExhausted: NodesAvailable:map[dc1:4] NodesEvaluated:4 NodesExhausted:0 NodesFiltered:0 NodesInPool:4 QuotaExhausted: ResourcesExhausted: ScoreMetaData:[map[NodeID:ecb75941-e320-1893-4fec-0cd84d19944a NormScore:0.9646746326438289 Scores:map[binpack:0.9646746326438289 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NormScore:0.9646746326438289 Scores:map[binpack:0.9646746326438289 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:d67666ba-268a-d44e-af22-23d0a4e5963e NormScore:0.9627466772918187 Scores:map[binpack:0.9627466772918187 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:c8b3d968-fdd6-d09d-b6de-77d2c6fe6b9c NormScore:0.9627466772918187 Scores:map[binpack:0.9627466772918187 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]]] Scores:] ModifyIndex:649117 ModifyTime:1.7141313247122476e+18 Name:10-064a2c21-03c1-11ef-b832-fa163e7afdf8.default-group[0] Namespace:poseidon NetworkStatus:map[Address: DNS: InterfaceName:] NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NodeName:nomad-agent-terraform-4 Resources:map[CPU:20 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:] SharedResources:map[CPU:0 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:0 MemoryMaxMB:0 NUMA: Networks:] 
SignedIdentities:map[default-task:eyJhbGciOiJSUzI1NiIsImtpZCI6IjNmMzA5OWFlLTZlN2QtZTc2Ni1lOGZiLWI2ZWViNWMxYTFiNSIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJub21hZHByb2plY3QuaW8iLCJpYXQiOjE3MTQxMzEzMDQsImp0aSI6ImJlOTc4MjVlLTVjMGQtMTNkNi01ZGUzLThjMzkyYzg3MDdhZiIsIm5iZiI6MTcxNDEzMTMwNCwibm9tYWRfYWxsb2NhdGlvbl9pZCI6IjgyMmI4N2EzLWUxZjUtMmY0NS1mNzUxLTk0NTMyNzdiMTJlZiIsIm5vbWFkX2pvYl9pZCI6IjEwLTA2NGEyYzIxLTAzYzEtMTFlZi1iODMyLWZhMTYzZTdhZmRmOCIsIm5vbWFkX25hbWVzcGFjZSI6InBvc2VpZG9uIiwibm9tYWRfdGFzayI6ImRlZmF1bHQtdGFzayIsInN1YiI6Imdsb2JhbDpwb3NlaWRvbjoxMC0wNjRhMmMyMS0wM2MxLTExZWYtYjgzMi1mYTE2M2U3YWZkZjg6ZGVmYXVsdC1ncm91cDpkZWZhdWx0LXRhc2s6ZGVmYXVsdCJ9.2meiZCVDIcINngbcIBUcvGrkk_ltpMWk1GqSQF1AMW46tqveBV4rSN4iYzEjZaaM34rxzbzyNYZ4WDPhQSLXRzZwWpUrk8niNeasXL2AWR5o7nJqJwNYip_yVcaoh90SwSEerdKqI8lQzR452Gvzi773aRlAO1IL8t9ZfC4pfHty-qxREhdKFukLe3ugMDs84u1-j3fLWTHwmv-1el_Wki363Uv0SeBoCDefef5Nac58DP30hQ_tgd7XtF-ea6ElufuzJmOUsqgh6ZiU9fbiMBy4Z7sqMgR-2NgcTIPdKbK7Dp-YiUYyu-ugVhpnRvJ9tOem5xkux4wmcLVwN5BCEQ] SigningKeyID:3f3099ae-6e7d-e766-e8fb-b6eeb5c1a1b5 TaskGroup:default-group TaskResources:map[default-task:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:]] TaskStates:map[default-task:map[Events:[map[Details:map[] DiskLimit:0 DisplayMessage:Task received by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313042685432e+18 Type:Received ValidationError: VaultError:] map[Details:map[message:Building Task Directory] DiskLimit:0 DisplayMessage:Building Task Directory DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message:Building Task Directory RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313042715858e+18 Type:Task Setup ValidationError: VaultError:] map[Details:map[] DiskLimit:0 DisplayMessage:Task started by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313046385912e+18 Type:Started ValidationError: VaultError:]] Failed:false FinishedAt: LastRestart: Restarts:0 StartedAt:2024-04-26T11:35:04.638637612Z State:running TaskHandle:]]]],payload,822b87a3-e1f5-2f45-f751-9453277b12ef,11:35:24.839428504,Allocation,PlanResult ,,0,"map[Allocation:map[AllocModifyIndex:649117 AllocatedResources:map[Shared:map[DiskMB:10 Networks: Ports:] TaskLifecycles:map[default-task:] Tasks:map[default-task:map[Cpu:map[CpuShares:20 ReservedCores:] Devices: Memory:map[MemoryMB:30 MemoryMaxMB:512] Networks:]]] ClientDescription:All tasks have completed ClientStatus:complete CreateIndex:649002 CreateTime:1.7141313042584087e+18 DesiredDescription:alloc is being migrated DesiredStatus:stop DesiredTransition:map[ForceReschedule: Migrate:true NoShutdownDelay: Reschedule:] EvalID:df86ab28-d17b-44b9-eefd-242e7d269468 ID:822b87a3-e1f5-2f45-f751-9453277b12ef JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 Metrics:map[AllocationTime:433166 ClassExhausted: ClassFiltered: CoalescedFailures:0 ConstraintFiltered: DimensionExhausted: NodesAvailable:map[dc1:4] NodesEvaluated:4 NodesExhausted:0 NodesFiltered:0 NodesInPool:4 QuotaExhausted: ResourcesExhausted: ScoreMetaData:[map[NodeID:ecb75941-e320-1893-4fec-0cd84d19944a 
NormScore:0.9646746326438289 Scores:map[binpack:0.9646746326438289 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NormScore:0.9646746326438289 Scores:map[binpack:0.9646746326438289 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:d67666ba-268a-d44e-af22-23d0a4e5963e NormScore:0.9627466772918187 Scores:map[binpack:0.9627466772918187 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:c8b3d968-fdd6-d09d-b6de-77d2c6fe6b9c NormScore:0.9627466772918187 Scores:map[binpack:0.9627466772918187 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]]] Scores:] ModifyIndex:649316 ModifyTime:1.714131325046272e+18 Name:10-064a2c21-03c1-11ef-b832-fa163e7afdf8.default-group[0] Namespace:poseidon NetworkStatus:map[Address: DNS: InterfaceName:] NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NodeName:nomad-agent-terraform-4 Resources:map[CPU:20 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:] SharedResources:map[CPU:0 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:0 MemoryMaxMB:0 NUMA: Networks:] SignedIdentities:map[default-task:eyJhbGciOiJSUzI1NiIsImtpZCI6IjNmMzA5OWFlLTZlN2QtZTc2Ni1lOGZiLWI2ZWViNWMxYTFiNSIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJub21hZHByb2plY3QuaW8iLCJpYXQiOjE3MTQxMzEzMDQsImp0aSI6ImJlOTc4MjVlLTVjMGQtMTNkNi01ZGUzLThjMzkyYzg3MDdhZiIsIm5iZiI6MTcxNDEzMTMwNCwibm9tYWRfYWxsb2NhdGlvbl9pZCI6IjgyMmI4N2EzLWUxZjUtMmY0NS1mNzUxLTk0NTMyNzdiMTJlZiIsIm5vbWFkX2pvYl9pZCI6IjEwLTA2NGEyYzIxLTAzYzEtMTFlZi1iODMyLWZhMTYzZTdhZmRmOCIsIm5vbWFkX25hbWVzcGFjZSI6InBvc2VpZG9uIiwibm9tYWRfdGFzayI6ImRlZmF1bHQtdGFzayIsInN1YiI6Imdsb2JhbDpwb3NlaWRvbjoxMC0wNjRhMmMyMS0wM2MxLTExZWYtYjgzMi1mYTE2M2U3YWZkZjg6ZGVmYXVsdC1ncm91cDpkZWZhdWx0LXRhc2s6ZGVmYXVsdCJ9.2meiZCVDIcINngbcIBUcvGrkk_ltpMWk1GqSQF1AMW46tqveBV4rSN4iYzEjZaaM34rxzbzyNYZ4WDPhQSLXRzZwWpUrk8niNeasXL2AWR5o7nJqJwNYip_yVcaoh90SwSEerdKqI8lQzR452Gvzi773aRlAO1IL8t9ZfC4pfHty-qxREhdKFukLe3ugMDs84u1-j3fLWTHwmv-1el_Wki363Uv0SeBoCDefef5Nac58DP30hQ_tgd7XtF-ea6ElufuzJmOUsqgh6ZiU9fbiMBy4Z7sqMgR-2NgcTIPdKbK7Dp-YiUYyu-ugVhpnRvJ9tOem5xkux4wmcLVwN5BCEQ] SigningKeyID:3f3099ae-6e7d-e766-e8fb-b6eeb5c1a1b5 TaskGroup:default-group TaskResources:map[default-task:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:512 NUMA: Networks:]] TaskStates:map[default-task:map[Events:[map[Details:map[] DiskLimit:0 DisplayMessage:Task received by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313042685432e+18 Type:Received ValidationError: VaultError:] map[Details:map[message:Building Task Directory] DiskLimit:0 DisplayMessage:Building Task Directory DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message:Building Task Directory RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313042715858e+18 Type:Task Setup ValidationError: VaultError:] map[Details:map[] DiskLimit:0 DisplayMessage:Task started by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313046385912e+18 Type:Started ValidationError: VaultError:] map[Details:map[kill_timeout:5s] DiskLimit:0 
DisplayMessage:Sent interrupt. Waiting 5s before force killing DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:5e+09 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313247591928e+18 Type:Killing ValidationError: VaultError:] map[Details:map[exit_code:137 exit_message:Docker container exited with non-zero exit code: 137 oom_killed:false signal:0] DiskLimit:0 DisplayMessage:Exit Code: 137, Exit Message: ""Docker container exited with non-zero exit code: 137"" DownloadError: DriverError: DriverMessage: ExitCode:137 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message:Docker container exited with non-zero exit code: 137 RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313249188577e+18 Type:Terminated ValidationError: VaultError:] map[Details:map[] DiskLimit:0 DisplayMessage:Task successfully killed DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7141313249453348e+18 Type:Killed ValidationError: VaultError:]] Failed:false FinishedAt:2024-04-26T11:35:24.954626496Z LastRestart: Restarts:0 StartedAt:2024-04-26T11:35:04.638637612Z State:dead TaskHandle:]]]]",payload,822b87a3-e1f5-2f45-f751-9453277b12ef,11:35:25.134740675,Allocation,AllocationUpdated ``` The evaluations result in no further insights `topic: Evaluation` ```log # Starting at 11:35:04 ,,0,map[Evaluation:map[CreateIndex:649000 CreateTime:1.7141313040896648e+18 ID:df86ab28-d17b-44b9-eefd-242e7d269468 JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 JobModifyIndex:649000 ModifyIndex:649000 ModifyTime:1.7141313040896648e+18 Namespace:poseidon Priority:50 Status:pending TriggeredBy:job-register Type:batch]],payload,df86ab28-d17b-44b9-eefd-242e7d269468,11:35:04.188074894,Evaluation,JobRegistered ,,0,map[Evaluation:map[CreateIndex:649000 CreateTime:1.7141313040896648e+18 ID:df86ab28-d17b-44b9-eefd-242e7d269468 JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 JobModifyIndex:649000 ModifyIndex:649002 ModifyTime:1.7141313040896648e+18 Namespace:poseidon Priority:50 Status:pending TriggeredBy:job-register Type:batch]],payload,df86ab28-d17b-44b9-eefd-242e7d269468,11:35:04.284611749,Evaluation,PlanResult ,,0,map[Evaluation:map[CreateIndex:649000 CreateTime:1.7141313040896648e+18 ID:df86ab28-d17b-44b9-eefd-242e7d269468 JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 JobModifyIndex:649000 ModifyIndex:649004 ModifyTime:1.7141313042654973e+18 Namespace:poseidon Priority:50 QueuedAllocations:map[default-group:0] SnapshotIndex:649001 Status:complete TriggeredBy:job-register Type:batch]],payload,df86ab28-d17b-44b9-eefd-242e7d269468,11:35:04.303814000,Evaluation,EvaluationUpdated ,,0,map[Evaluation:map[CreateIndex:649057 CreateTime:1.714131314787942e+18 ID:1b9ae048-51eb-a762-fcc2-cc93ed8f743f JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 ModifyIndex:649057 ModifyTime:1.714131314787942e+18 Namespace:poseidon NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NodeModifyIndex:649056 Priority:50 Status:pending TriggeredBy:node-update Type:batch]],payload,1b9ae048-51eb-a762-fcc2-cc93ed8f743f,11:35:14.804149546,Evaluation,EvaluationUpdated ,,0,map[Evaluation:map[CreateIndex:649057 CreateTime:1.714131314787942e+18 ID:1b9ae048-51eb-a762-fcc2-cc93ed8f743f 
JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 ModifyIndex:649068 ModifyTime:1.7141313148146678e+18 Namespace:poseidon NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NodeModifyIndex:649056 Priority:50 QueuedAllocations:map[config:0 default-group:0] SnapshotIndex:649062 Status:complete TriggeredBy:node-update Type:batch]],payload,1b9ae048-51eb-a762-fcc2-cc93ed8f743f,11:35:14.826685460,Evaluation,EvaluationUpdated # Shutting down allocation ,,0,map[Evaluation:map[CreateIndex:649114 CreateTime:1.7141313246959962e+18 ID:c701780c-a6ab-cf72-f342-7ef8e3d2f201 JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 ModifyIndex:649114 ModifyTime:1.7141313246959962e+18 Namespace:poseidon Priority:50 Status:pending TriggeredBy:node-drain Type:batch]],payload,c701780c-a6ab-cf72-f342-7ef8e3d2f201,11:35:24.838629562,Evaluation,AllocationUpdateDesiredStatus ,,0,map[Evaluation:map[CreateIndex:649116 CreateTime:1.7141313247089715e+18 FailedTGAllocs:map[default-group:map[AllocationTime:18595 ClassExhausted: ClassFiltered: CoalescedFailures:0 ConstraintFiltered: DimensionExhausted: NodesAvailable:map[] NodesEvaluated:0 NodesExhausted:0 NodesFiltered:0 NodesInPool:0 QuotaExhausted: ResourcesExhausted: ScoreMetaData: Scores:]] ID:2e4a2db6-77d1-d2ae-1c15-64a5d382897d JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 ModifyIndex:649116 ModifyTime:1.7141313247089715e+18 Namespace:poseidon PreviousEval:c701780c-a6ab-cf72-f342-7ef8e3d2f201 Priority:50 SnapshotIndex:649114 Status:blocked StatusDescription:created to place remaining allocations TriggeredBy:queued-allocs Type:batch]],payload,2e4a2db6-77d1-d2ae-1c15-64a5d382897d,11:35:24.839045804,Evaluation,EvaluationUpdated ,,0,map[Evaluation:map[CreateIndex:649114 CreateTime:1.7141313246959962e+18 ID:c701780c-a6ab-cf72-f342-7ef8e3d2f201 JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 ModifyIndex:649117 ModifyTime:1.7141313246959962e+18 Namespace:poseidon Priority:50 Status:pending TriggeredBy:node-drain Type:batch]],payload,c701780c-a6ab-cf72-f342-7ef8e3d2f201,11:35:24.839238591,Evaluation,PlanResult ,,0,map[Evaluation:map[BlockedEval:2e4a2db6-77d1-d2ae-1c15-64a5d382897d CreateIndex:649114 CreateTime:1.7141313246959962e+18 FailedTGAllocs:map[default-group:map[AllocationTime:18595 ClassExhausted: ClassFiltered: CoalescedFailures:0 ConstraintFiltered: DimensionExhausted: NodesAvailable:map[] NodesEvaluated:0 NodesExhausted:0 NodesFiltered:0 NodesInPool:0 QuotaExhausted: ResourcesExhausted: ScoreMetaData: Scores:]] ID:c701780c-a6ab-cf72-f342-7ef8e3d2f201 JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 ModifyIndex:649118 ModifyTime:1.7141313247157635e+18 Namespace:poseidon Priority:50 QueuedAllocations:map[default-group:1] SnapshotIndex:649114 Status:complete TriggeredBy:node-drain Type:batch]],payload,c701780c-a6ab-cf72-f342-7ef8e3d2f201,11:35:24.840384576,Evaluation,EvaluationUpdated ,,0,map[Evaluation:map[CreateIndex:649116 CreateTime:1.7141313247089715e+18 ID:2e4a2db6-77d1-d2ae-1c15-64a5d382897d JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 ModifyIndex:649325 ModifyTime:1.71413132512915e+18 Namespace:poseidon PreviousEval:c701780c-a6ab-cf72-f342-7ef8e3d2f201 Priority:50 QueuedAllocations:map[config:0 default-group:0] SnapshotIndex:649320 Status:complete TriggeredBy:queued-allocs Type:batch]],payload,2e4a2db6-77d1-d2ae-1c15-64a5d382897d,11:35:25.146751488,Evaluation,EvaluationUpdated ,,0,map[Evaluation:map[CreateIndex:649353 CreateTime:1.7141313253734008e+18 ID:82d284e3-5135-fc69-da63-6db29714dbee JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 ModifyIndex:649353 
ModifyTime:1.7141313253734008e+18 Namespace:poseidon NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NodeModifyIndex:649348 Priority:50 Status:pending TriggeredBy:node-update Type:batch]],payload,82d284e3-5135-fc69-da63-6db29714dbee,11:35:25.395744363,Evaluation,EvaluationUpdated ,,0,map[Evaluation:map[CreateIndex:649353 CreateTime:1.7141313253734008e+18 ID:82d284e3-5135-fc69-da63-6db29714dbee JobID:10-064a2c21-03c1-11ef-b832-fa163e7afdf8 ModifyIndex:649522 ModifyTime:1.7141313258614894e+18 Namespace:poseidon NodeID:3104955c-9e48-c24f-0c13-1c7a6e7e8365 NodeModifyIndex:649348 Priority:50 QueuedAllocations:map[config:0 default-group:0] SnapshotIndex:649511 Status:complete TriggeredBy:node-update Type:batch]],payload,82d284e3-5135-fc69-da63-6db29714dbee,11:35:25.877973859,Evaluation,EvaluationUpdated ```

We have two options:

  1. Try to handle Nomad failures in Poseidon: Nomad notifies Poseidon that the Job is dead (and won't be restarted anymore). Poseidon might react to this by cleaning up and requesting another runner (we have to guard against infinite loops for broken configurations here; see the sketch below this list).
  2. Investigate the Nomad failure: So far, we only know that Nomad failed to restart the job. The snapshot of the Nomad servers (and Nomad agent 4) might hold further clues about this failure.
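
A minimal sketch (in Go) of what option 1 could look like; `handleDeadRunnerJob`, the `createRunner` callback, and the retry limit are hypothetical illustrations, not existing Poseidon code:

```go
// Hypothetical handler for a terminal Nomad job event (option 1): request a
// replacement runner, but cap the attempts so that a broken configuration
// (e.g. an invalid image) cannot cause an infinite loop.
package main

import (
	"errors"
	"fmt"
	"time"
)

const maxReplacementAttempts = 3 // assumed limit, not an existing Poseidon setting

func handleDeadRunnerJob(environmentID string, createRunner func(string) error) error {
	var lastErr error
	for attempt := 0; attempt < maxReplacementAttempts; attempt++ {
		if lastErr = createRunner(environmentID); lastErr == nil {
			return nil
		}
		// Back off exponentially between attempts: 1s, 2s, 4s, ...
		time.Sleep(time.Duration(1<<attempt) * time.Second)
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxReplacementAttempts, lastErr)
}

func main() {
	err := handleDeadRunnerJob("10", func(id string) error {
		return errors.New("image not found") // simulate a persistent misconfiguration
	})
	fmt.Println(err)
}
```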
MrSerth commented 1 month ago

Thanks! I would be glad to learn more about the "unknown" errors you've identified (regarding the poseidon_nomad_idle_runners). And regarding the poseidon_nomad_events, reading the agent's log files will hopefully clarify the situation further. Could it be that we had some issue with the Docker daemon (or the secure-bridge, or similar)? If I remember correctly, this was the deployment we also used to reconfigure some Docker daemon / secure-bridge defaults...

Also, I was wondering whether we should increase the maximum number of reattempts from 3 to something higher? It would be a pity if the problem could have been resolved by retrying more often (maybe with an increasing delay between attempts).

mpass99 commented 1 month ago

Also, I was wondering whether we should increase the maximum number of reattempts from 3 to something higher?

Thanks for shifting the focus to the reattempts!

Our current configuration is

https://github.com/openHPI/poseidon/blob/342b937695e023f28d8c7713e7df51a096c55f07/internal/environment/template-environment-job.hcl#L25-L31

The following blocks result when all relevant parameters (including the defaults) are written out explicitly:

    restart {
      attempts = 3
      delay    = "0s"
      interval = "24h"
      mode     = "fail"
    }

    reschedule {
      unlimited      = true
      attempts       = 0
      interval       = "24h"
      delay          = "5s"
      delay_function = "constant"
    }
Restart and Reschedule explained

The Nomad Documentation contains detailed explanations of the [restart](https://developer.hashicorp.com/nomad/docs/job-specification/restart) and [reschedule](https://developer.hashicorp.com/nomad/docs/job-specification/reschedule) configuration.

`The restart block configures a task's behavior on task failure. Restarts happen on the client that is running the task.` It restarts the task up to `attempts` times within the `interval`, waiting `delay` before each restart. If the task still fails, the `mode` decides what happens with it: it either `fail`s or is `delay`ed further. (We should not go with the latter because in our scenario it is quite possible that, e.g., wrong images are provided.)

`The reschedule block specifies the group's rescheduling strategy. Nomad will attempt to schedule the allocation on another node if any of its task statuses become failed.` It reschedules the allocation up to `attempts` times (likely on a different node) within the `interval`, waiting before each attempt a `delay` that is shaped by the `delay_function` and capped by `max_delay`. The `unlimited` flag `enables unlimited reschedule attempts`.
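
A rough mental model of how the two blocks interact for a persistently failing task, based on our reading of the documentation (not Nomad's implementation); the helper below is purely illustrative:

```go
// Mental model: per placement, the task is started once plus up to
// restart.attempts restarts; with mode = "fail", a failed placement may then
// be rescheduled, up to reschedule.attempts times (if not unlimited).
package main

import "fmt"

func totalStarts(restartAttempts, rescheduleAttempts int) int {
	placements := 1 + rescheduleAttempts      // initial placement + reschedules
	startsPerPlacement := 1 + restartAttempts // initial start + restarts
	return placements * startsPerPlacement
}

func main() {
	// Current configuration: restart.attempts = 3, reschedule.attempts = 0
	// (but unlimited = true, which is exactly the ambiguity discussed below).
	fmt.Println(totalStarts(3, 0)) // 4
	// Improved configuration proposed below: 3 restarts, 3 reschedules.
	fmt.Println(totalStarts(3, 3)) // 16
}
```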

Investigate Nomad failure

Nomad agent 4 started shutting down at 11:35:08. The cluster leader's log shows that the four agents became eligible again between 11:35:25.340 and 11:35:25.363.

Details

```log
Apr 26 11:35:25 nomad-server-terraform-5 nomad[11624]: 2024-04-26T11:35:25.340Z [INFO] nomad.client: node transitioning to eligible state: node_id=c8b3d968-fdd6-d09d-b6de-77d2c6fe6b9c
Apr 26 11:35:25 nomad-server-terraform-5 nomad[11624]: 2024-04-26T11:35:25.341Z [INFO] nomad.client: node transitioning to eligible state: node_id=d67666ba-268a-d44e-af22-23d0a4e5963e
Apr 26 11:35:25 nomad-server-terraform-5 nomad[11624]: 2024-04-26T11:35:25.344Z [INFO] nomad.client: node transitioning to eligible state: node_id=3104955c-9e48-c24f-0c13-1c7a6e7e8365
Apr 26 11:35:25 nomad-server-terraform-5 nomad[11624]: 2024-04-26T11:35:25.363Z [INFO] nomad.client: node transitioning to eligible state: node_id=ecb75941-e320-1893-4fec-0cd84d19944a
```

Working theory

At 11:35:08 the deployment caused the Nomad agent to start shutting down. At 11:35:24.77 the allocation was stopped as part of the shutdown process (to be migrated). Because we configured a restart delay of 0s, Nomad immediately tried to restart the job (two times) until the configured restart attempts were reached. These restart attempts failed immediately because the Nomad agent was not eligible.
The job was not rescheduled because `attempts = 0` contradicts `unlimited = true` and apparently takes precedence (even though the documentation claims the opposite).

Should we verify any statement of this theory?

Improvement

For the restart configuration, we should set the delay to (at least) 1s because in this case, even this delay would bridge the time between the restart being attempted and the Nomad agent becoming eligible again. We might reduce the interval to 1h, as all errors we know of (invalid image, deployment issues) exhaust the attempts within 1h. If we keep it larger, two errors might occur within the same interval and influence the behavior. We should keep mode set to fail.

    restart {
      attempts = 3
      delay    = "1s"
      interval = "1h"
      mode     = "fail"
    }

For the reschedule configuration (which takes effect after the restart handling has failed), we do not want to reschedule unlimited, because our application design is not error-proof (e.g., a wrong image may be configured) and this would result in an infinite loop. We should keep attempts and interval at an appropriate size (such as 3 and 24h). As you suggested, we should change the delay_function to exponential. With a delay starting at 1m, we can bridge a failure time of up to 7 minutes while still focusing on high availability of the jobs.

    reschedule {
      unlimited      = false
      attempts       = 3
      interval       = "24h"
      delay          = "1m"
      delay_function = "exponential"
    }
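
A quick sanity check of the "up to 7 minutes" figure, assuming the exponential delay function simply doubles the base delay on each further attempt (1m, 2m, 4m); this is a standalone calculation, not Poseidon code:

```go
// Sums the reschedule delays for the proposed configuration, assuming the
// exponential delay_function doubles the base delay on each further attempt.
package main

import (
	"fmt"
	"time"
)

func main() {
	baseDelay := time.Minute
	attempts := 3
	var total time.Duration
	for i := 0; i < attempts; i++ {
		total += baseDelay << i // 1m, 2m, 4m
	}
	fmt.Println(total) // 7m0s
}
```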

How does this evaluation sound to you?

MrSerth commented 1 month ago

Working theory

Should we verify any statement of this theory?

Yes, I'd say let's verify the impact and interplay of attempts = 0 and unlimited = true to answer the question: "Are we currently rescheduling at all (unlimited) or not (attempts)?"

Improvement

For the restart configuration, we should set the delay to (at least) 1s because in this case, even this delay would bridge the time between the restart being attempted and the Nomad agent becoming eligible again.

Why is this delay needed? I thought that an ineligible agent cannot restart at all (since a restart is performed on the agent itself)? But I am fine with specifying some reasonable delay. How would Poseidon behave if we choose something like '15s' (regarding the state of the failed allocation, and potential "rescheduling" / error handling through Poseidon)?

▶️ We assume that the error described in this ticket is caused by Nomad restarting an allocation despite the agent being ineligible (for example, as scheduled by the Nomad server). As a Nomad agent takes about 15 seconds to shut down (with our current config), and a short moment to start, we want to bridge these ~20 seconds. Let's go with 15s.
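
Back-of-the-envelope timing behind that choice, assuming each failed restart attempt waits the full 15s delay and the agent is unavailable for roughly 20 seconds; both numbers are the assumptions from above, not measurements:

```go
// Estimates when the restart attempts would fire relative to the allocation
// being killed, assuming each attempt waits the full 15s delay and fails
// instantly while the agent is still ineligible.
package main

import (
	"fmt"
	"time"
)

func main() {
	delay := 15 * time.Second
	attempts := 3
	ineligibleWindow := 20 * time.Second // ~15s shutdown + a short startup
	for i := 1; i <= attempts; i++ {
		t := time.Duration(i) * delay
		fmt.Printf("attempt %d at ~%v, agent eligible again: %v\n", i, t, t > ineligibleWindow)
	}
}
```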

We might reduce the interval to 1h, as all errors we know of (invalid image, deployment issues) exhaust the attempts within 1h.

That's true, we can do that. However, I would say it won't make a big difference, since we usually fail all three restart attempts within seconds anyway, don't we?

If we keep it larger, two errors might occur within the same interval and influence the behavior.

Okay, for example with two deployments within 24 hours? Yes, then let's go with 1h (which also marks my previous comment as resolved). :+1:

We should keep mode set to fail.

:+1:

For the reschedule configuration (which takes effect after the restart handling has failed), we do not want to reschedule unlimited, because our application design is not error-proof (e.g., a wrong image may be configured) and this would result in an infinite loop.

Okay, fine for me.

We should keep attempts and interval at an appropriate size (such as 3 and 24h).

This ratio still looks somewhat "conservative" to me, but maybe I need to rethink it.

▶️ We discussed the various options again. By design, the interval of the rescheduling should be larger than the interval of the restart. This way, our "error escalation" works as expected by first trying to restart on the same agent before restarting on another node. We currently assume that the restart block is re-evaluated on each new (re)scheduling, thus multiplying reschedule.attempts with restart.interval. Consequently, we should select a value larger than 3h. For now, we will go with 6h. After the rescheduling has failed, Poseidon will handle further retries (through the prewarming pool alert).

As you suggested, we should change the delay_function to exponential.

:+1:

With a delay starting at 1m, we can bridge a failure time of up to 7 minutes while still focusing on high availability of the jobs.

👍 Does this knowledge influence the attempts and interval settings from above (i.e., lowering the interval)?

✅ (as resolved with the previous comment)

mpass99 commented 1 month ago

Are we currently rescheduling at all (unlimited) or not (attempts)?

When testing with an invalid image specifier, we see that the Job is restarted three times per node and is rescheduled infinitely, rotating through the nodes.

Updated Working Theory

At `11:35:08` the deployment caused the Nomad agent to start shutting down. At `11:35:24.77` the allocation was stopped as part of the shutdown process (to be migrated). When the Nomad agent drained, the Job stopped being "migrated". Nomad directly checked the other nodes for possible rescheduling, but none were eligible. Therefore, it deregistered the job. Validation: We can manually set the eligibility in the Nomad UI. When we schedule a job, set the other Nomad agents to ineligible, and drain the remaining agent, we can reproduce similar Nomad events.

The issue with this situation is that the job is always lost, independently of the restart and reschedule configuration. Therefore, we might transform this investigative issue into two action-point issues:

  1. Fix the restart and reschedule configuration so that jobs are not rescheduled infinitely (e.g. due to an invalid image specifier).
  2. Catch the Nomad event that notifies Poseidon that a job is lost and will neither be restarted nor rescheduled, and deal with it (log a Sentry warning or request a new runner).
MrSerth commented 1 month ago

Thanks for testing and updating the working theory.

  1. Yes, go with that :+1:
  2. Catching the Nomad event in Poseidon sounds good, and I would probably suggest requesting a new runner first (ideally, a couple of times with exponential wait times) before giving up and reporting to Sentry. Regarding the implementation: You could store the number of failed requests in the metadata of the Nomad job to track failed jobs without persistence in Poseidon.
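
A minimal sketch of that idea, assuming a hypothetical metadata key; reading and writing the counter through the actual Nomad job (re-)registration is omitted here:

```go
// Keeps a failure counter in the Nomad job's Meta map so that it survives
// Poseidon restarts. The key name and helper are illustrative only.
package main

import (
	"fmt"
	"strconv"
)

const failedReplacementsKey = "poseidon_failed_replacements" // assumed key

// incrementFailureCount bumps the counter stored in the job metadata and
// returns the new value; a missing key counts as zero.
func incrementFailureCount(meta map[string]string) int {
	count, _ := strconv.Atoi(meta[failedReplacementsKey])
	count++
	meta[failedReplacementsKey] = strconv.Itoa(count)
	return count
}

func main() {
	meta := map[string]string{}              // stands in for the job's Meta map
	fmt.Println(incrementFailureCount(meta)) // 1
	fmt.Println(incrementFailureCount(meta)) // 2
}
```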
mpass99 commented 1 month ago

I would probably suggest requesting a new runner first

:+1:

Regarding the implementation: You could store the number of failed requests in the metadata of the Nomad job to track failed jobs without persistence in Poseidon.

Yeah, persistence is a point here, as Poseidon itself is also restarted during the deployment. As far as I can tell, the Nomad Job is completely deleted in the erroneous case, which would also delete the metadata about the number of restarts.
If Poseidon restarts once, the lost restart counter is not that important; if Poseidon restarts multiple times, we have other problems.
Also, after a replacement runner is successfully placed, the information would immediately become invalid, as it should not be counted towards the next deployment issue.

MrSerth commented 1 month ago

Okay, fine. Your argumentation makes sense :+1: