openHPI / poseidon

Scalable task execution orchestrator for CodeOcean
MIT License

Investigate leaking allocation storage data #615

Open mpass99 opened 1 month ago

mpass99 commented 1 month ago

On the 12th, we saw 16922 objects in the `nomad_allocations` storage.


The case of `29-33eaa850-28a3-11ef-920d-fa163efe023e` is one example of a runner that was added to this storage but never removed. The runner was used multiple times by a user and then destroyed once the inactivity timer expired:

`time="2024-06-12T11:08:04.171686Z" level=debug msg="Destroying Runner" destroy_reason="runner inactivity timeout exceeded" package=runner runner_id=29-33eaa850-28a3-11ef-920d-fa163efe023e`

The Nomad Allocation events, however, contain no hint that the allocation was removed.

InfluxDB Allocation Events

```log 2024-06-12T10:04:49.35585745Z,map[Allocation:map[AllocModifyIndex:769789 AllocatedResources:map[Shared:map[DiskMB:10 Networks: Ports:] TaskLifecycles:map[default-task:] Tasks:map[default-task:map[Cpu:map[CpuShares:20 ReservedCores:] Devices: Memory:map[MemoryMB:30 MemoryMaxMB:256] Networks:]]] ClientStatus:pending CreateIndex:769789 CreateTime:1.718186688846078e+18 DesiredStatus:run EvalID:a4459036-4a27-2a99-94fd-19291c1013f1 ID:54750d38-7bb8-978c-1f0a-1ca64f1c70b4 JobID:29-33eaa850-28a3-11ef-920d-fa163efe023e Metrics:map[AllocationTime:542984 ClassExhausted: ClassFiltered: CoalescedFailures:0 ConstraintFiltered: DimensionExhausted: NodesAvailable:map[dc1:2] NodesEvaluated:2 NodesExhausted:0 NodesFiltered:0 NodesInPool:2 QuotaExhausted: ResourcesExhausted: ScoreMetaData:[map[NodeID:919ea902-e1f4-de7f-0743-2deeabb93628 NormScore:0.9286548337276099 Scores:map[binpack:0.9286548337276099 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:cb04341c-ea7d-5300-1a40-356801c6c1e8 NormScore:0.9267940453289819 Scores:map[binpack:0.9267940453289819 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]]] Scores:] ModifyIndex:769789 ModifyTime:1.718186688846078e+18 Name:29-33eaa850-28a3-11ef-920d-fa163efe023e.default-group[0] Namespace:poseidon NodeID:919ea902-e1f4-de7f-0743-2deeabb93628 NodeName:nomad-agent-terraform-4 Resources:map[CPU:20 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA: Networks:] SharedResources:map[CPU:0 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:0 MemoryMaxMB:0 NUMA: Networks:] 
SignedIdentities:map[default-task:eyJhbGciOiJSUzI1NiIsImtpZCI6IjE5OTNjNDcxLTQ3ZWQtMDlhZS1kMDI0LWQ2NTc4NzNiOThlNSIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJub21hZHByb2plY3QuaW8iLCJpYXQiOjE3MTgxODY2ODgsImp0aSI6IjJlYWEwMzYxLTQyNDMtZjUxNS1mODFhLTNkZmQ5MWE3OWJhOCIsIm5iZiI6MTcxODE4NjY4OCwibm9tYWRfYWxsb2NhdGlvbl9pZCI6IjU0NzUwZDM4LTdiYjgtOTc4Yy0xZjBhLTFjYTY0ZjFjNzBiNCIsIm5vbWFkX2pvYl9pZCI6IjI5LTMzZWFhODUwLTI4YTMtMTFlZi05MjBkLWZhMTYzZWZlMDIzZSIsIm5vbWFkX25hbWVzcGFjZSI6InBvc2VpZG9uIiwibm9tYWRfdGFzayI6ImRlZmF1bHQtdGFzayIsInN1YiI6Imdsb2JhbDpwb3NlaWRvbjoyOS0zM2VhYTg1MC0yOGEzLTExZWYtOTIwZC1mYTE2M2VmZTAyM2U6ZGVmYXVsdC1ncm91cDpkZWZhdWx0LXRhc2s6ZGVmYXVsdCJ9.MKlzE5IOKYNHkUo5CgRA-7OdZXPUc1hv9h3qlvzoHyG9sYElBn1vHJeqW7qoDdRuEdlESVMPGy3LpB06s0XyPiyYiHgVnyiECEihBjkiqkRfFR8rNJTj2jYC9vubNFda2dBzjzCAGTok9ZtK9eChOFd_YqHZ8NNXnbxMh-ljAhsz24aAb_TfI2CU2WtO3IlGpTqpygZyztUoU2gHwNJ9F17p5R2sIBujFyNeP_0IrRdv3P3KPIk_jfVQGdMZGeHnQLAKVzd692UMX1wRNnD-VERayYDGIOVVRV8_XGLeqsH9M1G7EluefopxuG31SQP16NYudLcoP33IGiOCpsur6g] SigningKeyID:1993c471-47ed-09ae-d024-d657873b98e5 TaskGroup:default-group TaskResources:map[default-task:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA: Networks:]]]],payload,poseidon_nomad_events,54750d38-7bb8-978c-1f0a-1ca64f1c70b4,production,10:04:48.852712685,Allocation,PlanResult 2024-06-12T10:04:50.35542309Z,map[Allocation:map[AllocModifyIndex:769789 AllocatedResources:map[Shared:map[DiskMB:10 Networks: Ports:] TaskLifecycles:map[default-task:] Tasks:map[default-task:map[Cpu:map[CpuShares:20 ReservedCores:] Devices: Memory:map[MemoryMB:30 MemoryMaxMB:256] Networks:]]] ClientDescription:Tasks are running ClientStatus:running CreateIndex:769789 CreateTime:1.718186688846078e+18 DesiredStatus:run EvalID:a4459036-4a27-2a99-94fd-19291c1013f1 ID:54750d38-7bb8-978c-1f0a-1ca64f1c70b4 JobID:29-33eaa850-28a3-11ef-920d-fa163efe023e Metrics:map[AllocationTime:542984 ClassExhausted: ClassFiltered: CoalescedFailures:0 ConstraintFiltered: DimensionExhausted: NodesAvailable:map[dc1:2] 
NodesEvaluated:2 NodesExhausted:0 NodesFiltered:0 NodesInPool:2 QuotaExhausted: ResourcesExhausted: ScoreMetaData:[map[NodeID:919ea902-e1f4-de7f-0743-2deeabb93628 NormScore:0.9286548337276099 Scores:map[binpack:0.9286548337276099 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:cb04341c-ea7d-5300-1a40-356801c6c1e8 NormScore:0.9267940453289819 Scores:map[binpack:0.9267940453289819 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]]] Scores:] ModifyIndex:769793 ModifyTime:1.7181866897147538e+18 Name:29-33eaa850-28a3-11ef-920d-fa163efe023e.default-group[0] Namespace:poseidon NetworkStatus:map[Address: DNS: InterfaceName:] NodeID:919ea902-e1f4-de7f-0743-2deeabb93628 NodeName:nomad-agent-terraform-4 Resources:map[CPU:20 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA: Networks:] SharedResources:map[CPU:0 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:0 MemoryMaxMB:0 NUMA: Networks:] SignedIdentities:map[default-task:eyJhbGciOiJSUzI1NiIsImtpZCI6IjE5OTNjNDcxLTQ3ZWQtMDlhZS1kMDI0LWQ2NTc4NzNiOThlNSIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJub21hZHByb2plY3QuaW8iLCJpYXQiOjE3MTgxODY2ODgsImp0aSI6IjJlYWEwMzYxLTQyNDMtZjUxNS1mODFhLTNkZmQ5MWE3OWJhOCIsIm5iZiI6MTcxODE4NjY4OCwibm9tYWRfYWxsb2NhdGlvbl9pZCI6IjU0NzUwZDM4LTdiYjgtOTc4Yy0xZjBhLTFjYTY0ZjFjNzBiNCIsIm5vbWFkX2pvYl9pZCI6IjI5LTMzZWFhODUwLTI4YTMtMTFlZi05MjBkLWZhMTYzZWZlMDIzZSIsIm5vbWFkX25hbWVzcGFjZSI6InBvc2VpZG9uIiwibm9tYWRfdGFzayI6ImRlZmF1bHQtdGFzayIsInN1YiI6Imdsb2JhbDpwb3NlaWRvbjoyOS0zM2VhYTg1MC0yOGEzLTExZWYtOTIwZC1mYTE2M2VmZTAyM2U6ZGVmYXVsdC1ncm91cDpkZWZhdWx0LXRhc2s6ZGVmYXVsdCJ9.MKlzE5IOKYNHkUo5CgRA-7OdZXPUc1hv9h3qlvzoHyG9sYElBn1vHJeqW7qoDdRuEdlESVMPGy3LpB06s0XyPiyYiHgVnyiECEihBjkiqkRfFR8rNJTj2jYC9vubNFda2dBzjzCAGTok9ZtK9eChOFd_YqHZ8NNXnbxMh-ljAhsz24aAb_TfI2CU2WtO3IlGpTqpygZyztUoU2gHwNJ9F17p5R2sIBujFyNeP_0IrRdv3P3KPIk_jfVQGdMZGeHnQLAKVzd692UMX1wRNnD-VERayYDGIOVVRV8_XGLeqsH9M1G7EluefopxuG31SQP16NYudLcoP33IGiOCpsur6g] SigningKeyID:1993c471-47ed-09ae-d024-d657873b98e5 TaskGroup:default-group 
TaskResources:map[default-task:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA: Networks:]] TaskStates:map[default-task:map[Events:[map[Details:map[] DiskLimit:0 DisplayMessage:Task received by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7181866888913754e+18 Type:Received ValidationError: VaultError:] map[Details:map[message:Building Task Directory] DiskLimit:0 DisplayMessage:Building Task Directory DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message:Building Task Directory RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7181866888980086e+18 Type:Task Setup ValidationError: VaultError:] map[Details:map[] DiskLimit:0 DisplayMessage:Task started by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7181866892420828e+18 Type:Started ValidationError: VaultError:]] Failed:false FinishedAt: LastRestart: Paused: Restarts:0 StartedAt:2024-06-12T10:04:49.242139458Z State:running TaskHandle:]]]],payload,poseidon_nomad_events,54750d38-7bb8-978c-1f0a-1ca64f1c70b4,production,10:04:49.770645469,Allocation,AllocationUpdated 2024-06-12T11:00:00.314133593Z,map[Allocation:map[AllocModifyIndex:770055 AllocatedResources:map[Shared:map[DiskMB:10 Networks: Ports:] TaskLifecycles:map[default-task:] Tasks:map[default-task:map[Cpu:map[CpuShares:20 ReservedCores:] Devices: Memory:map[MemoryMB:30 MemoryMaxMB:256] Networks:]]] ClientDescription:Tasks are running ClientStatus:running CreateIndex:769789 CreateTime:1.718186688846078e+18 DesiredStatus:run 
EvalID:c5cfd724-34a1-e396-c37d-6dd9545c0e36 ID:54750d38-7bb8-978c-1f0a-1ca64f1c70b4 JobID:29-33eaa850-28a3-11ef-920d-fa163efe023e Metrics:map[AllocationTime:542984 ClassExhausted: ClassFiltered: CoalescedFailures:0 ConstraintFiltered: DimensionExhausted: NodesAvailable:map[dc1:2] NodesEvaluated:2 NodesExhausted:0 NodesFiltered:0 NodesInPool:2 QuotaExhausted: ResourcesExhausted: ScoreMetaData:[map[NodeID:919ea902-e1f4-de7f-0743-2deeabb93628 NormScore:0.9286548337276099 Scores:map[binpack:0.9286548337276099 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:cb04341c-ea7d-5300-1a40-356801c6c1e8 NormScore:0.9267940453289819 Scores:map[binpack:0.9267940453289819 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]]] Scores:] ModifyIndex:770055 ModifyTime:1.718189999979358e+18 Name:29-33eaa850-28a3-11ef-920d-fa163efe023e.default-group[0] Namespace:poseidon NetworkStatus:map[Address: DNS: InterfaceName:] NodeID:919ea902-e1f4-de7f-0743-2deeabb93628 NodeName:nomad-agent-terraform-4 Resources:map[CPU:20 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA: Networks:] SharedResources:map[CPU:0 Cores:0 Devices: DiskMB:10 IOPS:0 MemoryMB:0 MemoryMaxMB:0 NUMA: Networks:] 
SignedIdentities:map[default-task:eyJhbGciOiJSUzI1NiIsImtpZCI6IjE5OTNjNDcxLTQ3ZWQtMDlhZS1kMDI0LWQ2NTc4NzNiOThlNSIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJub21hZHByb2plY3QuaW8iLCJpYXQiOjE3MTgxODY2ODgsImp0aSI6IjJlYWEwMzYxLTQyNDMtZjUxNS1mODFhLTNkZmQ5MWE3OWJhOCIsIm5iZiI6MTcxODE4NjY4OCwibm9tYWRfYWxsb2NhdGlvbl9pZCI6IjU0NzUwZDM4LTdiYjgtOTc4Yy0xZjBhLTFjYTY0ZjFjNzBiNCIsIm5vbWFkX2pvYl9pZCI6IjI5LTMzZWFhODUwLTI4YTMtMTFlZi05MjBkLWZhMTYzZWZlMDIzZSIsIm5vbWFkX25hbWVzcGFjZSI6InBvc2VpZG9uIiwibm9tYWRfdGFzayI6ImRlZmF1bHQtdGFzayIsInN1YiI6Imdsb2JhbDpwb3NlaWRvbjoyOS0zM2VhYTg1MC0yOGEzLTExZWYtOTIwZC1mYTE2M2VmZTAyM2U6ZGVmYXVsdC1ncm91cDpkZWZhdWx0LXRhc2s6ZGVmYXVsdCJ9.MKlzE5IOKYNHkUo5CgRA-7OdZXPUc1hv9h3qlvzoHyG9sYElBn1vHJeqW7qoDdRuEdlESVMPGy3LpB06s0XyPiyYiHgVnyiECEihBjkiqkRfFR8rNJTj2jYC9vubNFda2dBzjzCAGTok9ZtK9eChOFd_YqHZ8NNXnbxMh-ljAhsz24aAb_TfI2CU2WtO3IlGpTqpygZyztUoU2gHwNJ9F17p5R2sIBujFyNeP_0IrRdv3P3KPIk_jfVQGdMZGeHnQLAKVzd692UMX1wRNnD-VERayYDGIOVVRV8_XGLeqsH9M1G7EluefopxuG31SQP16NYudLcoP33IGiOCpsur6g] SigningKeyID:1993c471-47ed-09ae-d024-d657873b98e5 TaskGroup:default-group TaskResources:map[default-task:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA: Networks:]] TaskStates:map[default-task:map[Events:[map[Details:map[] DiskLimit:0 DisplayMessage:Task received by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7181866888913754e+18 Type:Received ValidationError: VaultError:] map[Details:map[message:Building Task Directory] DiskLimit:0 DisplayMessage:Building Task Directory DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message:Building Task Directory RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7181866888980086e+18 Type:Task Setup ValidationError: 
VaultError:] map[Details:map[] DiskLimit:0 DisplayMessage:Task started by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7181866892420828e+18 Type:Started ValidationError: VaultError:]] Failed:false FinishedAt: LastRestart: Paused: Restarts:0 StartedAt:2024-06-12T10:04:49.242139458Z State:running TaskHandle:]]]],payload,poseidon_nomad_events,54750d38-7bb8-978c-1f0a-1ca64f1c70b4,production,10:59:59.279810253,Allocation,PlanResult ```

Only the Job events contain the hint that the Job got deregistered.

InfluxDB Job events

```log 2024-06-12T10:04:49.35585745Z,map[Job:map[Affinities: AllAtOnce:false Constraints: ConsulNamespace: ConsulToken: CreateIndex:769787 Datacenters:[dc1] DispatchIdempotencyToken: Dispatched:false ID:29-33eaa850-28a3-11ef-920d-fa163efe023e JobModifyIndex:769787 Meta: ModifyIndex:769787 Multiregion: Name:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob: ParentID: Payload: Periodic: Priority:50 Region:global Spreads: Stable:false Status:pending StatusDescription: Stop:false SubmitTime:1.7181866887586696e+18 TaskGroups:[map[Affinities: Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 Disconnect: EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect: Meta: Migrate: Name:default-group Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:0 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:true] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:map[CreateIndex:769787 Enabled:true ID:1e4b91f1-7ea2-ea3d-f063-a7c3435d1d1e Max:300 Min:0 ModifyIndex:769787 Policy: Target:map[Group:default-group Job:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon] Type:horizontal] Services: ShutdownDelay: Spreads:[map[Attribute:${node.unique.name} SpreadTarget: Weight:100]] StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_python:3.8 network_mode:none] Constraints: Consul: DispatchPayload: Driver:docker Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:default-task Resources:map[CPU:20 
Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Schedule: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:] map[Affinities: Constraints: Consul:map[Cluster:default Namespace: Partition:] Count:0 Disconnect: EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect: Meta:map[used:false] Migrate: Name:config Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling: Services: ShutdownDelay: Spreads: StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[command:true] Constraints: Consul: DispatchPayload: Driver:exec Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:config Resources:map[CPU:1 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Schedule: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:]] Type:batch UI: Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 MaxParallel:0 MinHealthyTime:0 ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:0]],payload,poseidon_nomad_events,29-33eaa850-28a3-11ef-920d-fa163efe023e,production,10:04:48.763443928,Job,JobRegistered 2024-06-12T10:04:49.35585745Z,map[Job:map[Affinities: AllAtOnce:false Constraints: ConsulNamespace: ConsulToken: CreateIndex:769787 Datacenters:[dc1] 
DispatchIdempotencyToken: Dispatched:false ID:29-33eaa850-28a3-11ef-920d-fa163efe023e JobModifyIndex:769787 Meta: ModifyIndex:769789 Multiregion: Name:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob: ParentID: Payload: Periodic: Priority:50 Region:global Spreads: Stable:false Status:running StatusDescription: Stop:false SubmitTime:1.7181866887586696e+18 TaskGroups:[map[Affinities: Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 Disconnect: EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect: Meta: Migrate: Name:default-group Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:0 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:true] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:map[CreateIndex:769787 Enabled:true ID:1e4b91f1-7ea2-ea3d-f063-a7c3435d1d1e Max:300 Min:0 ModifyIndex:769787 Policy: Target:map[Group:default-group Job:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon] Type:horizontal] Services: ShutdownDelay: Spreads:[map[Attribute:${node.unique.name} SpreadTarget: Weight:100]] StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_python:3.8 network_mode:none] Constraints: Consul: DispatchPayload: Driver:docker Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:default-task Resources:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail 
RenderTemplates:false] ScalingPolicies: Schedule: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:] map[Affinities: Constraints: Consul:map[Cluster:default Namespace: Partition:] Count:0 Disconnect: EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect: Meta:map[used:false] Migrate: Name:config Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling: Services: ShutdownDelay: Spreads: StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[command:true] Constraints: Consul: DispatchPayload: Driver:exec Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:config Resources:map[CPU:1 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Schedule: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:]] Type:batch UI: Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 MaxParallel:0 MinHealthyTime:0 ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:0]],payload,poseidon_nomad_events,29-33eaa850-28a3-11ef-920d-fa163efe023e,production,10:04:48.853090004,Job,PlanResult 2024-06-12T11:00:00.314133593Z,map[Job:map[Affinities: AllAtOnce:false Constraints: ConsulNamespace: ConsulToken: CreateIndex:769787 Datacenters:[dc1] DispatchIdempotencyToken: Dispatched:false ID:29-33eaa850-28a3-11ef-920d-fa163efe023e JobModifyIndex:770052 Meta: ModifyIndex:770052 Multiregion: 
Name:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob: ParentID: Payload: Periodic: Priority:50 Region:global Spreads: Stable:false Status:running StatusDescription: Stop:false SubmitTime:1.718189999948456e+18 TaskGroups:[map[Affinities: Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 Disconnect: EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect: Meta: Migrate: Name:default-group Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:0 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:true] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:map[CreateIndex:0 Enabled:true ID:1e4b91f1-7ea2-ea3d-f063-a7c3435d1d1e Max:300 Min:0 ModifyIndex:0 Policy: Target:map[Group:default-group Job:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon] Type:horizontal] Services: ShutdownDelay: Spreads:[map[Attribute:${node.unique.name} SpreadTarget: Weight:100]] StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_python:3.8 network_mode:none] Constraints: Consul: DispatchPayload: Driver:docker Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:default-task Resources:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Schedule: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:] map[Affinities: Constraints: 
Consul:map[Cluster:default Namespace: Partition:] Count:0 Disconnect: EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect: Meta:map[timeout:180 used:true] Migrate: Name:config Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling: Services: ShutdownDelay: Spreads: StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[command:true] Constraints: Consul: DispatchPayload: Driver:exec Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:config Resources:map[CPU:1 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Schedule: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:]] Type:batch UI: Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 MaxParallel:0 MinHealthyTime:0 ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:1]],payload,poseidon_nomad_events,29-33eaa850-28a3-11ef-920d-fa163efe023e,production,10:59:59.258575624,Job,JobRegistered 2024-06-12T11:08:04.318388598Z,map[Job:map[Affinities: AllAtOnce:false Constraints: ConsulNamespace: ConsulToken: CreateIndex:769787 Datacenters:[dc1] DispatchIdempotencyToken: Dispatched:false ID:29-33eaa850-28a3-11ef-920d-fa163efe023e JobModifyIndex:770052 Meta: ModifyIndex:770052 Multiregion: Name:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob: ParentID: Payload: Periodic: Priority:50 
Region:global Spreads: Stable:false Status:running StatusDescription: Stop:false SubmitTime:1.718189999948456e+18 TaskGroups:[map[Affinities: Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 Disconnect: EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect: Meta: Migrate: Name:default-group Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:0 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:true] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:map[CreateIndex:0 Enabled:true ID:1e4b91f1-7ea2-ea3d-f063-a7c3435d1d1e Max:300 Min:0 ModifyIndex:0 Policy: Target:map[Group:default-group Job:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon] Type:horizontal] Services: ShutdownDelay: Spreads:[map[Attribute:${node.unique.name} SpreadTarget: Weight:100]] StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_python:3.8 network_mode:none] Constraints: Consul: DispatchPayload: Driver:docker Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:default-task Resources:map[CPU:20 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Schedule: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:] map[Affinities: Constraints: Consul:map[Cluster:default Namespace: Partition:] Count:0 Disconnect: EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect: 
Meta:map[timeout:180 used:true] Migrate: Name:config Networks: PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling: Services: ShutdownDelay: Spreads: StopAfterClientDisconnect: Tasks:[map[Actions: Affinities: Artifacts: CSIPluginConfig: Config:map[command:true] Constraints: Consul: DispatchPayload: Driver:exec Env: Identities: Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle: LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta: Name:config Resources:map[CPU:1 Cores:0 Devices: DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA: Networks:] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies: Schedule: Services: ShutdownDelay:0 Templates: User: Vault: VolumeMounts:]] Update: Volumes:]] Type:batch UI: Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 MaxParallel:0 MinHealthyTime:0 ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:1]],payload,poseidon_nomad_events,29-33eaa850-28a3-11ef-920d-fa163efe023e,production,11:08:04.308806678,Job,JobDeregistered ```

This raises the question of whether the Sentry issue (see #406) can be seen as an indicator of a changed allocation ID when both Nomad and Poseidon crashed during a migration. Alternatively, we may have ignored an important event:

`time="2024-06-12T10:59:59.280162Z" level=debug msg="Ignoring duplicate event" allocID=54750d38-7bb8-978c-1f0a-1ca64f1c70b4 package=nomad`

This should be fixed together with #602 and #612.
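
To look for further affected allocations, the two event streams above can be cross-referenced: an allocation whose last known client status is not terminal, but whose Job has a `JobDeregistered` event, is a candidate for a leaked storage entry. The following Python sketch illustrates the idea. The `Event` type, its field names, and `find_leak_candidates` are hypothetical and not part of Poseidon. It also assumes that `complete`, `failed`, and `lost` are the client statuses that mark an allocation as gone, and that the events were already extracted from InfluxDB in chronological order.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumption: these client statuses mean the allocation no longer runs.
TERMINAL_CLIENT_STATUSES = {"complete", "failed", "lost"}


@dataclass(frozen=True)
class Event:
    """Hypothetical, reduced view of one stored Nomad event."""
    topic: str                           # "Allocation" or "Job"
    type: str                            # e.g. "PlanResult", "JobDeregistered"
    job_id: str
    alloc_id: Optional[str] = None       # only set for Allocation events
    client_status: Optional[str] = None  # only set for Allocation events


def find_leak_candidates(events: Iterable[Event]) -> dict[str, str]:
    """Return {alloc_id: job_id} for allocations that never reported a
    terminal client status although their job was deregistered."""
    open_allocs: dict[str, str] = {}
    deregistered_jobs: set[str] = set()
    for event in events:
        if event.topic == "Allocation" and event.alloc_id is not None:
            if event.client_status in TERMINAL_CLIENT_STATUSES:
                open_allocs.pop(event.alloc_id, None)
            else:
                open_allocs[event.alloc_id] = event.job_id
        elif event.topic == "Job" and event.type == "JobDeregistered":
            deregistered_jobs.add(event.job_id)
    return {a: j for a, j in open_allocs.items() if j in deregistered_jobs}


# The events of the example runner, reduced to the relevant fields.
JOB = "29-33eaa850-28a3-11ef-920d-fa163efe023e"
ALLOC = "54750d38-7bb8-978c-1f0a-1ca64f1c70b4"
events = [
    Event("Job", "JobRegistered", JOB),
    Event("Allocation", "PlanResult", JOB, ALLOC, "pending"),
    Event("Job", "PlanResult", JOB),
    Event("Allocation", "AllocationUpdated", JOB, ALLOC, "running"),
    Event("Job", "JobRegistered", JOB),
    Event("Allocation", "PlanResult", JOB, ALLOC, "running"),
    Event("Job", "JobDeregistered", JOB),
]
print(find_leak_candidates(events))
```

For the events listed above, the only allocation of the example runner is reported as a candidate, because its last Allocation event still has `ClientStatus:running` when the Job is deregistered.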


Another case is 29-f6160f46-0e6b-11ef-97ca-fa163e7afdf8 on the 10th of May.
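
Further cases like these two might be found by collecting the runner IDs from Poseidon's `Destroying Runner` log lines and checking them against the storage content. A minimal sketch, assuming the log lines keep the `key=value` format shown above. `parse_logfmt` and `destroyed_runner_ids` are hypothetical helper names, and `shlex` only approximates a logfmt parser (lines with unbalanced quotes are skipped).

```python
import shlex
from typing import Iterable, Iterator


def parse_logfmt(line: str) -> dict[str, str]:
    """Split a key=value log line (values optionally double-quoted) into a dict."""
    try:
        tokens = shlex.split(line)
    except ValueError:
        # Unbalanced quotes: treat the line as unparsable.
        return {}
    fields: dict[str, str] = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


def destroyed_runner_ids(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (runner_id, destroy_reason) for every 'Destroying Runner' entry."""
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("msg") == "Destroying Runner" and "runner_id" in fields:
            yield fields["runner_id"], fields.get("destroy_reason", "")


# The log line of the example runner from this issue.
sample = (
    'time="2024-06-12T11:08:04.171686Z" level=debug msg="Destroying Runner" '
    'destroy_reason="runner inactivity timeout exceeded" package=runner '
    "runner_id=29-33eaa850-28a3-11ef-920d-fa163efe023e"
)
print(list(destroyed_runner_ids([sample])))
```

Runner IDs that appear in this list but are still present in the `nomad_allocations` storage would be candidates for the same leak.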