litmuschaos / litmus

Litmus helps SREs and developers practice chaos engineering in a cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
https://litmuschaos.io
Apache License 2.0

Placeholder for nodeSelectors, Tolerations and resources:{} needs to be enabled in the final workflow for experiment #4480

Open biplabMazumdar opened 7 months ago

biplabMazumdar commented 7 months ago

What is the issue? While creating an experiment, there is no way to set spec params such as nodeSelectors, tolerations, and resources: {}.

Impact of the above issue: The experiment pods fail to launch because of this constraint.

Workaround: I tried modifying the final workflow to pass the spec params, but that did not work. I also tried podSpecPatch, which did not work either. So far I have not found a workaround.
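For readers trying the same kind of manual edit, here is a minimal sketch of setting the fields directly on the Argo templates instead of using podSpecPatch. It treats the workflow manifest as a plain Python dict. `tolerations`, `nodeSelector`, and `container.resources` are standard Argo Workflows template fields, but the helper name `patch_templates`, the placeholder image, and the resource values are assumptions for illustration, and nothing in this thread confirms that Litmus keeps these fields when it submits the workflow.

```python
import copy


def patch_templates(workflow, tolerations=None, node_selector=None, resources=None):
    """Return a copy of an Argo Workflow manifest with scheduling fields
    set on every template that runs a container (hypothetical helper)."""
    patched = copy.deepcopy(workflow)
    for template in patched.get("spec", {}).get("templates") or []:
        container = template.get("container")
        if container is None:
            continue  # steps/DAG templates do not create a pod of their own
        if tolerations:
            template["tolerations"] = copy.deepcopy(tolerations)
        if node_selector:
            template["nodeSelector"] = dict(node_selector)
        if resources:
            # Keep resources that are already present on the container.
            container.setdefault("resources", copy.deepcopy(resources))
    return patched


workflow = {
    "kind": "Workflow",
    "apiVersion": "argoproj.io/v1alpha1",
    "metadata": {"name": "exp2", "namespace": "plm"},
    "spec": {
        "templates": [
            {"name": "exp2", "steps": [[{"name": "install-chaos-faults",
                                         "template": "install-chaos-faults"}]]},
            # Placeholder image; the real template is not shown in the issue.
            {"name": "install-chaos-faults", "container": {"image": "example/image:tag"}},
        ]
    },
}

patched = patch_templates(
    workflow,
    # No "effect" is given, so the toleration matches the taint for any effect.
    tolerations=[{"key": "app", "operator": "Equal", "value": "plmrs"}],
    # Placeholder limits; choose values that satisfy the cluster policy.
    resources={"limits": {"cpu": "200m", "memory": "256Mi"}},
)
```

This only affects pods that Argo itself creates, such as the `install-chaos-faults` step. Pods launched later by the chaos experiment are created by Litmus, so they would need their own settings.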

Errors / Events

49m Normal WorkflowRunning workflow/exp2-1709391345601 Workflow Running
49m Warning WorkflowFailed workflow/exp2-1709391345601 step group deemed errored due to child exp2-1709391345601[0].install-chaos-faults error: admission webhook "validation.gatekeeper.sh" denied the request: [azurepolicy-k8sazurev3containerlimits-bf1dfd0b98bb4e52b775] container has no resource limits...
49m Normal WorkflowNodeRunning workflow/exp2-1709391345601 Running node exp2-1709391345601: step group deemed errored due to child exp2-1709391345601[0].install-chaos-faults error: admission webhook "validation.gatekeeper.sh" denied the request: [azurepolicy-k8sazurev3containerlimits-bf1dfd0b98bb4e52b775] container has no resource limits...
49m Warning WorkflowNodeFailed workflow/exp2-1709391345601 Failed node exp2-1709391345601: step group deemed errored due to child exp2-1709391345601[0].install-chaos-faults error: admission webhook "validation.gatekeeper.sh" denied the request: [azurepolicy-k8sazurev3containerlimits-bf1dfd0b98bb4e52b775] container has no resource limits...
49m Normal WorkflowNodeRunning workflow/exp2-1709391345601 Running node exp2-1709391345601[0]: step group deemed errored due to child exp2-1709391345601[0].install-chaos-faults error: admission webhook "validation.gatekeeper.sh" denied the request: [azurepolicy-k8sazurev3containerlimits-bf1dfd0b98bb4e52b775] container has no resource limits...
49m Warning WorkflowNodeError workflow/exp2-1709391345601 Error node exp2-1709391345601[0]: step group deemed errored due to child exp2-1709391345601[0].install-chaos-faults error: admission webhook "validation.gatekeeper.sh" denied the request: [azurepolicy-k8sazurev3containerlimits-bf1dfd0b98bb4e52b775] container has no resource limits...
49m Normal WorkflowNodeRunning workflow/exp2-1709391345601 Running node exp2-1709391345601[0].install-chaos-faults: admission webhook "validation.gatekeeper.sh" denied the request: [azurepolicy-k8sazurev3containerlimits-bf1dfd0b98bb4e52b775] container has no resource limits...
49m Warning WorkflowNodeError workflow/exp2-1709391345601 Error node exp2-1709391345601[0].install-chaos-faults: admission webhook "validation.gatekeeper.sh" denied the request: [azurepolicy-k8sazurev3containerlimits-bf1dfd0b98bb4e52b775] container has no resource limits...
11m Warning WorkflowFailed workflow/exp2-1709393625581 invalid spec: templates.exp2.steps[0].install-chaos-faults templates.install-chaos-faults: failed to resolve {{inputs.parameters.effect1}}
3m48s Normal NotTriggerScaleUp pod/pod-cpu-hog-cwtdzo-t9j2n pod didn't trigger scale-up: 1 node(s) had untolerated taint {app: plmrs}, 1 max node group size reached
13m Normal NotTriggerScaleUp pod/pod-cpu-hog-cwtdzo-t9j2n pod didn't trigger scale-up: 1 max node group size reached, 1 node(s) had untolerated taint {app: plmrs}
6m49s Warning FailedScheduling pod/pod-cpu-hog-cwtdzo-t9j2n 0/12 nodes are available: 3 node(s) had untolerated taint {CriticalAddonsOnly: true}, 9 node(s) had untolerated taint {app: plmrs}. preemption: 0/12 nodes are available: 12 Preemption is not helpful for scheduling..
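The FailedScheduling events concern `pod/pod-cpu-hog-cwtdzo-t9j2n`, which looks like a pod created by the chaos experiment rather than by an Argo step. The sketch below shows one way such pods might be given tolerations and resources through the ChaosEngine manifest embedded in the workflow. It assumes the ChaosEngine schema exposes `spec.components.runner` and `spec.experiments[].spec.components` with `tolerations` and `resources` fields, which should be verified against the installed CRD. The helper name `patch_chaos_engine`, the experiment name inferred from the pod name, and the resource values are illustrative only.

```python
import copy


def patch_chaos_engine(engine, tolerations=None, resources=None):
    """Return a copy of a ChaosEngine manifest (as a dict) with tolerations
    and resources set for the runner and for each experiment (hypothetical helper)."""
    patched = copy.deepcopy(engine)
    spec = patched.setdefault("spec", {})
    # Assumed location of the chaos-runner pod settings.
    targets = [spec.setdefault("components", {}).setdefault("runner", {})]
    # Assumed location of the per-experiment pod settings.
    for experiment in spec.get("experiments") or []:
        targets.append(experiment.setdefault("spec", {}).setdefault("components", {}))
    for target in targets:
        if tolerations:
            target["tolerations"] = copy.deepcopy(tolerations)
        if resources:
            target["resources"] = copy.deepcopy(resources)
    return patched


engine = {
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosEngine",
    # Experiment name inferred from the pod name in the events; an assumption.
    "spec": {"experiments": [{"name": "pod-cpu-hog"}]},
}

patched_engine = patch_chaos_engine(
    engine,
    # Matches the {app: plmrs} taint reported in the events, for any effect.
    tolerations=[{"key": "app", "operator": "Equal", "value": "plmrs"}],
    # Placeholder limits, chosen only to illustrate the shape of the field.
    resources={"limits": {"cpu": "200m", "memory": "256Mi"}},
)
```

The three nodes tainted with `CriticalAddonsOnly` are deliberately left untolerated here, on the assumption that experiment pods should land only on the `app: plmrs` nodes.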

Current Modified Workflow File

kind: Workflow
apiVersion: argoproj.io/v1alpha1
metadata:
  name: exp2
  namespace: plm
  labels:
    infra_id: b50e8dc3-9f0a-4e15-bf0f-4061c3eb5c44
    revision_id: f9bf2725-6bbb-43f6-b674-c4171df5b220
    workflow_id: e0717cca-59f0-4efe-98a1-36b2106733b5
    workflows.argoproj.io/controller-instanceid: b50e8dc3-9f0a-4e15-bf0f-4061c3eb5c44
spec:
  templates:

biplabMazumdar commented 4 months ago

Is this feature already available? Can someone please comment? I was not able to set the required tolerations, so the required Litmus pods and experiment jobs were not getting scheduled.