Open surya9teja opened 1 month ago
Update: Kind of found the potential culprit.
Before that, I checked and thought through several potential scenarios. The last one turned out to be the main cause of the failed file upload: when the node is freshly provisioned, the busy-box container that uploads the files starts before the node is ready. When I add a dummy init container to the flow with just a 10s delay, the file uploads successfully.
My current setup is on EKS, where I use Karpenter to scale nodes up and down dynamically based on node selectors and the resources the pods require. When the flow is triggered by an SQS message, it tries to create a pod that must be deployed on a specific node. If that node is not readily available, Karpenter provisions it and the pod is then scheduled onto the newly provisioned node. During this, the init container (i.e. the file upload container) tries to upload the file well before the node is ready, and with a retry interval of 0.5s and 3 retries, the retries are exhausted and the whole file upload fails.
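To put rough numbers on this, here is a small sketch of the retry budget. The 0.5s interval and 3 retries are the values I observed; the helper `total_wait` and the exponential backoff variant are purely hypothetical illustrations, not how Kestra actually implements its retries, and I am assuming one wait per retry.

```python
def total_wait(retries: int, interval: float, factor: float = 1.0) -> float:
    """Total seconds spent waiting across all retries before giving up.

    factor=1.0 models a fixed interval; factor>1.0 models exponential
    backoff (a hypothetical alternative, not Kestra's behaviour).
    """
    return sum(interval * factor**i for i in range(retries))


# Observed settings: 3 retries, 0.5s apart.
fixed_budget = total_wait(retries=3, interval=0.5)

# Hypothetical alternative: 6 retries, doubling the wait each time.
backoff_budget = total_wait(retries=6, interval=0.5, factor=2.0)

print(f"fixed interval budget: {fixed_budget}s")
print(f"exponential backoff budget: {backoff_budget}s")
```

With the observed settings the upload only has about 1.5s to succeed, which is much shorter than the 10s delay that makes it work on a freshly provisioned node.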
To solve this, I added a dummy init container that just sleeps for 10s before the file upload container proceeds to upload to the task pod.
sample flow
- id: test_pod_sqs
  type: io.kestra.plugin.kubernetes.PodCreate
  namespace: staging
  inputFiles:
    data.jsonl: "{{outputs.to_json.uri}}"
  metadata:
    labels:
      company: microform.boa
      task: boa-etl-pipeline-part-1
  waitRunning: PT1H
  waitUntilRunning: PT30M
  spec:
    initContainers:
      - name: init-delay
        image: busybox
        command:
          - "/bin/sh"
          - "-c"
          - |
            echo 'Waiting'
            sleep 10
            echo 'Ready successfully'
    containers:
      - name: unittest
        image: debian:stable-slim
        command:
          - cat
          - "{{workingDir}}/data.jsonl"
    nodeSelector:
      resource-type: private-cpu
    tolerations:
      - key: private/cpu
        operator: Exists
        effect: NoSchedule
    restartPolicy: Never
I am not sure how the readiness check works behind the scenes for the init container, but it would be helpful if you could give us access to modify parts of the fileSidecar configuration, such as the sleep interval or the number of retries, or provide a better way of detecting node readiness. For now, this trick does the job.
I hope this helps. Let me know if you want more information on this.
@loicmathieu only tagged so you can check 👍
Expected Behavior
I have an SQS trigger, and when a new message flows into the queue, it is converted into .jsonl and the file URI is passed as inputFiles to kubernetes.PodCreate. The file is then accessed inside the pod and processed.
Actual Behaviour
When I pass nodeSelectors and tolerations to the Kubernetes pod so that it is deployed on a different node (not the one where the kestra-worker is deployed), the busy-box image fails to upload the file that I am trying to pass via the flow. But when I remove the node selectors and tolerations, the inputFiles upload works as intended. From my observation, it only fails if Kestra and the newly created task pod are not on the same node. By the way, I use Karpenter to scale the EKS nodes up and down dynamically (just passing this along in case it is related).
Steps To Reproduce
the error log for task creating pod and failing
Environment Information
Example flow