Closed. droctothorpe closed this issue 2 years ago.
We added an init container that looks something like this to get around this issue:
```python
c.KubeClusterConfig.worker_extra_pod_config["initContainers"] = [{
    "name": "wait-for-kiam",
    "image": "",  # image elided in the original; it needs the AWS CLI available
    "command": [
        "sh",
        "-c",
        # Retry `aws sts get-caller-identity` up to 12 times, 5 s apart,
        # and exit with the last status so the pod fails if KIAM never responds.
        "for i in $(seq 1 12); do [ $i -gt 1 ] && sleep 5; aws sts get-caller-identity && s=0 && break || s=$?; done; (exit $s)"
    ],
    # Reuse the worker container's env vars so the CLI sees the same AWS config.
    "env": c.KubeClusterConfig.worker_extra_container_config.get("env", [])
}]
```
Janky af but it worked iirc.
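For readers who would rather retry at the client instead of in an init container, the shell loop above boils down to a generic "retry until ready" helper. This is just a sketch; `wait_for` and its defaults are hypothetical and simply mirror the init container's 12 attempts with a 5-second pause:

```python
import time

def wait_for(check, attempts=12, delay=5):
    """Retry `check` until it returns truthy or attempts run out.

    Mirrors the init container's shell loop: in that setting `check`
    would be a call equivalent to `aws sts get-caller-identity`.
    """
    for i in range(attempts):
        if i > 0:
            time.sleep(delay)  # pause between attempts, like `sleep 5`
        try:
            if check():
                return True
        except Exception:
            pass  # treat a raised error like a nonzero exit status
    return False
```

Returning `False` after the last attempt plays the role of `(exit $s)`: the caller can decide whether to fail loudly.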
I'm using KIAM on a kops-provisioned K8s cluster (v1.19.7) with https://gateway.dask.org/.
KIAM works great for us until we do an S3 read/write across a distributed Dask cluster with dozens of workers. During large-scale distributed operations like that, one or several of the Dask workers report a `NoCredentialsFound` error. There are no corresponding logs in either the KIAM agent or the KIAM server.
I'm wondering whether the KIAM agents simply can't keep up with simultaneous credential requests from that many Dask workers.
Any insight / input would be greatly appreciated.