Fix k8s data transfer setup

iterative / terraform-provider-iterative

☁️ Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes

https://registry.terraform.io/providers/iterative/iterative/latest/docs

Apache License 2.0


Fix k8s data transfer setup #669

Closed by tasdomas 1 year ago

tasdomas commented 1 year ago

Instead of reusing (abusing) the original job, launch a separate deployment with a busybox pod and minimal requirements to facilitate data transfer.

This addresses #647 and #648

0x2b3bfa0 commented 1 year ago

This code does not guarantee that the Deployment will be scheduled on the same node as the subsequent Job, which effectively rules out access modes other than ReadWriteMany.

Major blunder: as @tasdomas pointed out in a separate conversation, the previous implementation had exactly the same issue. 🙈

0x2b3bfa0 commented 1 year ago

Instead of separating data transfer from the main job, we should consider using a sidecar or, probably better yet, an init container as part of the main job, with the sole purpose of performing data transfer. What do you think?

tasdomas commented 1 year ago

Won't the init container be started for each job pod though?

0x2b3bfa0 commented 1 year ago

Yes, although that's a feature rather than a bug. 😄 We can use the init containers for synchronization when parallelism is greater than 1. That is, the first init container to start performs the actual data synchronization, and the others just wait until the data copy finishes.

0x2b3bfa0 commented 1 year ago

There's still an issue, though. Copying the results back once the job finishes running still requires spinning up a Job or a Deployment again. 🤦🏼‍♂️