iterative / terraform-provider-iterative

☁️ Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes
https://registry.terraform.io/providers/iterative/iterative/latest/docs
Apache License 2.0

Runner Fix nvidia setup and restart #607

Closed DavidGOrtega closed 2 years ago

DavidGOrtega commented 2 years ago

This has been a tricky PR. After fixing the restart so that the GPU is accessible after the kernel upgrade, we hit a very funny error (seen here and here): the SSH connection hung on dial and never returned, so the resource timed out. To solve it, I had to wrap the logs function in a function with a timeout. We added Teleport's fix, mentioned in https://github.com/iterative/terraform-provider-iterative/pull/607#issuecomment-1159793257

Related: #606

0x2b3bfa0 commented 2 years ago

As per https://github.com/golang/go/issues/15113#issuecomment-283218133, the 2 second timeout we had in place should be enough, but it apparently isn't. 😅

https://github.com/iterative/terraform-provider-iterative/blob/27d0daa84e786f0da9ba8cef6d0afd163a6c0b55/iterative/utils/ssh.go#L64

See also https://github.com/golang/go/issues/15113, https://github.com/gravitational/teleport/issues/1153, and https://github.com/gravitational/teleport/pull/1152 for a possible solution.

DavidGOrtega commented 2 years ago

We already have a solution in place

DavidGOrtega commented 2 years ago

As per golang/go#15113 (comment), the 2 second timeout we had in place should be enough, but it apparently isn't. 😅

https://github.com/iterative/terraform-provider-iterative/blob/27d0daa84e786f0da9ba8cef6d0afd163a6c0b55/iterative/utils/ssh.go#L64

See also golang/go#15113, gravitational/teleport#1153, and gravitational/teleport#1152 for a possible solution.

https://github.com/golang/go/issues/21941#issuecomment-346141968

From the client's perspective, one could wrap ssh.Dial in a goroutine with a buffered channel and have an app-specific timeout. We actually like this approach much better,

DavidGOrtega commented 2 years ago

@0x2b3bfa0 We are using https://github.com/gravitational/teleport/pull/1152 workaround!

https://github.com/iterative/terraform-provider-iterative/pull/607/commits/9aaac5d3349a771033402bda62be883a2c2d0c29

0x2b3bfa0 commented 2 years ago

Sorry! I missed that commit! 🙈 🚀

dacbd commented 2 years ago

@0x2b3bfa0 let's merge and release?