iterative / cml

♾️ CML - Continuous Machine Learning | CI/CD for ML
http://cml.dev
Apache License 2.0

Private VPC cloud runners? #1472

Open act-mreeves opened 1 month ago

act-mreeves commented 1 month ago

First of all, apologies: this is going to be a very AWS-focused comment. Are there any plans to support runners in private subnets, or at least a way to specify an Elastic IP?

My core issue is that I want my runner to connect to our MLflow, which is behind a security group that only allows access from certain IPs and security groups. I can't use complementary security groups (e.g. allow the runner SG to connect to the MLflow SG on port 443) because the runner EC2 instance is public.

I see that cml runner launch uses Terraform, so if you can point me to the correct repo for the runner client and the Terraform generation code, I could try to carry my own water.

Ideally, I'd like to see a "private VPC" runner mode. Instead of needing SSH to connect to the runner, we could use aws ssm start-session or some other callback or API, so that the GitHub Actions endpoints would not need direct network access over the public internet. Is there any reason for this direct network access besides the initial health check?

0x2b3bfa0 commented 1 month ago

You can probably use cml runner launch --cloud-aws-subnet to choose a subnet in a private VPC:

https://github.com/iterative/cml/blob/5e9fcd26e9683db26a429eb5f34b989134694d4b/bin/cml/runner/launch.js#L586

See the SDK code we run here.

Your mileage may vary if you intend to use CML without a publicly reachable IP address and SSH server, but it might be possible.

act-mreeves commented 1 month ago

I may be mistaken, but I think I tried that initially, and the GitHub Action never recognized the machine as healthy/ready when it was on a private subnet. On a public subnet with a security group open to the world, it succeeds:

{"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m20s elapsed]"}
{"level":"info","message":"iterative_cml_runner.runner: Creation complete after 1m25s [id=cml-8bsk91decf-ztewkt24-3nng3tlz]"}

On the private subnet, I got this instead:

{"level":"info","message":"iterative_cml_runner.runner: Creation errored after 19m22s"}
{"level":"error","message":"terraform error: Error: Error checking the runner status"}

With Terraform logging set to DEBUG, I saw:

2024-08-06T00:12:58.610Z [INFO]  provider.terraform-provider-iterative: 2024/08/06 00:12:58 [TRACE] Waiting 10s before next try: timestamp=2024-08-06T00:12:58.609Z
2024-08-06T00:13:10.610Z [INFO]  provider.terraform-provider-iterative: 2024/08/06 00:13:10 [DEBUG] Connection errors: &net.OpError{Op:"dial", Net:"tcp", Source:net.Addr(nil), Addr:(*net.TCPAddr)(0xc000afa720), Err:(*net.timeoutError)(0x594aa00)}: timestamp=2024-08-06T00:13:10.610Z
2024-08-06T00:13:10.610Z [INFO]  provider.terraform-provider-iterative: 2024/08/06 00:13:10 [TRACE] Waiting 10s before next try: timestamp=2024-08-06T00:13:10.610Z
2024-08-06T00:13:18.832Z [INFO]  provider.terraform-provider-iterative: 2024/08/06 00:13:18 [WARN] WaitForState timeout after 19m0s: timestamp=2024-08-06T00:13:18.832Z
2024-08-06T00:13:18.832Z [INFO]  provider.terraform-provider-iterative: 2024/08/06 00:13:18 [WARN] WaitForState starting 30s refresh grace period: timestamp=2024-08-06T00:13:18.832Z
2024-08-06T00:13:19.719Z [DEBUG] provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"
2024-08-06T00:13:19.722Z [INFO]  provider: plugin process exited: plugin=.terraform/providers/registry.terraform.io/iterative/iterative/0.11.20/linux_amd64/terraform-provider-iterative id=2186
2024-08-06T00:13:19.722Z [DEBUG] provider: plugin exited
{"level":"error","message":"terraform apply error","stack":"Error: terraform apply error\n    at Object.apply (/usr/local/lib/node_modules/@dvcorg/cml/src/terraform.js:55:11)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async runTerraform (/usr/local/lib/node_modules/@dvcorg/cml/bin/cml/runner/launch.js:184:5)\n    at async runCloud (/usr/local/lib/node_modules/@dvcorg/cml/bin/cml/runner/launch.js:193:19)\n    at async run (/usr/local/lib/node_modules/@dvcorg/cml/bin/cml/runner/launch.js:433:14)\n    at async exports.handler (/usr/local/lib/node_modules/@dvcorg/cml/bin/cml/runner/launch.js:446:5)"}

So it seems the GitHub Action is doing something to check whether the runner is ready.

0x2b3bfa0 commented 1 month ago

Yes, it is doing something, and it does require being able to reach the EC2 instance's SSH server.

The quickest/hackiest workaround could be to ignore the exit code of cml runner launch completely and use the GitHub/GitLab/... API to wait for (or check whether) the runner has registered correctly. This would eliminate the need for SSH access.
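
For the GitHub case, the polling half of that workaround might look roughly like the sketch below. It is only an illustration, not something CML ships: it assumes the runner registers at the repository level, that the token is allowed to list self-hosted runners for the repository, and that the runner carries a known label (the "cml" default used here is an assumption; pass whatever you gave to --labels). The function names and the RUNNER_LABEL variable are hypothetical. It calls the GitHub REST endpoint GET /repos/{owner}/{repo}/actions/runners and waits until a runner with that label reports status "online".

```python
import json
import os
import time
import urllib.request

API = "https://api.github.com"


def list_runners(repo, token):
    """Return the self-hosted runners registered for `repo` ("owner/name")."""
    request = urllib.request.Request(
        f"{API}/repos/{repo}/actions/runners?per_page=100",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["runners"]


def find_online_runner(runners, label):
    """Return the first online runner carrying `label`, or None."""
    for runner in runners:
        labels = {item["name"] for item in runner.get("labels", [])}
        if runner.get("status") == "online" and label in labels:
            return runner
    return None


def wait_for_runner(fetch, label, timeout=1200, interval=10, sleep=time.sleep):
    """Poll `fetch()` until a matching runner is online or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        runner = find_online_runner(fetch(), label)
        if runner is not None:
            return runner
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no online runner with label {label!r}")
        sleep(interval)


def main():
    # GITHUB_REPOSITORY is set by GitHub Actions; REPO_TOKEN and RUNNER_LABEL
    # are assumed to be provided by the workflow (RUNNER_LABEL is hypothetical).
    repo = os.environ["GITHUB_REPOSITORY"]
    token = os.environ["REPO_TOKEN"]
    label = os.environ.get("RUNNER_LABEL", "cml")
    runner = wait_for_runner(lambda: list_runners(repo, token), label)
    print(f"runner {runner['name']} is online")
```

In a workflow, main() would run in the step after a cml runner launch whose non-zero exit code is deliberately ignored. The instance in the private subnet still needs outbound access to GitHub (for example through a NAT gateway) in order to register at all.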