iterative / terraform-provider-iterative

☁️ Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes
https://registry.terraform.io/providers/iterative/iterative/latest/docs
Apache License 2.0

This is an initial PR for the LEO server #708

Closed tasdomas closed 1 year ago

tasdomas commented 1 year ago

The basic idea is to wrap all LEO commands in a RESTful API (with the exception of tailing command logs, which we can support by upgrading the connection to a WebSocket).

To minimize sending of sensitive cloud credentials, the client will begin its session by sending credentials to the server and receiving back a credentials key. This key will be used in subsequent requests to instruct the server to use a specific cloud credential.

Credentials themselves will be stored in memory by the server; for security reasons, they should be expired periodically.

0x2b3bfa0 commented 1 year ago

To minimize sending of sensitive cloud credentials, the client will begin its session by sending credentials to the server and receiving back a credentials key. This key will be used in subsequent requests to instruct the server to use a specific cloud credential.

Credentials themselves will be stored in memory by the server; for security reasons, they should be expired periodically.

This introduces a stateful constraint to our architecture, preventing us from balancing requests across a pool of uniform servers without additional supporting resources.

Is there any reason not to send credentials on every request? E.g. not trusting HTTPS or connection termination at the gateway? 🤔

If that's the case, we can send a public age key to clients so they can encrypt credentials before transmitting them. This will allow us to remain stateless while offering an additional layer of protection.

0x2b3bfa0 commented 1 year ago

with the exception of tailing command logs, which we can do by upgrading the connection to a websocket

We can probably live without log streaming for a first prototype, and just provide users with a big, beautiful refresh button. 😈

0x2b3bfa0 commented 1 year ago

By the way, what kind of serving model do you have in mind? I was thinking that something like AWS Lambda + SQS for long-running operations could be better than trying to keep an HTTP connection open for 2 minutes (see e.g. what Azure does).

0x2b3bfa0 commented 1 year ago

I somehow thought that this pull request was main < feature-leo-server and not feature-leo-server < d025-server-start 🤦🏼‍♂️

Marking it as a draft didn’t make any sense, sorry.

tasdomas commented 1 year ago

By the way, what kind of serving model do you have in mind? I was thinking that something like AWS Lambda + SQS for long-running operations could be better than trying to keep an HTTP connection open for 2 minutes (see e.g. what Azure does).

How is Studio served? We naturally need to align with that.

0x2b3bfa0 commented 1 year ago

The Studio backend is a Python (Celery, Django) application deployed to Kubernetes. 😅

tasdomas commented 1 year ago

Removed in-memory credential storage.