iterative / terraform-provider-iterative

☁️ Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes
https://registry.terraform.io/providers/iterative/iterative/latest/docs
Apache License 2.0

Handle creation, listing and destruction of tasks in leo server #714

Closed tasdomas closed 1 year ago

tasdomas commented 1 year ago

This PR introduces the functionality to allocate task resources in AWS.

Once the request comes in, the actual allocation is performed in the background. The user gets back a response containing the job ID, which can be used to check the status of the job.
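As a rough illustration of the pattern described above (return a job ID right away, do the work in the background, let the caller query the status later), here is a minimal in-memory sketch. `JobManager`, `Submit`, `Status` and the `job-N` ID format are hypothetical names for illustration, not the code in this PR; only the `executing` and `done` status strings come from the responses shown in this thread, and `failed` is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// JobStatus mirrors the "executing" and "done" strings seen in this thread.
// "failed" is an assumption added for the error case.
type JobStatus string

const (
	StatusExecuting JobStatus = "executing"
	StatusDone      JobStatus = "done"
	StatusFailed    JobStatus = "failed"
)

// JobManager is a hypothetical in-memory job registry.
type JobManager struct {
	mu   sync.Mutex
	next int
	jobs map[string]JobStatus
	wg   sync.WaitGroup
}

func NewJobManager() *JobManager {
	return &JobManager{jobs: make(map[string]JobStatus)}
}

// Submit records the job as executing, starts work in a goroutine and
// returns the job ID immediately, without waiting for work to finish.
func (m *JobManager) Submit(work func() error) string {
	m.mu.Lock()
	m.next++
	id := fmt.Sprintf("job-%d", m.next)
	m.jobs[id] = StatusExecuting
	m.mu.Unlock()

	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		status := StatusDone
		if err := work(); err != nil {
			status = StatusFailed
		}
		m.mu.Lock()
		m.jobs[id] = status
		m.mu.Unlock()
	}()
	return id
}

// Status reports the current state of a job and whether the ID is known.
func (m *JobManager) Status(id string) (JobStatus, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	s, ok := m.jobs[id]
	return s, ok
}

// Wait blocks until every submitted job has finished.
func (m *JobManager) Wait() { m.wg.Wait() }

func main() {
	m := NewJobManager()
	release := make(chan struct{})
	// The work blocks until released, standing in for a slow allocation.
	id := m.Submit(func() error { <-release; return nil })

	s, _ := m.Status(id)
	fmt.Println(id, s) // job-1 executing

	close(release)
	m.Wait()
	s, _ = m.Status(id)
	fmt.Println(id, s) // job-1 done
}
```

An HTTP handler built on this would call `Submit` and write the returned ID into the response body; a real server would also need to persist or expire finished jobs.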

I'm still having a little trouble with OpenAPI generation and will continue improving it in follow-ups.

Currently testing this functionality is a little cumbersome, but here are the basic steps:

  1. Create credentials payload:

    $ cat > credentials.json
    {
      "provider": "aws",
      "aws": {
        "access-key-id": "<access key>",
        "secret-access-key": "<secret>",
        "session-token": "<token, optional>"
      }
    }

    $ cat credentials.json | base64 -w 0 > credentials.b64

  2. Create a task request:

    $ cat > request.json
    {
      "image": "ubuntu",
      "machine": "s",
      "spot": false,
      "script": "#!/bin/sh\nls -al",
      "timeout": 300
    }

  3. Start the leo server:

    $ go run cmd/server/main.go


  4. Send the request (using httpie):

    $ http POST localhost:8080/task/ credentials:@./credentials.b64 @request.json

The response will be something like:

    { "id": "umBYvRaO" }


  5. Check the progress of the allocation job:

    $ http localhost:8080/job/


  6. List available tasks:

    ➜ temp http GET localhost:8080/task/ credentials:@./creds.b64
    HTTP/1.1 200 OK
    Content-Length: 57
    Content-Type: application/json
    Date: Fri, 11 Nov 2022 15:47:07 GMT

    { "tasks": [ "tpi-yearly-light-anemone-1yrn2jft-51sxxrqg" ] }


  7. Destroy a specific task:

    ➜ temp http DELETE localhost:8080/task/tpi-yearly-light-anemone-1yrn2jft-51sxxrqg credentials:@./creds.b64
    HTTP/1.1 200 OK
    Content-Length: 18
    Content-Type: application/json
    Date: Fri, 11 Nov 2022 15:47:27 GMT

    { "id": "evwFZnRn" }


  8. Wait for the job to finish:

    ➜ temp http localhost:8080/job/evwFZnRn
    HTTP/1.1 200 OK
    Content-Length: 39
    Content-Type: application/json
    Date: Fri, 11 Nov 2022 15:49:00 GMT

    { "id": "evwFZnRn", "status": "executing" }

    ➜ temp http localhost:8080/job/evwFZnRn
    HTTP/1.1 200 OK
    Content-Length: 34
    Content-Type: application/json
    Date: Fri, 11 Nov 2022 15:49:03 GMT

    { "id": "evwFZnRn", "status": "done" }


  9. List tasks again:

    ➜ temp http GET localhost:8080/task/ credentials:@./creds.b64
    HTTP/1.1 200 OK
    Content-Length: 13
    Content-Type: application/json
    Date: Fri, 11 Nov 2022 15:49:14 GMT

    { "tasks": [] }

dacbd commented 1 year ago

@tasdomas for my own awareness, in your style of PR/PR/PR -> feature branch -> main what level of code scrutiny/functionality testing do you expect from reviews?

In short

PR -> feature branch is:

feature branch -> main:

tasdomas commented 1 year ago

Honestly, it can work both ways. But I find reviewing smaller PRs much more effective (YMMV), so I think reviewing the PR -> feature branch PRs makes more sense than trying to review one large PR (feature branch -> master).

Granted, I'm using this approach (feature branches) because we don't have a develop branch (we need to talk about this).

tasdomas commented 1 year ago
  • Should server be a leo cli cmd?

    I don't think there's much to be gained by going down that route.

  • no server started logging feedback

    Fixed.

  • Is trailing slash sensitive? /task 404s, /task/ accepts the request.

    Fixed, but paths are currently trailing-slash sensitive.

  • timeout should not default to 0, use the same as the terraform interface?

    Not sure I understand what you mean.

  • use the same default region as the terraform interface

    Made region configurable in the credentials struct.
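On the trailing-slash point: one common way to make `/task` and `/task/` behave the same is a small middleware that normalizes the path before routing. The sketch below uses only the Go standard library; `stripTrailingSlash`, `newRouter` and `statusFor` are hypothetical helpers, and nothing here is known to match how the leo server actually routes requests.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// stripTrailingSlash rewrites "/task/" to "/task" before routing, so both
// spellings reach the same handler. The root path "/" is left untouched.
func stripTrailingSlash(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if p := r.URL.Path; len(p) > 1 && strings.HasSuffix(p, "/") {
			r2 := r.Clone(r.Context()) // avoid mutating the caller's request
			r2.URL.Path = strings.TrimSuffix(p, "/")
			r2.URL.RawPath = ""
			r = r2
		}
		next.ServeHTTP(w, r)
	})
}

// newRouter registers a placeholder /task handler on an exact-match pattern
// and wraps the mux with the normalizing middleware.
func newRouter() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/task", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"tasks":[]}`)
	})
	return stripTrailingSlash(mux)
}

// statusFor issues an in-memory GET request and returns the response code.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	h := newRouter()
	fmt.Println(statusFor(h, "/task"), statusFor(h, "/task/")) // 200 200
}
```

The alternative is to answer the non-canonical form with a redirect, which keeps one canonical URL per resource at the cost of an extra round trip.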

dacbd commented 1 year ago
  • timeout should not default to 0, use the same as the terraform interface?

Not sure I understand what you mean

@tasdomas I sent a request without a timeout, and the task reported it as timing out immediately, so it did nothing.

tasdomas commented 1 year ago
  • timeout should not default to 0, use the same as the terraform interface?

Not sure I understand what you mean

@tasdomas I sent a request without a timeout, and the task reported it as timing out immediately, so it did nothing.

Ah, that's a good point - will update.

tasdomas commented 1 year ago
  • timeout should not default to 0, use the same as the terraform interface?

Not sure I understand what you mean

@tasdomas I sent a request without a timeout, and the task reported it as timing out immediately, so it did nothing.

Ah, that's a good point - will update.

@dacbd I've updated the server handler with default values for region and timeout.

tasdomas commented 1 year ago
  • Create endpoint should also return the task id: { "job_id": "Qwerty", "task_id": "tpi-thanksfully-xyz" }

  • should the jobs endpoint return task_id as well?

Yes, I'm working on that right now.
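A possible shape for the response types under discussion is sketched below. The `job_id` and `task_id` field names come from the suggestion quoted above, and `id`/`status` from the job responses earlier in the thread; the struct names and the optional `task_id` on the job response are assumptions, not the final API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CreateTaskResponse carries both identifiers, as suggested in the review.
type CreateTaskResponse struct {
	JobID  string `json:"job_id"`
	TaskID string `json:"task_id"`
}

// JobResponse is a hypothetical job status body; task_id is omitted from
// the JSON when it is empty.
type JobResponse struct {
	ID     string `json:"id"`
	Status string `json:"status"`
	TaskID string `json:"task_id,omitempty"`
}

// encode marshals v to a compact JSON string.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(encode(CreateTaskResponse{JobID: "Qwerty", TaskID: "tpi-thanksfully-xyz"}))
	fmt.Println(encode(JobResponse{ID: "evwFZnRn", Status: "done"}))
}
```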

0x2b3bfa0 commented 1 year ago

Closing as per https://github.com/iterative/leo-server/pull/2#issuecomment-1321749990