iterative / terraform-provider-iterative

☁️ Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes
https://registry.terraform.io/providers/iterative/iterative/latest/docs
Apache License 2.0

Env vars and another unknown #744

Closed coyotespike closed 1 year ago

coyotespike commented 1 year ago

Hey y'all,

Thanks for this awesome tool and the fantastic blog posts introducing it.

After cloning the repo, I created variables.tf and terraform.tfvars files, as I've done for my other projects. Terraform should read environment variables in from these files, so I put my AWS auth credentials into the tfvars file.

Running apply gave me this error:

TPI [INFO] Creation may take several minutes (consider increasing `timeout` https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#timeout). Please wait.
╷
│ Error: Request cancelled
│ 
│   with iterative_task.example-gpu,
│   on main.tf line 9, in resource "iterative_task" "example-gpu":
│    9: resource "iterative_task" "example-gpu" {
│ 
│ The plugin.(*GRPCProvider).ApplyResourceChange request was cancelled.
╵

Stack trace from the terraform-provider-iterative plugin:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x27f9d9f]

Exporting them on the command line got me much farther in the process, but then it crashed again.

TPI [INFO] Creating resources...                                                            
TPI [INFO] [1/12] Parsing PermissionSet...                                                  
TPI [INFO] [2/12] Importing DefaultVPC...                                                   
TPI [INFO] [3/12] Importing DefaultVPCSubnets...                                            
TPI [INFO] [4/12] Reading Image...                                                          
TPI [INFO] [5/12] Creating Bucket...                                                        
TPI [INFO] [6/12] Creating SecurityGroup...                                                 
TPI [INFO] [7/12] Creating KeyPair...                                                       
TPI [INFO] [8/12] Reading Credentials...                                                    
TPI [INFO] [9/12] Creating LaunchTemplate...                                                
TPI [INFO] [10/12] Creating AutoScalingGroup...                                             
TPI [INFO] [11/12] Uploading Directory...                                                   
TPI [INFO] Transferring 54.74MB (5175 files)...                                             
iterative_task.example-gpu: Still creating... [10s elapsed]
...
iterative_task.example-gpu: Still creating... [4m10s elapsed]
╷
│ Error: Plugin did not respond
│ 
│   with iterative_task.example-gpu,
│   on main.tf line 9, in resource "iterative_task" "example-gpu":
│    9: resource "iterative_task" "example-gpu" {
│ 
│ The plugin encountered an error, and failed to respond to the plugin.(*GRPCProvider).ApplyResourceChange call. The plugin logs may contain more details.
╵

Stack trace from the terraform-provider-iterative plugin:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x27f9d9f]

And then a very similar runtime error.

Would love any help getting the plugin to work!

dacbd commented 1 year ago

Can you provide a sample of your terraform files so that I can try and reproduce this error?

coyotespike commented 1 year ago

@dacbd Sure! All the files are in the link below. But these files are just the examples from the tutorials - there is nothing custom at all.

https://github.com/coyotespike/terraform-scripts/tree/main/cloud-gpu-training

I would assume it's an AWS config issue, except that the error is also present with GCP, and I can see some usage in AWS.

I love this idea and will transition away from Google Colab asap.

coyotespike commented 1 year ago

I just checked again and can confirm:

It seems that resources are getting created before the task errors out. However, Terraform is not aware of them.

0x2b3bfa0 commented 1 year ago

@coyotespike, can you please export TF_LOG=DEBUG before running any commands and paste the output?

coyotespike commented 1 year ago

@0x2b3bfa0 Absolutely! Output is below.

I think the most helpful new info from the debug log may be the line error="rpc error: code = Unavailable desc = error reading from server: EOF"

Note that it crashes right away for Azure; it's most successful with AWS.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes
yes

2023-03-14T10:51:58.131-0500 [INFO]  backend/local: apply calling Apply
2023-03-14T10:51:58.131-0500 [DEBUG] Building and walking apply graph for NormalMode plan
2023-03-14T10:51:58.131-0500 [DEBUG] Resource state not found for node "iterative_task.example", instance iterative_task.example
2023-03-14T10:51:58.131-0500 [DEBUG] ProviderTransformer: "iterative_task.example (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/iterative/iterative"]
2023-03-14T10:51:58.131-0500 [DEBUG] ProviderTransformer: "iterative_task.example" (*terraform.NodeApplyableResourceInstance) needs provider["registry.terraform.io/iterative/iterative"]
2023-03-14T10:51:58.131-0500 [DEBUG] ReferenceTransformer: "iterative_task.example (expand)" references: []
2023-03-14T10:51:58.131-0500 [DEBUG] ReferenceTransformer: "iterative_task.example" references: []
2023-03-14T10:51:58.131-0500 [DEBUG] ReferenceTransformer: "provider[\"registry.terraform.io/iterative/iterative\"]" references: []
2023-03-14T10:51:58.132-0500 [DEBUG] Starting graph walk: walkApply
2023-03-14T10:51:58.161-0500 [DEBUG] created provider logger: level=info
2023-03-14T10:51:58.161-0500 [INFO]  provider: configuring client automatic mTLS
2023-03-14T10:51:58.220-0500 [INFO]  provider.terraform-provider-iterative: configuring server automatic mTLS: timestamp=2023-03-14T10:51:58.220-0500
2023-03-14T10:51:58.256-0500 [INFO]  provider.terraform-provider-iterative: 2023/03/14 10:51:58 [WARN] Truncating attribute path of 0 diagnostics for TypeSet: timestamp=2023-03-14T10:51:58.256-0500
iterative_task.example: Creating...
2023-03-14T10:51:58.258-0500 [INFO]  Starting apply for iterative_task.example
2023-03-14T10:51:58.258-0500 [DEBUG] iterative_task.example: applying the planned Create change
2023-03-14T10:51:58.259-0500 [INFO]  provider.terraform-provider-iterative: 2023/03/14 10:51:58 [DEBUG] setting computed for "addresses" from ComputedKeys: timestamp=2023-03-14T10:51:58.259-0500
2023-03-14T10:51:58.259-0500 [INFO]  provider.terraform-provider-iterative: 2023/03/14 10:51:58 [DEBUG] setting computed for "status" from ComputedKeys: timestamp=2023-03-14T10:51:58.259-0500
2023-03-14T10:51:58.259-0500 [INFO]  provider.terraform-provider-iterative: 2023/03/14 10:51:58 [DEBUG] setting computed for "logs" from ComputedKeys: timestamp=2023-03-14T10:51:58.259-0500
2023-03-14T10:51:58.259-0500 [INFO]  provider.terraform-provider-iterative: 2023/03/14 10:51:58 [DEBUG] setting computed for "events" from ComputedKeys: timestamp=2023-03-14T10:51:58.259-0500
TPI [INFO] Creation may take several minutes (consider increasing `timeout` https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#timeout). Please wait.
TPI [INFO] Creating resources...                                                            
TPI [INFO] [1/11] Parsing PermissionSet...                                                  
TPI [INFO] [2/11] Importing DefaultVPC...                                                   
TPI [INFO] [3/11] Importing DefaultVPCSubnets...                                            
TPI [INFO] [4/11] Reading Image...                                                          
TPI [INFO] [5/11] Creating Bucket...                                                        
TPI [INFO] [6/11] Creating SecurityGroup...                                                 
TPI [INFO] [7/11] Creating KeyPair...                                                       
TPI [INFO] [8/11] Reading Credentials...                                                    
TPI [INFO] [9/11] Creating LaunchTemplate...                                                
TPI [INFO] [10/11] Creating AutoScalingGroup...                                             
iterative_task.example: Still creating... [10s elapsed]
TPI [INFO] [11/11] Starting task...                                                         
TPI [INFO] Creation completed                                                               
2023-03-14T10:52:09.101-0500 [ERROR] plugin.(*GRPCProvider).ApplyResourceChange: error="rpc error: code = Unavailable desc = error reading from server: EOF"
2023-03-14T10:52:09.126-0500 [ERROR] vertex "iterative_task.example" error: Plugin did not respond
╷
│ Error: Plugin did not respond
│ 
│   with iterative_task.example,
│   on main.tf line 13, in resource "iterative_task" "example":
│   13: resource "iterative_task" "example" {
│ 
│ The plugin encountered an error, and failed to respond to the plugin.(*GRPCProvider).ApplyResourceChange call. The plugin logs may contain more details.
╵

Stack trace from the terraform-provider-iterative plugin:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x27f9d9f]

goroutine 55 [running]:
terraform-provider-iterative/iterative/utils.SystemInfo()
    terraform-provider-iterative/iterative/utils/analytics.go:72 +0x3f
terraform-provider-iterative/iterative/utils.JitsuEventPayload({0x45433d7, 0xa}, {0x0?, 0x0}, 0xc0009f3e00)
    terraform-provider-iterative/iterative/utils/analytics.go:317 +0x58
terraform-provider-iterative/iterative/utils.SendJitsuEvent({0x45433d7, 0xa}, {0x0, 0x0}, 0xc0006d42c0?)
    terraform-provider-iterative/iterative/utils/analytics.go:372 +0xf9
terraform-provider-iterative/iterative.resourceTaskCreate({0x4c8eef0, 0xc0006d8000}, 0x3f8ace0?, {0x0, 0x0})
    terraform-provider-iterative/iterative/resource_task.go:231 +0x7c5
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).create(0xc000172700, {0x4c8ee80, 0xc000a7dec0}, 0x2?, {0x0, 0x0})
    github.com/hashicorp/terraform-plugin-sdk/v2@v2.8.0/helper/schema/resource.go:330 +0x12e
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).Apply(0xc000172700, {0x4c8ee80, 0xc000a7dec0}, 0xc000a90d00, 0xc000ab0b00, {0x0, 0x0})
    github.com/hashicorp/terraform-plugin-sdk/v2@v2.8.0/helper/schema/resource.go:456 +0x705
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*GRPCProviderServer).ApplyResourceChange(0xc000197170, {0x4c8ee80, 0xc000a7dec0}, 0xc000ab23c0)
    github.com/hashicorp/terraform-plugin-sdk/v2@v2.8.0/helper/schema/grpc_provider.go:977 +0xde5
github.com/hashicorp/terraform-plugin-go/tfprotov5/tf5server.(*server).ApplyResourceChange(0xc00013d600, {0x4c8ef28?, 0xc000a9bd40?}, 0x0?)
    github.com/hashicorp/terraform-plugin-go@v0.4.0/tfprotov5/tf5server/server.go:332 +0x6c
github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/tfplugin5._Provider_ApplyResourceChange_Handler({0x43d42e0?, 0xc00013d600}, {0x4c8ef28, 0xc000a9bd40}, 0xc000804b60, 0x0)
    github.com/hashicorp/terraform-plugin-go@v0.4.0/tfprotov5/internal/tfplugin5/tfplugin5_grpc.pb.go:380 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0004cc960, {0x4c98680, 0xc0002f6d00}, 0xc000a9eea0, 0xc000910cc0, 0x64e5520, 0x0)
    google.golang.org/grpc@v1.49.0/server.go:1301 +0xb2b
google.golang.org/grpc.(*Server).handleStream(0xc0004cc960, {0x4c98680, 0xc0002f6d00}, 0xc000a9eea0, 0x0)
    google.golang.org/grpc@v1.49.0/server.go:1642 +0xa2f
google.golang.org/grpc.(*Server).serveStreams.func1.2()
    google.golang.org/grpc@v1.49.0/server.go:938 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
    google.golang.org/grpc@v1.49.0/server.go:936 +0x28a

Error: The terraform-provider-iterative plugin crashed!

This is always indicative of a bug within the plugin. It would be immensely
helpful if you could report the crash with the plugin's maintainers so that it
can be fixed. The output above should help diagnose the issue.
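
The trace above points at utils.SystemInfo (analytics.go:72) as the frame where the nil pointer dereference happens, and the small fault address (addr=0x28) is consistent with reading a struct field through a nil pointer. The actual code at that line is not shown in this thread, so the following is only a generic Go sketch of this class of bug and a defensive guard. hostInfo, lookupHost, and both platform functions are hypothetical names, not the provider's API.

```go
package main

import (
	"errors"
	"fmt"
)

// hostInfo is a hypothetical stand-in for whatever system details are collected.
type hostInfo struct {
	Platform string
}

// lookupHost mimics a call that can fail and return a nil pointer with an error.
func lookupHost(fail bool) (*hostInfo, error) {
	if fail {
		return nil, errors.New("host lookup failed")
	}
	return &hostInfo{Platform: "darwin"}, nil
}

// platformUnchecked ignores the error; on failure it dereferences nil and panics.
func platformUnchecked(fail bool) string {
	info, _ := lookupHost(fail)
	return info.Platform
}

// platformChecked returns a fallback value instead of panicking.
func platformChecked(fail bool) string {
	info, err := lookupHost(fail)
	if err != nil || info == nil {
		return "unknown"
	}
	return info.Platform
}

// panics reports whether calling f panics.
func panics(f func()) (p bool) {
	defer func() {
		if recover() != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	fmt.Println("unchecked panics on failure:", panics(func() { platformUnchecked(true) }))
	fmt.Println("checked result on failure:", platformChecked(true))
}
```

A guard like this matters most in best-effort code paths such as analytics, where a failed lookup on one machine should not take down the whole apply.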

coyotespike commented 1 year ago

So I SSH'd into a Digital Ocean Linux machine, cloned my repo and installed Terraform, and ran it.

And what do you know, it worked perfectly! Delightful in itself, and shows the problem is something to do with my local environment/OS. I may try running on another OSX machine tomorrow.

dacbd commented 1 year ago

Thanks, I had trouble reproducing this exact error. Note that if you use the leo binary from the releases, you can get slightly easier access to your script's logs: ./leo --cloud aws --region us-west-1 read --follow tpi-xxxx

coyotespike commented 1 year ago

Thanks, good to know! I will give that a go.

I ran it on an M1 OSX MBP, and it also worked.

I think it is safe to say there is something odd going on with my local environment (non-M1 OSX). I can't imagine what; I don't have a firewall or anything super weird.

coyotespike commented 1 year ago

Okay, I think I am closing in on a solution here. The jobs did not crash in non-local environments, but they did not succeed either.

On Amazon, I have no vCPUs allowed for F spot instances. This fits the behavior observed - the plugin gets all the way through creating a Bucket, SecurityGroup, KeyPair, and so on. And then fails when it actually starts the task, because there are no resources allowed.

AWS standard spot instances are A, C, D, H, I, M, R, T, Z. But as far as I can tell, TPI only fires off requests for F spot instances.

I have put in a request to up the quota/limit for spot instances.
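
AWS Spot quotas are counted in vCPUs per instance family group, so a request fits only when the number of instances multiplied by the vCPUs per instance stays within the quota for that group. The instance size TPI requests is not stated in this thread, so the figures below are hypothetical; this is just a minimal Go sketch of the arithmetic, with fitsQuota as an illustrative helper name.

```go
package main

import "fmt"

// fitsQuota returns the total vCPUs a request needs and whether that total
// stays within a vCPU-based quota.
func fitsQuota(instances, vcpusEach, quota int) (requested int, ok bool) {
	requested = instances * vcpusEach
	return requested, requested <= quota
}

func main() {
	// Hypothetical request: one instance with 4 vCPUs.
	for _, quota := range []int{0, 64} {
		requested, ok := fitsQuota(1, 4, quota)
		fmt.Printf("quota=%d requested=%d fits=%t\n", quota, requested, ok)
	}
}
```

With a quota of zero, every request fails this check no matter how small the instance is, which matches resources being created up to the point where the instance itself is requested.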

coyotespike commented 1 year ago

@dacbd Do you happen to know how many vCPUs on AWS this request takes?

I have 64 vCPUs for F Spot instances, 768 HPC vCPUs, and 256 vCPUs for standard Spot instances.

However, locally the request is still failing with panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x27f9d9f]

On other computers, it fails with a different error message that indicates there are not enough spot instances.

Perhaps I haven't understood the documentation, but I don't see a way to change the type of spot instance requested.

Thanks for your patience 😅

dacbd commented 1 year ago

@coyotespike can you share a snippet of what you are doing? I'm not exactly sure what you are referring to.

I believe there is a default limit of 256 vCPUs from AWS per region on AWS accounts. Is this what you are referring to?

0x2b3bfa0 commented 1 year ago

Closing as stale; @coyotespike, please feel free to reopen if you're still experiencing this issue.