Closed coyotespike closed 1 year ago
Can you provide a sample of your terraform files so that I can try and reproduce this error?
@dacbd Sure! All the files are in the link below. But these files are just the examples from the tutorials - there is nothing custom at all.
https://github.com/coyotespike/terraform-scripts/tree/main/cloud-gpu-training
I would assume it's an AWS config issue, except that the error is also present with GCP, and I can see some usage in AWS.
I love this idea and will transition away from Google Colab asap.
I just checked again and can confirm:
It seems that resources are getting created before the task errors out. However, terraform is not aware of them.
@coyotespike, can you please run export TF_LOG=DEBUG before running any commands and paste the output?
@0x2b3bfa0 Absolutely! Output is below.
I think the most helpful new info from the debug log may be the line error="rpc error: code = Unavailable desc = error reading from server: EOF"
Note that it crashes right away for Azure; it gets furthest with AWS.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
yes
2023-03-14T10:51:58.131-0500 [INFO] backend/local: apply calling Apply
2023-03-14T10:51:58.131-0500 [DEBUG] Building and walking apply graph for NormalMode plan
2023-03-14T10:51:58.131-0500 [DEBUG] Resource state not found for node "iterative_task.example", instance iterative_task.example
2023-03-14T10:51:58.131-0500 [DEBUG] ProviderTransformer: "iterative_task.example (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/iterative/iterative"]
2023-03-14T10:51:58.131-0500 [DEBUG] ProviderTransformer: "iterative_task.example" (*terraform.NodeApplyableResourceInstance) needs provider["registry.terraform.io/iterative/iterative"]
2023-03-14T10:51:58.131-0500 [DEBUG] ReferenceTransformer: "iterative_task.example (expand)" references: []
2023-03-14T10:51:58.131-0500 [DEBUG] ReferenceTransformer: "iterative_task.example" references: []
2023-03-14T10:51:58.131-0500 [DEBUG] ReferenceTransformer: "provider[\"registry.terraform.io/iterative/iterative\"]" references: []
2023-03-14T10:51:58.132-0500 [DEBUG] Starting graph walk: walkApply
2023-03-14T10:51:58.161-0500 [DEBUG] created provider logger: level=info
2023-03-14T10:51:58.161-0500 [INFO] provider: configuring client automatic mTLS
2023-03-14T10:51:58.220-0500 [INFO] provider.terraform-provider-iterative: configuring server automatic mTLS: timestamp=2023-03-14T10:51:58.220-0500
2023-03-14T10:51:58.256-0500 [INFO] provider.terraform-provider-iterative: 2023/03/14 10:51:58 [WARN] Truncating attribute path of 0 diagnostics for TypeSet: timestamp=2023-03-14T10:51:58.256-0500
iterative_task.example: Creating...
2023-03-14T10:51:58.258-0500 [INFO] Starting apply for iterative_task.example
2023-03-14T10:51:58.258-0500 [DEBUG] iterative_task.example: applying the planned Create change
2023-03-14T10:51:58.259-0500 [INFO] provider.terraform-provider-iterative: 2023/03/14 10:51:58 [DEBUG] setting computed for "addresses" from ComputedKeys: timestamp=2023-03-14T10:51:58.259-0500
2023-03-14T10:51:58.259-0500 [INFO] provider.terraform-provider-iterative: 2023/03/14 10:51:58 [DEBUG] setting computed for "status" from ComputedKeys: timestamp=2023-03-14T10:51:58.259-0500
2023-03-14T10:51:58.259-0500 [INFO] provider.terraform-provider-iterative: 2023/03/14 10:51:58 [DEBUG] setting computed for "logs" from ComputedKeys: timestamp=2023-03-14T10:51:58.259-0500
2023-03-14T10:51:58.259-0500 [INFO] provider.terraform-provider-iterative: 2023/03/14 10:51:58 [DEBUG] setting computed for "events" from ComputedKeys: timestamp=2023-03-14T10:51:58.259-0500
TPI [INFO] Creation may take several minutes (consider increasing `timeout` https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#timeout). Please wait.
TPI [INFO] Creating resources...
TPI [INFO] [1/11] Parsing PermissionSet...
TPI [INFO] [2/11] Importing DefaultVPC...
TPI [INFO] [3/11] Importing DefaultVPCSubnets...
TPI [INFO] [4/11] Reading Image...
TPI [INFO] [5/11] Creating Bucket...
TPI [INFO] [6/11] Creating SecurityGroup...
TPI [INFO] [7/11] Creating KeyPair...
TPI [INFO] [8/11] Reading Credentials...
TPI [INFO] [9/11] Creating LaunchTemplate...
TPI [INFO] [10/11] Creating AutoScalingGroup...
iterative_task.example: Still creating... [10s elapsed]
TPI [INFO] [11/11] Starting task...
TPI [INFO] Creation completed
2023-03-14T10:52:09.101-0500 [ERROR] plugin.(*GRPCProvider).ApplyResourceChange: error="rpc error: code = Unavailable desc = error reading from server: EOF"
2023-03-14T10:52:09.126-0500 [ERROR] vertex "iterative_task.example" error: Plugin did not respond
╷
│ Error: Plugin did not respond
│
│ with iterative_task.example,
│ on main.tf line 13, in resource "iterative_task" "example":
│ 13: resource "iterative_task" "example" {
│
│ The plugin encountered an error, and failed to respond to the plugin.(*GRPCProvider).ApplyResourceChange call. The plugin logs may contain more details.
╵
Stack trace from the terraform-provider-iterative plugin:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x27f9d9f]
goroutine 55 [running]:
terraform-provider-iterative/iterative/utils.SystemInfo()
terraform-provider-iterative/iterative/utils/analytics.go:72 +0x3f
terraform-provider-iterative/iterative/utils.JitsuEventPayload({0x45433d7, 0xa}, {0x0?, 0x0}, 0xc0009f3e00)
terraform-provider-iterative/iterative/utils/analytics.go:317 +0x58
terraform-provider-iterative/iterative/utils.SendJitsuEvent({0x45433d7, 0xa}, {0x0, 0x0}, 0xc0006d42c0?)
terraform-provider-iterative/iterative/utils/analytics.go:372 +0xf9
terraform-provider-iterative/iterative.resourceTaskCreate({0x4c8eef0, 0xc0006d8000}, 0x3f8ace0?, {0x0, 0x0})
terraform-provider-iterative/iterative/resource_task.go:231 +0x7c5
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).create(0xc000172700, {0x4c8ee80, 0xc000a7dec0}, 0x2?, {0x0, 0x0})
github.com/hashicorp/terraform-plugin-sdk/v2@v2.8.0/helper/schema/resource.go:330 +0x12e
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).Apply(0xc000172700, {0x4c8ee80, 0xc000a7dec0}, 0xc000a90d00, 0xc000ab0b00, {0x0, 0x0})
github.com/hashicorp/terraform-plugin-sdk/v2@v2.8.0/helper/schema/resource.go:456 +0x705
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*GRPCProviderServer).ApplyResourceChange(0xc000197170, {0x4c8ee80, 0xc000a7dec0}, 0xc000ab23c0)
github.com/hashicorp/terraform-plugin-sdk/v2@v2.8.0/helper/schema/grpc_provider.go:977 +0xde5
github.com/hashicorp/terraform-plugin-go/tfprotov5/tf5server.(*server).ApplyResourceChange(0xc00013d600, {0x4c8ef28?, 0xc000a9bd40?}, 0x0?)
github.com/hashicorp/terraform-plugin-go@v0.4.0/tfprotov5/tf5server/server.go:332 +0x6c
github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/tfplugin5._Provider_ApplyResourceChange_Handler({0x43d42e0?, 0xc00013d600}, {0x4c8ef28, 0xc000a9bd40}, 0xc000804b60, 0x0)
github.com/hashicorp/terraform-plugin-go@v0.4.0/tfprotov5/internal/tfplugin5/tfplugin5_grpc.pb.go:380 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0004cc960, {0x4c98680, 0xc0002f6d00}, 0xc000a9eea0, 0xc000910cc0, 0x64e5520, 0x0)
google.golang.org/grpc@v1.49.0/server.go:1301 +0xb2b
google.golang.org/grpc.(*Server).handleStream(0xc0004cc960, {0x4c98680, 0xc0002f6d00}, 0xc000a9eea0, 0x0)
google.golang.org/grpc@v1.49.0/server.go:1642 +0xa2f
google.golang.org/grpc.(*Server).serveStreams.func1.2()
google.golang.org/grpc@v1.49.0/server.go:938 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
google.golang.org/grpc@v1.49.0/server.go:936 +0x28a
Error: The terraform-provider-iterative plugin crashed!
This is always indicative of a bug within the plugin. It would be immensely
helpful if you could report the crash with the plugin's maintainers so that it
can be fixed. The output above should help diagnose the issue.
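For readers less familiar with this class of Go crash: the trace above places the panic in utils.SystemInfo, reached from SendJitsuEvent inside resourceTaskCreate, after the "Creation completed" message. That ordering is consistent with cloud resources existing even though Terraform never recorded them. The sketch below is not the plugin's code; `hostInfo`, `lookupHost`, `describeUnsafe`, and `describeSafe` are hypothetical names that only illustrate how dereferencing an unchecked nil pointer produces this panic and how a guard avoids it.

```go
package main

import "fmt"

// hostInfo is a hypothetical stand-in for whatever value a provider might
// inspect while building an analytics payload.
type hostInfo struct {
	OS      string
	Release string
}

// lookupHost is a hypothetical lookup that can fail; on failure it returns
// a nil pointer together with an error.
func lookupHost(ok bool) (*hostInfo, error) {
	if !ok {
		return nil, fmt.Errorf("host information unavailable")
	}
	return &hostInfo{OS: "darwin", Release: "example"}, nil
}

// describeUnsafe ignores the error and dereferences the result. When the
// lookup fails, this panics with a nil pointer dereference.
func describeUnsafe(ok bool) string {
	info, _ := lookupHost(ok)
	return info.OS + " " + info.Release
}

// describeSafe checks the error and the pointer before using it, and falls
// back to a placeholder instead of crashing.
func describeSafe(ok bool) string {
	info, err := lookupHost(ok)
	if err != nil || info == nil {
		return "unknown"
	}
	return info.OS + " " + info.Release
}

// panics reports whether calling f panics.
func panics(f func()) (p bool) {
	defer func() {
		if r := recover(); r != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	fmt.Println(describeSafe(true))
	fmt.Println(describeSafe(false))
	fmt.Println("unsafe variant panics:", panics(func() { describeUnsafe(false) }))
}
```

The guarded variant shows why a failure in an optional step such as telemetry does not have to abort the whole apply.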
So I SSH'd into a Digital Ocean Linux machine, cloned my repo and installed Terraform, and ran it.
And what do you know, it worked perfectly! Delightful in itself, and shows the problem is something to do with my local environment/OS. I may try running on another OSX machine tomorrow.
Thanks, I had trouble getting this same exact error. Note that if you use the leo binary from the releases, you can get slightly easier access to your script's logs: ./leo --cloud aws --region us-west-1 read --follow tpi-xxxx
Thanks, good to know! I will give that a go
I ran it on an M1 OSX MBP, and it also worked.
I think it is safe to say there is something odd going on with my local environment (non-M1 OSX). I can't imagine what, I don't have a firewall or anything super weird.
Okay, I think I am closing in on a solution here. The jobs did not crash in non-local environments, but they did not succeed either.
On Amazon, I have no vCPUs allowed for F spot instances. This fits the behavior observed - the plugin gets all the way through creating a Bucket, SecurityGroup, KeyPair, and so on. And then fails when it actually starts the task, because there are no resources allowed.
AWS standard spot instances are A, C, D, H, I, M, R, T, Z. But as far as I can tell, TPI only fires off requests for F spot instances.
I have put in a request to up the quota/limit for spot instances.
@dacbd Do you happen to know how many vCPUs on AWS this request takes?
I have 64 vCPUs for F Spot instances, 768 vCPUs for HPC, and 256 vCPUs for standard Spot instances.
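Since these limits are counted in vCPUs per instance class, whether a request fits comes down to simple arithmetic. A minimal sketch, using the quota figures quoted above; the 8-vCPU instance size is a made-up example (the thread never states how many vCPUs the task actually requests), and `fitsQuota` is a hypothetical helper, not part of TPI or the AWS SDK.

```go
package main

import "fmt"

// spotQuota holds the per-class Spot vCPU limits reported in the thread.
var spotQuota = map[string]int{
	"F":        64,
	"HPC":      768,
	"Standard": 256,
}

// fitsQuota reports whether launching `instances` machines with
// `vcpusPerInstance` vCPUs each stays within the limit for `class`,
// given `inUse` vCPUs already running. An unknown class never fits.
func fitsQuota(quota map[string]int, class string, vcpusPerInstance, instances, inUse int) bool {
	limit, ok := quota[class]
	if !ok {
		return false
	}
	return inUse+vcpusPerInstance*instances <= limit
}

func main() {
	// Hypothetical request: one 8-vCPU instance in each class.
	for _, class := range []string{"F", "HPC", "Standard"} {
		fmt.Printf("%s: fits=%v\n", class, fitsQuota(spotQuota, class, 8, 1, 0))
	}
	// A limit of zero, as first described for F Spot instances, rejects any request.
	fmt.Println("F with zero quota: fits =", fitsQuota(map[string]int{"F": 0}, "F", 8, 1, 0))
}
```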
However locally the request is still failing with
panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x27f9d9f]
On other computers, it fails with a different error message that indicates there are not enough spot instances.
Perhaps I haven't understood the documentation, but I don't see a way to change the type of spot instance requested.
thanks for your patience 😅
@coyotespike can you share a snippet of what you are doing? I'm not exactly sure what you are referring to.
I believe there is a default limit of 256 vCPUs from AWS per region on AWS accounts; is this what you are referring to?
Closing as stale; @coyotespike, please feel free to reopen if you're still experiencing this issue.
Hey y'all,
Thanks for this awesome tool and the fantastic blog posts introducing them.
After cloning the repo, I created variables.tf and terraform.tfvars files, as I've done for my other projects. Terraform should read environment variables in from these files, so I put my AWS auth credentials into the tfvars file. Running apply gave me this error:
Exporting them on the command line got me much farther in the process, but then it crashed again.
And then a very similar runtime error.
train.py in the example repo, but no love.
Would love any help getting the plugin to work!
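A note on the credentials step described in this post: a terraform.tfvars file assigns values to Terraform input variables; it does not export anything into the process environment. That would explain why exporting the credentials in the shell got further. The sketch below checks whether credential variables are actually exported before running apply. It assumes the provider reads the standard AWS SDK variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; `missingVars` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
)

// missingVars returns the names for which lookup reports no value or an
// empty value. lookup has the same shape as os.LookupEnv, so a fake can be
// substituted when testing.
func missingVars(names []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range names {
		if value, ok := lookup(name); !ok || value == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Standard AWS SDK credential variables. That the provider reads
	// exactly these is an assumption made for this sketch.
	required := []string{"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"}

	if missing := missingVars(required, os.LookupEnv); len(missing) > 0 {
		fmt.Println("not exported:", missing)
		return
	}
	fmt.Println("all credential variables are exported")
}
```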