scylladb / scylla-stress-orchestrator

Beginner doesn't know how to proceed after failure #39

Open nyh opened 2 years ago

nyh commented 2 years ago

I'm running the Scylla Stress Orchestrator for the first time, following the instructions in https://github.com/scylladb/scylla-stress-orchestrator/wiki/Welcome-to-the-Scylla-Stress-Orchestrator. I did exactly what's written there, with one change to variables.tf: I set cluster_instance_type to "i4i.xlarge" (because I was told that this instance type is more cost-effective). I don't know whether this change caused the problem I hit next.

I then ran provision_terraform. It printed a lot of seemingly successful output as resources were being created, and then, finally, an error:

15:24:09 INFO  null_resource.configure-prometheus: Creation complete after 1m35s [id=5497209349854640368]
15:24:17 INFO  null_resource.cluster[0]: Still creating... [2m30s elapsed]
15:24:27 INFO  null_resource.cluster[0]: Still creating... [2m40s elapsed]
15:24:37 INFO  null_resource.cluster[0]: Still creating... [2m50s elapsed]
15:24:47 INFO  null_resource.cluster[0]: Still creating... [3m0s elapsed]
15:24:48 INFO  null_resource.cluster[0] (remote-exec): Job for scylla-server.service failed because the control process exited with error code. See "systemctl status scylla-server.service" and "journalctl -xe" for details.
15:24:48 WARN  ╷
15:24:48 WARN  │ Error: remote-exec provisioner error
15:24:48 WARN  │ 
15:24:48 WARN  │   with null_resource.cluster[0],
15:24:48 WARN  │   on main.tf line 163, in resource "null_resource" "cluster":
15:24:48 WARN  │  163:     provisioner "remote-exec" {
15:24:48 WARN  │ 
15:24:48 WARN  │ error executing "/tmp/terraform_723494412.sh": Process exited with status 1
15:24:48 WARN  ╵
Traceback (most recent call last):
  File "/home/nyh/.local/bin/provision_terraform", line 8, in <module>
    sys.exit(provision())
  File "/home/nyh/.local/lib/python3.9/site-packages/scyllaso/bin/provision_terraform.py", line 32, in provision
    terraform.apply(get_plan(args), workspace=args.workspace, options=args.options)
  File "/home/nyh/.local/lib/python3.9/site-packages/scyllaso/terraform.py", line 39, in apply
    raise Exception(f'Failed terraform apply, plan [{terraform_plan}], exitcode={exitcode} command=[{cmd}])')
Exception: Failed terraform apply, plan [ec2-scylla], exitcode=1 command=[terraform -chdir=ec2-scylla apply  -auto-approve])

As a beginner, I was baffled as to what this means. I see the message "Job for scylla-server.service failed because the control process exited with error code. See "systemctl status scylla-server.service" and "journalctl -xe" for details.", but how can I do that? I don't know the IP addresses of the cluster that was just created...

Is the cluster still alive? I see there are the commands "provision_terraform" and "unprovision_terraform", but how do I check what is currently provisioned? I wish that, instead of the separate provision_terraform and unprovision_terraform commands, we had a single command, e.g., "sso", with an argument saying what to do: provision, unprovision, list, etc.
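To make the single-command idea concrete, here is a rough sketch of what I have in mind. Everything in it is hypothetical: the name `sso`, the `plan` argument, and the mapping of actions to Terraform arguments are my assumptions. Only `apply -auto-approve` is taken from the command shown in the traceback above; `destroy` and `state list` are standard Terraform subcommands that I am guessing would fit unprovision and list.

```python
import argparse

# Hypothetical mapping from an "sso" action to Terraform arguments.
# "apply -auto-approve" matches the command in the traceback above;
# "destroy" and "state list" are assumptions, not what scyllaso does today.
TERRAFORM_ARGS = {
    "provision": ["apply", "-auto-approve"],
    "unprovision": ["destroy", "-auto-approve"],
    "list": ["state", "list"],
}


def build_parser():
    """Build the parser for the hypothetical `sso <action> <plan>` command."""
    parser = argparse.ArgumentParser(prog="sso")
    parser.add_argument("action", choices=sorted(TERRAFORM_ARGS))
    parser.add_argument("plan", help="Terraform plan directory, e.g. ec2-scylla")
    return parser


def terraform_command(action, plan):
    """Return the Terraform command line (as a list) for one action."""
    return ["terraform", f"-chdir={plan}", *TERRAFORM_ARGS[action]]


def main(argv=None):
    args = build_parser().parse_args(argv)
    cmd = terraform_command(args.action, args.plan)
    # This sketch only prints the command; a real tool would execute it,
    # for example with subprocess.run(cmd, check=True).
    print(" ".join(cmd))
    return cmd
```

With this, `main(["list", "ec2-scylla"])` would build `terraform -chdir=ec2-scylla state list`, which lists the resources Terraform currently tracks for that plan.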

By the way, the "unprovision_terraform" worked, so clearly the cluster was still alive.

nyh commented 2 years ago

By the way, I confirmed that the instance type is what caused this failure: if I use the default instance type (i3.2xlarge), it works correctly.

That's a separate issue: why the instance type I wanted doesn't work (maybe it just doesn't work with the default AMI). In this issue I'm asking how a beginner can figure out from the error message: 1. that the cluster is still alive, 2. what its IP addresses are, and 3. how to get the real error message about the failed package.
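For what it's worth, here is a minimal sketch of how those three questions could be answered by hand, assuming the plan directory is `ec2-scylla` (as in the traceback) and that the plan declares Terraform outputs containing the node addresses. I have not checked which outputs the plan defines, so the output name, SSH user, and key path used in the example are placeholders. `terraform output -json` is a standard Terraform command that prints each output as an object with a `value` field.

```python
import json
import subprocess


def parse_outputs(raw_json):
    """Turn the JSON printed by `terraform output -json` into {name: value}."""
    return {name: entry["value"] for name, entry in json.loads(raw_json).items()}


def terraform_outputs(plan):
    """Run `terraform output -json` in a plan directory and parse the result."""
    result = subprocess.run(
        ["terraform", f"-chdir={plan}", "output", "-json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_outputs(result.stdout)


def ssh_diagnostics_command(ip, user, key_path):
    """Build an ssh command that runs the two commands the error message suggests.

    The user and key path are assumptions; they depend on the AMI and key pair.
    """
    remote = (
        "systemctl status scylla-server.service --no-pager; "
        "journalctl -xe --no-pager"
    )
    return ["ssh", "-i", key_path, f"{user}@{ip}", remote]
```

Something like `terraform_outputs("ec2-scylla")` would then show whatever addresses the plan exposes, and passing one of them to `ssh_diagnostics_command` gives a command that should print the real scylla-server error.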