scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

output isn't clear enough on wrong parameter given to jobs ("no test_id provided or found") #7152

Open avikivity opened 8 months ago

avikivity commented 8 months ago

Issue description

Steps to Reproduce

  1. Run a performance-regression test: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/avi/job/avi-regression-latency-650gb-grow-shrink/2/console

Expected behavior: [What you expected to happen]

test runs

Actual behavior: [What actually happened]

Error: "no test id provided or found"

I don't know what a test_id is or how to provide it.

Impact

Test doesn't run

How frequently does it reproduce?

Every time

Installation details

SCT Version: https://github.com/xemul/scylla-cluster-tests.git br-branch-perf-v14

Scylla version (or git commit hash): A private version, but didn't get to run it

Logs

avikivity commented 8 months ago

Afterwards it dies with

20:14:06  Failed to find test directory for None
20:14:06  Traceback (most recent call last):
20:14:06    File "/home/ubuntu/scylla-cluster-tests/./sct.py", line 1666, in <module>
20:14:06      cli.main(prog_name="hydra")
20:14:06    File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
20:14:06      rv = self.invoke(ctx)
20:14:06    File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
20:14:06      return _process_result(sub_ctx.command.invoke(sub_ctx))
20:14:06    File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
20:14:06      return ctx.invoke(self.callback, **ctx.params)
20:14:06    File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
20:14:06      return __callback(*args, **kwargs)
20:14:06    File "/home/ubuntu/scylla-cluster-tests/./sct.py", line 1259, in send_email
20:14:06      logs = list_logs_by_test_id(test_results.get('test_id', test_id))
20:14:06  AttributeError: 'NoneType' object has no attribute 'get'
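The traceback above shows `.get` being called on `test_results` while it is `None`. A minimal sketch of a guard that would turn this into an actionable message, assuming a hypothetical helper `resolve_test_id` and a hypothetical `MissingTestIdError` (neither is part of `sct.py`):

```python
from typing import Optional


class MissingTestIdError(RuntimeError):
    """Raised when no test_id can be determined (hypothetical error type)."""


def resolve_test_id(test_results: Optional[dict], test_id: Optional[str]) -> str:
    """Return a usable test_id or fail with an actionable message.

    Hypothetical helper: it mirrors `test_results.get('test_id', test_id)`
    from the traceback, but tolerates `test_results` being None.
    """
    resolved = (test_results or {}).get("test_id", test_id)
    if not resolved:
        raise MissingTestIdError(
            "no test_id provided or found: the test most likely never started; "
            "check the earlier pipeline stages for a parameter validation error"
        )
    return resolved
```

With something like this, the later stages could report why there is no test_id instead of failing with an `AttributeError`.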

avikivity commented 8 months ago

Another error:

20:31:38  Test runner ip in update_sct_runner_tags: None; test_id: None
20:31:38  Traceback (most recent call last):
20:31:38    File "/home/ubuntu/scylla-cluster-tests/./sct.py", line 1666, in <module>
20:31:38      cli.main(prog_name="hydra")
20:31:38    File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
20:31:38      rv = self.invoke(ctx)
20:31:38    File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
20:31:38      return _process_result(sub_ctx.command.invoke(sub_ctx))
20:31:38    File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
20:31:38      return ctx.invoke(self.callback, **ctx.params)
20:31:38    File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
20:31:38      return __callback(*args, **kwargs)
20:31:38    File "/home/ubuntu/scylla-cluster-tests/./sct.py", line 1182, in collect_logs
20:31:38      update_sct_runner_tags(test_id=collector.test_id, tags={"logs_collected": True})
20:31:38    File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_runner.py", line 1060, in update_sct_runner_tags
20:31:38      raise ValueError("update_sct_runner_tags requires either the "
20:31:38  ValueError: update_sct_runner_tags requires either the test_runner_ip or test_id argument to find the runner

but how am I supposed to supply test_id?

avikivity commented 8 months ago

I found that if I set scylla_version it goes away.

fruch commented 8 months ago

It's better to look at such a pipeline like this: https://jenkins.scylladb.com/job/scylla-staging/job/avi/job/avi-regression-latency-650gb-grow-shrink/2/pipeline-graph/

and then at: https://jenkins.scylladb.com/job/scylla-staging/job/avi/job/avi-regression-latency-650gb-grow-shrink/2/pipeline-console/?selected-node=508

and see that it's failing because of this:

19:41:52  + echo 'need to choose one of SCT_AMI_ID_DB_SCYLLA | SCT_SCYLLA_VERSION | SCT_SCYLLA_REPO | SCT_GCE_IMAGE_DB'
19:41:52  need to choose one of SCT_AMI_ID_DB_SCYLLA | SCT_SCYLLA_VERSION | SCT_SCYLLA_REPO | SCT_GCE_IMAGE_DB
19:41:52  + exit 1
script returned exit code 1

@avikivity, would it help if this error appeared in the run description? (i.e. all 3 occurrences of it)

mykaul commented 8 months ago

It'd be great if the Jenkins job descriptions stated which parameters are mandatory. I believe there is NO plugin that actually verifies that, but I may be wrong here.

avikivity commented 8 months ago

> @avikivity, would it help if this error appeared in the run description? (i.e. all 3 occurrences of it)

Yes, it would be helpful.

Once you detect an error, you should stop. Continuing means I have to guess what the problem is.

fruch commented 8 months ago

> Once you detect an error, you should stop. Continuing means I have to guess what the problem is.

It's not as simple as that. In most cases, if we hit an error during the test, the next phases (collecting the logs and cleaning up the resources) are quite important. The cases where we don't even reach the start of the test are rarer.
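The distinction made here could be sketched roughly as follows: post-test stages always run once the test has started, but are skipped entirely when validation fails before the start. This is only an illustration; `run_pipeline` and its stage callables are hypothetical names and do not reflect how the actual Jenkins pipeline is structured.

```python
from typing import Callable, List


def run_pipeline(
    validate: Callable[[], None],
    run_test: Callable[[], None],
    collect_logs: Callable[[], None],
    clean_resources: Callable[[], None],
) -> List[str]:
    """Run the stages and return a record of what happened (hypothetical sketch)."""
    executed: List[str] = []

    # Stage 1: if validation fails, nothing was started, so stop right here.
    try:
        validate()
        executed.append("validate")
    except Exception as exc:
        executed.append(f"validation failed: {exc}")
        return executed

    # Stage 2: the test itself; a failure is recorded but does not stop the pipeline.
    try:
        run_test()
        executed.append("run_test")
    except Exception as exc:
        executed.append(f"test failed: {exc}")

    # Stage 3: the test was started, so logs and cleanup matter even after a failure.
    try:
        collect_logs()
        executed.append("collect_logs")
    finally:
        clean_resources()
        executed.append("clean_resources")
    return executed
```

In this shape, a missing parameter produces a single validation error and no follow-up tracebacks from log collection.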

fruch commented 7 months ago

We should create a new step in the pipeline, as early as possible, that validates the job parameters and fails fast if they are incorrect.
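A minimal sketch of what such a validation step might check, based only on the shell message quoted earlier in this thread. The function name `validate_job_parameters` is hypothetical, and treating "at least one variable set" as valid is an assumption; the real rule may be stricter.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the pipeline log message quoted in this thread.
VERSION_SOURCES = (
    "SCT_AMI_ID_DB_SCYLLA",
    "SCT_SCYLLA_VERSION",
    "SCT_SCYLLA_REPO",
    "SCT_GCE_IMAGE_DB",
)


def validate_job_parameters(env: Optional[Mapping[str, str]] = None) -> None:
    """Fail fast when none of the Scylla version sources is set (hypothetical step).

    Assumption: at least one of VERSION_SOURCES must be non-empty.
    """
    env = os.environ if env is None else env
    provided = [name for name in VERSION_SOURCES if env.get(name, "").strip()]
    if not provided:
        # SystemExit with a string prints the message and exits with status 1.
        raise SystemExit(
            "Invalid job parameters: need to choose one of "
            + " | ".join(VERSION_SOURCES)
        )
```

Running this as the first stage would stop the job with one clear message before any test_id is expected to exist.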

fruch commented 2 months ago

coming back to this one: