juliayakovlev opened 3 weeks ago
The problem is that for some reason this run was not created in Argus at all:
< t:2024-10-27 04:14:54,161 f:base.py l:65 c:argus.client.base p:DEBUG > API Response: {'response': {'arguments': ['Run not found', '7cb52991-e1ae-4e3e-8351-856c6b216b82'], 'exception': 'SCTServiceException'}, 'status': 'error'}
< t:2024-10-27 04:14:54,729 f:base.py l:65 c:argus.client.base p:DEBUG > API Response: {'response': 'Created', 'status': 'ok'}
I don't know the root cause yet.
This could mean that the run failed to be created at a separate stage, but then got re-created by SCT itself, hence the "Run not found" exception. The next response indicates that it got created, but something may have dropped it later. Maybe an ID re-use or a consistency issue. Need to check all stages that interacted with Argus.
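To make that suspected flow concrete, here is a minimal sketch of a check-then-create sequence that would produce exactly this error-then-"Created" pair in the log. The base URL, endpoint paths and payload are placeholders for illustration, not the actual argus-client API:

```python
import requests

ARGUS_API = "https://argus.example.com/api/v1"   # placeholder base URL
RUN_ID = "7cb52991-e1ae-4e3e-8351-856c6b216b82"  # the mixed run from the log above


def ensure_run_exists(session: requests.Session, run_id: str, payload: dict) -> None:
    # 1. SCT first asks Argus whether the run already exists.
    resp = session.get(f"{ARGUS_API}/testrun/{run_id}").json()
    if resp["status"] == "error":
        # 2. "Run not found" (SCTServiceException) -> SCT submits the run itself,
        #    which is why the log shows an error immediately followed by 'Created'.
        created = session.post(f"{ARGUS_API}/testrun", json=payload).json()
        assert created["status"] == "ok", created
```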
It could also mean that another test (read/write) did this, but mixed didn't.
@k0machi what's up with this one?
Seems like we don't have a "Create Argus Test Run" stage in perfRegressionParallelPipeline.groovy.
Regardless, we don't expect creation to fail, and we need to be able to look up the logs for this failure to understand it (on the Argus end).
Without the explicit stage, the run is created inside the "Run SCT Stages" stage, specifically during ClusterTester init, so the cause of the failure should be visible in the logs for that particular SCT run.
Looking into this, I see a weird issue:
So here's the write test, initializing successfully:
< t:2024-10-27 04:13:51,469 f:base.py l:65 c:argus.client.base p:DEBUG > API Response: {'response': {'arguments': ['Run not found', '70fa1e8d-a52e-41ec-aeb0-9da8ef711ec6'], 'exception': 'SCTServiceException'}, 'status': 'error'}
< t:2024-10-27 04:13:52,035 f:base.py l:65 c:argus.client.base p:DEBUG > API Response: {'response': 'Created', 'status': 'ok'}
< t:2024-10-27 04:13:52,387 f:base.py l:65 c:argus.client.base p:DEBUG > API Response: {'response': 'updated', 'status': 'ok'}
< t:2024-10-27 04:13:52,743 f:base.py l:65 c:argus.client.base p:DEBUG > API Response: {'response': 'added', 'status': 'ok'}
And here's the mixed test's argus.log:
< t:2024-10-27 04:14:54,161 f:base.py l:65 c:argus.client.base p:DEBUG > API Response: {'response': {'arguments': ['Run not found', '7cb52991-e1ae-4e3e-8351-856c6b216b82'], 'exception': 'SCTServiceException'}, 'status': 'error'}
< t:2024-10-27 04:14:54,729 f:base.py l:65 c:argus.client.base p:DEBUG > API Response: {'response': 'Created', 'status': 'ok'}
< t:2024-10-27 04:14:55,080 f:base.py l:65 c:argus.client.base p:DEBUG > API Response: {'response': {'arguments': ['Run not found', '7cb52991-e1ae-4e3e-8351-856c6b216b82'], 'exception': 'SCTServiceException'}, 'status': 'error'}
< t:2024-10-27 04:14:55,430 f:base.py l:65 c:argus.client.base p:DEBUG > API Response: {'response': {'arguments': [], 'exception': 'DoesNotExist'}, 'status': 'error'}
< t:2024-10-27 04:14:55,431 f:base.py l:65 c:argus.client.base p:DEBUG > API Response: {'response': {'arguments': ['Run not found', '7cb52991-e1ae-4e3e-8351-856c6b216b82'], 'exception': 'SCTServiceException'}, 'status': 'error'}
< t:2024-10-27 04:15:18,778 f:base.py l:65 c:argus.client.base p:DEBUG > API Response: {'response': {'arguments': ['Run not found', '7cb52991-e1ae-4e3e-8351-856c6b216b82'], 'exception': 'SCTServiceException'}, 'status': 'error'}
< t:2024-10-27 04:15:25,812 f:base.py l:65 c:argus.client.base p:DEBUG > API Response: {'response': {'arguments': [], 'exception': 'DoesNotExist'}, 'status': 'error'}
So this is weird - the API responded that it had successfully saved the run, yet it does not exist for subsequent calls. Manually querying the run doesn't help - it indeed doesn't exist.
I suspect something happened with Scylla at that moment - unfortunately the logs for this are not available anymore (and I suspect the application logs won't show much, since the API did respond with success - which means something happened on Scylla's end).
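One way to check whether this is a per-node consistency issue would be to read the row from each backend node individually, for example by pinning the Python driver to one host at a time. The node IPs and the keyspace/table names below are placeholders, since I don't have the real Argus schema at hand:

```python
from uuid import UUID

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import WhiteListRoundRobinPolicy

RUN_ID = UUID("7cb52991-e1ae-4e3e-8351-856c6b216b82")
NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # placeholder backend node IPs

for host in NODES:
    # Pin every request to a single node so we see that node's local view of the data.
    profile = ExecutionProfile(
        load_balancing_policy=WhiteListRoundRobinPolicy([host]),
        consistency_level=ConsistencyLevel.LOCAL_ONE,
    )
    cluster = Cluster([host], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    session = cluster.connect()
    # Placeholder keyspace/table; substitute the real table Argus stores runs in.
    rows = session.execute("SELECT id FROM argus.test_runs WHERE id = %s", (RUN_ID,))
    print(host, "has the row" if rows.one() else "row missing")
    cluster.shutdown()
```

If the row shows up on some nodes but not others, that would point at a replication/consistency problem rather than the API layer.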
That is very disturbing.
Did you check on each node separately?
It's not the first time we've run into such cases; last time it was during a Scylla upgrade or a node replacement.
What more information can we log to help figure it out next time?
(We should collect logs as soon as issues are reported, or archive logs periodically to S3, or something like that.)
I haven't checked individual nodes yet; I'll do that. We should collect logs periodically - maybe have a GitHub Action that would take a snapshot of the last N hours of production logs each time an issue is reported?
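As a rough sketch of the archiving idea - assuming the logs live on disk and a bucket such as the hypothetical argus-log-archive exists - something like this could run on a schedule or be triggered by an issue webhook:

```python
import gzip
import time
from pathlib import Path

import boto3

BUCKET = "argus-log-archive"        # placeholder bucket name
LOG_DIR = Path("/var/log/argus")    # placeholder log location
WINDOW_HOURS = 6                    # the "last N hours" to snapshot


def archive_recent_logs() -> None:
    s3 = boto3.client("s3")
    cutoff = time.time() - WINDOW_HOURS * 3600
    for log_file in LOG_DIR.glob("*.log"):
        if log_file.stat().st_mtime < cutoff:
            continue  # file was last written outside the snapshot window
        data = gzip.compress(log_file.read_bytes())
        key = f"snapshots/{int(time.time())}/{log_file.name}.gz"
        s3.put_object(Bucket=BUCKET, Key=key, Body=data)


if __name__ == "__main__":
    archive_recent_logs()
```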
scylla-enterprise-perf-regression-predefined-throughput-steps-vnodes
The test runs 3 load stages. The write and read stages passed, but the mixed stage failed with: https://jenkins.scylladb.com/job/scylla-enterprise/job/perf-regression/job/scylla-enterprise-perf-regression-predefined-throughput-steps-vnodes/20/