Closed yarongilor closed 1 week ago
@yarongilor from whom wrote that feature, I would expect to have a bit more insights into why this has failed, and why the check aren't continuing.
i.e. I don't expect to ever see ThreadFailedEvent, if they pop up, it mean the code didn't handled some exception, and that thread stop executing
probably failed for the same root cause of https://github.com/scylladb/scylla-cluster-tests/issues/8292
probably failed for the same root cause of #8292
@yarongilor
probably failed for the same root cause of #8292
@yarongilor #8292 is now closed, that meaning this can be closed ?
No, sorry, it was mistakenly not related issue. As for what happened here, it seems that during a refactor by @temichus , in this commit: https://github.com/scylladb/scylla-cluster-tests/commit/64ded4b13653563c582facd9b1aa554dfeb4a72e#diff-75fbcbc6f5d81770abbb95177a9f3b80c7a7f809503152dcc17d64a779637a52
the line of result.fetch_next_page()
is deleted from def fetch_result_pages(self, result, read_pages)
so it should probably put back in its place.
The thread failed quietly like:
< t:2024-06-29 03:30:13,846 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2024-06-29 03:30:13.844: (FullScanEvent Severity.NORMAL) period_type=end event_id=9df0e7db-f66b-487d-8b67-316809a0aa05 duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:31:13,899 f:operations_thread.py l:147 c:ScanOperationThread p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:31:14,221 f:scan_operation_thread.py l:123 c:FullScanOperation p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188 c:FullScanOperation p:DEBUG > Will fetch up to 1000 result pages..
< t:2024-06-29 03:31:14,428 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2024-06-29 03:31:14.426: (FullScanEvent Severity.NORMAL) period_type=end event_id=bdd695d8-55ab-4795-8730-d3a42f60b4ef duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:32:14,467 f:operations_thread.py l:147 c:ScanOperationThread p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:32:14,811 f:scan_operation_thread.py l:123 c:FullScanOperation p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:32:15,044 f:scan_operation_thread.py l:188 c:FullScanOperation p:DEBUG > Will fetch up to 0 result pages..
2 minutes?
i see 1st result report
t:2024-06-29 03:22:52,894 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > +------------------------+--------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > | op_type | duration | exceptions | nemesis_at_start | nemesis_at_end | success | cmd |
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > +------------------------+--------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > | FullScanAggregateEvent | 24.877952337265015 | | None | None | True | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > +------------------------+--------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
ant last
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > Thread stats:
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > +------------------------+---------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > | op_type | duration | exceptions | nemesis_at_start | nemesis_at_end | success | cmd |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > +------------------------+---------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > | FullScanAggregateEvent | 24.877952337265015 | | None | None | True | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > | FullScanEvent | 0.3228580951690674 | | None | None | True | SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > | FullScanAggregateEvent | 28.062369346618652 | | None | None | True | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > | FullScanAggregateEvent | 21.24626350402832 | | None | None | True | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > | FullScanAggregateEvent | 28.70813512802124 | | None | None | True | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > | FullScanEvent | 0.17992639541625977 | | None | None | True | SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > | FullScanEvent | 0.12150192260742188 | | None | None | True | SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > | FullScanEvent | 0.17914700508117676 | | None | None | True | SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156 c:ScanOperationThread p:DEBUG > +------------------------+---------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
<
last executing attempt that I see
< t:2024-06-29 03:32:14,467 f:operations_thread.py l:140 c:ScanOperationThread p:DEBUG > Operation instance map: {'table_and_aggregate': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e96dd0>, 'random': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e97910>, 'table': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e975b0>, 'partition': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e96ef0>, 'aggregate': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e979a0>}
< t:2024-06-29 03:32:14,467 f:operations_thread.py l:147 c:ScanOperationThread p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:32:14,469 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2024-06-29 03:32:14.467: (FullScanEvent Severity.NORMAL) period_type=begin event_id=709bd269-6371-49a2-9fb0-5bd0e4a5f8e0 node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
after that - silence, not clear why
but thread is alive at the end of tests
2024-06-30 04:17:59,475 f:threads_and_processes_alive.py l:66 c:sdcm.utils.threads_and_processes_alive p:ERROR > There are some threads left alive from following modules: sdcm.tester,multiprocessing.queues,socketserver,cassandra.io.libevreactor,sdcm.utils.operations_thread,sdcm.nemesis,cassandra.cluster,cassandra.connection,concurrent.futures.thread,sdcm.remote.libssh2_client
probably failed for the same root cause of #8292
@yarongilor #8292 is now closed, that meaning this can be closed ?
No, sorry, it was mistakenly not related issue. As for what happened here, it seems that during a refactor by @temichus , in this commit: 64ded4b#diff-75fbcbc6f5d81770abbb95177a9f3b80c7a7f809503152dcc17d64a779637a52
the line of
result.fetch_next_page()
is deleted fromdef fetch_result_pages(self, result, read_pages)
so it should probably put back in its place. The thread failed quietly like:< t:2024-06-29 03:30:13,846 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2024-06-29 03:30:13.844: (FullScanEvent Severity.NORMAL) period_type=end event_id=9df0e7db-f66b-487d-8b67-316809a0aa05 duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully < t:2024-06-29 03:31:13,899 f:operations_thread.py l:147 c:ScanOperationThread p:DEBUG > Going to run operation FullScanOperation < t:2024-06-29 03:31:14,221 f:scan_operation_thread.py l:123 c:FullScanOperation p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s < t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188 c:FullScanOperation p:DEBUG > Will fetch up to 1000 result pages.. < t:2024-06-29 03:31:14,428 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2024-06-29 03:31:14.426: (FullScanEvent Severity.NORMAL) period_type=end event_id=bdd695d8-55ab-4795-8730-d3a42f60b4ef duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully < t:2024-06-29 03:32:14,467 f:operations_thread.py l:147 c:ScanOperationThread p:DEBUG > Going to run operation FullScanOperation < t:2024-06-29 03:32:14,811 f:scan_operation_thread.py l:123 c:FullScanOperation p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s < t:2024-06-29 03:32:15,044 f:scan_operation_thread.py l:188 c:FullScanOperation p:DEBUG > Will fetch up to 0 result pages..
this commit was merge 1.5 years ago
and before this log thread is working
< t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188 c:FullScanOperation p:DEBUG > Will fetch up to 1000 result pages..
probably failed for the same root cause of #8292
@yarongilor #8292 is now closed, that meaning this can be closed ?
No, sorry, it was mistakenly not related issue. As for what happened here, it seems that during a refactor by @temichus , in this commit: 64ded4b#diff-75fbcbc6f5d81770abbb95177a9f3b80c7a7f809503152dcc17d64a779637a52 the line of
result.fetch_next_page()
is deleted fromdef fetch_result_pages(self, result, read_pages)
so it should probably put back in its place. The thread failed quietly like:< t:2024-06-29 03:30:13,846 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2024-06-29 03:30:13.844: (FullScanEvent Severity.NORMAL) period_type=end event_id=9df0e7db-f66b-487d-8b67-316809a0aa05 duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully < t:2024-06-29 03:31:13,899 f:operations_thread.py l:147 c:ScanOperationThread p:DEBUG > Going to run operation FullScanOperation < t:2024-06-29 03:31:14,221 f:scan_operation_thread.py l:123 c:FullScanOperation p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s < t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188 c:FullScanOperation p:DEBUG > Will fetch up to 1000 result pages.. < t:2024-06-29 03:31:14,428 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2024-06-29 03:31:14.426: (FullScanEvent Severity.NORMAL) period_type=end event_id=bdd695d8-55ab-4795-8730-d3a42f60b4ef duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully < t:2024-06-29 03:32:14,467 f:operations_thread.py l:147 c:ScanOperationThread p:DEBUG > Going to run operation FullScanOperation < t:2024-06-29 03:32:14,811 f:scan_operation_thread.py l:123 c:FullScanOperation p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s < t:2024-06-29 03:32:15,044 f:scan_operation_thread.py l:188 c:FullScanOperation p:DEBUG > Will fetch up to 0 result pages..
this commit was merge 1.5 years ago
and before this log thread is working
< t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188 c:FullScanOperation p:DEBUG > Will fetch up to 1000 result pages..
@temichus , it only get the problem when read_pages = 0. It is left running in an endless loop of:
while result.has_more_pages and pages <= read_pages:
if read_pages > 0:
pages += 1
sending a fix now
Packages
Scylla version:
6.1.0~dev-20240625.c80dc5715668
with build-idbf0032dbaafe5e4d3e01ece0dcb7785d2ec7a098
Kernel Version:
5.15.0-1063-aws
Issue description
Describe your issue in detail and steps it took to produce it.
A (side effect) followup of https://github.com/scylladb/scylladb/issues/19573#issuecomment-2200597882
the
run_fullscan
is configured to run every minute:So probably the thread failed for some reason (not expected).
Impact
Describe the impact this issue causes to the user.
How frequently does it reproduce?
Describe the frequency with how this issue can be reproduced.
Installation details
Cluster size: 6 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image:
ami-09006ca344092e50b
(aws: undefined_region)Test:
longevity-50gb-3days-test
Test id:04447069-14c9-47f9-8cf8-fe71a0ec1276
Test name:scylla-master/tier1/longevity-50gb-3days-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 04447069-14c9-47f9-8cf8-fe71a0ec1276` - Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=04447069-14c9-47f9-8cf8-fe71a0ec1276) - Show all stored logs command: `$ hydra investigate show-logs 04447069-14c9-47f9-8cf8-fe71a0ec1276` ## Logs: - **db-cluster-04447069.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/db-cluster-04447069.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/db-cluster-04447069.tar.gz) - **sct-runner-events-04447069.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/sct-runner-events-04447069.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/sct-runner-events-04447069.tar.gz) - **sct-04447069.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/sct-04447069.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/sct-04447069.log.tar.gz) - **loader-set-04447069.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/loader-set-04447069.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/loader-set-04447069.tar.gz) - **monitor-set-04447069.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/monitor-set-04447069.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/monitor-set-04447069.tar.gz) [Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/tier1/job/longevity-50gb-3days-test/23/) [Argus](https://argus.scylladb.com/test/98050732-dfe3-464c-a66a-f235bad30829/runs?additionalRuns[]=04447069-14c9-47f9-8cf8-fe71a0ec1276)