scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

fullscan thread failed after running for 2 minutes #8287

Closed yarongilor closed 1 week ago

yarongilor commented 3 months ago

Packages

Scylla version: 6.1.0~dev-20240625.c80dc5715668 with build-id bf0032dbaafe5e4d3e01ece0dcb7785d2ec7a098

Kernel Version: 5.15.0-1063-aws

Issue description


A side-effect follow-up of https://github.com/scylladb/scylladb/issues/19573#issuecomment-2200597882

The run_fullscan operation is configured to run every minute:

run_fullscan: ['{"mode": "table_and_aggregate", "ks_cf": "keyspace1.standard1", "interval": 60}']

So the thread probably failed for some unexpected reason.

Impact


How frequently does it reproduce?


Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-09006ca344092e50b (aws: undefined_region)

Test: longevity-50gb-3days-test Test id: 04447069-14c9-47f9-8cf8-fe71a0ec1276 Test name: scylla-master/tier1/longevity-50gb-3days-test

Test config file(s):

Logs and commands:

- Restore Monitor Stack command: `$ hydra investigate show-monitor 04447069-14c9-47f9-8cf8-fe71a0ec1276`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=04447069-14c9-47f9-8cf8-fe71a0ec1276)
- Show all stored logs command: `$ hydra investigate show-logs 04447069-14c9-47f9-8cf8-fe71a0ec1276`

Logs:

- **db-cluster-04447069.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/db-cluster-04447069.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/db-cluster-04447069.tar.gz)
- **sct-runner-events-04447069.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/sct-runner-events-04447069.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/sct-runner-events-04447069.tar.gz)
- **sct-04447069.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/sct-04447069.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/sct-04447069.log.tar.gz)
- **loader-set-04447069.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/loader-set-04447069.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/loader-set-04447069.tar.gz)
- **monitor-set-04447069.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/monitor-set-04447069.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/monitor-set-04447069.tar.gz)
- [Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/tier1/job/longevity-50gb-3days-test/23/)
- [Argus](https://argus.scylladb.com/test/98050732-dfe3-464c-a66a-f235bad30829/runs?additionalRuns[]=04447069-14c9-47f9-8cf8-fe71a0ec1276)
fruch commented 3 months ago

@yarongilor, as the one who wrote that feature, I would expect you to have a bit more insight into why this failed, and why the checks aren't continuing.

i.e. I don't expect to ever see ThreadFailedEvent; if it pops up, it means the code didn't handle some exception, and that thread stopped executing.

yarongilor commented 3 months ago

Probably failed for the same root cause as https://github.com/scylladb/scylla-cluster-tests/issues/8292

fruch commented 1 month ago

Probably failed for the same root cause as #8292

@yarongilor

#8292 is now closed; does that mean this one can be closed?

yarongilor commented 1 month ago

Probably failed for the same root cause as #8292

@yarongilor #8292 is now closed; does that mean this one can be closed?

No, sorry, that issue was linked by mistake and is not related. As for what happened here, it seems that during a refactor by @temichus, in this commit: https://github.com/scylladb/scylla-cluster-tests/commit/64ded4b13653563c582facd9b1aa554dfeb4a72e#diff-75fbcbc6f5d81770abbb95177a9f3b80c7a7f809503152dcc17d64a779637a52

the call to result.fetch_next_page() was deleted from def fetch_result_pages(self, result, read_pages), so it should probably be put back in its place. The thread failed quietly, like:

< t:2024-06-29 03:30:13,846 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:30:13.844: (FullScanEvent Severity.NORMAL) period_type=end event_id=9df0e7db-f66b-487d-8b67-316809a0aa05 duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:31:13,899 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:31:14,221 f:scan_operation_thread.py l:123  c:FullScanOperation    p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 1000 result pages..
< t:2024-06-29 03:31:14,428 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:31:14.426: (FullScanEvent Severity.NORMAL) period_type=end event_id=bdd695d8-55ab-4795-8730-d3a42f60b4ef duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:32:14,467 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:32:14,811 f:scan_operation_thread.py l:123  c:FullScanOperation    p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:32:15,044 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 0 result pages..
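For context, a minimal sketch (hypothetical names, not the actual SCT code) of why a missing result.fetch_next_page() call stalls the scan: with a cassandra-driver-style ResultSet, has_more_pages only changes after the next page is explicitly fetched, so the loop must advance the cursor on every iteration.

```python
class FakePagedResult:
    """Stand-in for a cassandra-driver-style ResultSet with N pages."""

    def __init__(self, total_pages=3):
        self._remaining = total_pages

    @property
    def has_more_pages(self):
        return self._remaining > 0

    def fetch_next_page(self):
        self._remaining -= 1


def fetch_result_pages(result, read_pages):
    """Hypothetical reconstruction: fetch_next_page() must be called on
    every iteration, otherwise has_more_pages never becomes False."""
    pages = 0
    while result.has_more_pages and pages < read_pages:
        result.fetch_next_page()  # advance the paging cursor
        pages += 1
    return pages


print(fetch_result_pages(FakePagedResult(), read_pages=1000))  # prints 3
```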
temichus commented 1 month ago

2 minutes?

I see the 1st result report:

< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+--------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |        op_type         |      duration      | exceptions | nemesis_at_start | nemesis_at_end | success |                                    cmd                                    |
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+--------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > | FullScanAggregateEvent | 24.877952337265015 |            |       None       |      None      |   True  | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+--------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+

and the last:

< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > Thread stats:
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+---------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |        op_type         |       duration      | exceptions | nemesis_at_start | nemesis_at_end | success |                                    cmd                                    |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+---------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > | FullScanAggregateEvent |  24.877952337265015 |            |       None       |      None      |   True  | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |     FullScanEvent      |  0.3228580951690674 |            |       None       |      None      |   True  |     SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s     |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > | FullScanAggregateEvent |  28.062369346618652 |            |       None       |      None      |   True  | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > | FullScanAggregateEvent |  21.24626350402832  |            |       None       |      None      |   True  | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > | FullScanAggregateEvent |  28.70813512802124  |            |       None       |      None      |   True  | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |     FullScanEvent      | 0.17992639541625977 |            |       None       |      None      |   True  |     SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s     |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |     FullScanEvent      | 0.12150192260742188 |            |       None       |      None      |   True  |     SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s     |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |     FullScanEvent      | 0.17914700508117676 |            |       None       |      None      |   True  |     SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s     |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+---------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+

The last execution attempt I see:

< t:2024-06-29 03:32:14,467 f:operations_thread.py l:140  c:ScanOperationThread  p:DEBUG > Operation instance map: {'table_and_aggregate': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e96dd0>, 'random': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e97910>, 'table': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e975b0>, 'partition': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e96ef0>, 'aggregate': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e979a0>}
< t:2024-06-29 03:32:14,467 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:32:14,469 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:32:14.467: (FullScanEvent Severity.NORMAL) period_type=begin event_id=709bd269-6371-49a2-9fb0-5bd0e4a5f8e0 node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s

After that, silence; it's not clear why.

But the thread is still alive at the end of the test:

2024-06-30 04:17:59,475 f:threads_and_processes_alive.py l:66   c:sdcm.utils.threads_and_processes_alive p:ERROR > There are some threads left alive from following modules: sdcm.tester,multiprocessing.queues,socketserver,cassandra.io.libevreactor,sdcm.utils.operations_thread,sdcm.nemesis,cassandra.cluster,cassandra.connection,concurrent.futures.thread,sdcm.remote.libssh2_client
temichus commented 1 month ago

Probably failed for the same root cause as #8292

@yarongilor #8292 is now closed; does that mean this one can be closed?

No, sorry, that issue was linked by mistake and is not related. As for what happened here, it seems that during a refactor by @temichus, in this commit: 64ded4b#diff-75fbcbc6f5d81770abbb95177a9f3b80c7a7f809503152dcc17d64a779637a52

the call to result.fetch_next_page() was deleted from def fetch_result_pages(self, result, read_pages), so it should probably be put back in its place. The thread failed quietly, like:

< t:2024-06-29 03:30:13,846 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:30:13.844: (FullScanEvent Severity.NORMAL) period_type=end event_id=9df0e7db-f66b-487d-8b67-316809a0aa05 duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:31:13,899 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:31:14,221 f:scan_operation_thread.py l:123  c:FullScanOperation    p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 1000 result pages..
< t:2024-06-29 03:31:14,428 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:31:14.426: (FullScanEvent Severity.NORMAL) period_type=end event_id=bdd695d8-55ab-4795-8730-d3a42f60b4ef duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:32:14,467 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:32:14,811 f:scan_operation_thread.py l:123  c:FullScanOperation    p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:32:15,044 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 0 result pages..

This commit was merged 1.5 years ago.

And the thread was still working right before this log line:

< t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 1000 result pages..
yarongilor commented 1 month ago

Probably failed for the same root cause as #8292

@yarongilor #8292 is now closed; does that mean this one can be closed?

No, sorry, that issue was linked by mistake and is not related. As for what happened here, it seems that during a refactor by @temichus, in this commit: 64ded4b#diff-75fbcbc6f5d81770abbb95177a9f3b80c7a7f809503152dcc17d64a779637a52 the call to result.fetch_next_page() was deleted from def fetch_result_pages(self, result, read_pages), so it should probably be put back in its place. The thread failed quietly, like:

< t:2024-06-29 03:30:13,846 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:30:13.844: (FullScanEvent Severity.NORMAL) period_type=end event_id=9df0e7db-f66b-487d-8b67-316809a0aa05 duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:31:13,899 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:31:14,221 f:scan_operation_thread.py l:123  c:FullScanOperation    p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 1000 result pages..
< t:2024-06-29 03:31:14,428 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:31:14.426: (FullScanEvent Severity.NORMAL) period_type=end event_id=bdd695d8-55ab-4795-8730-d3a42f60b4ef duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:32:14,467 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:32:14,811 f:scan_operation_thread.py l:123  c:FullScanOperation    p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:32:15,044 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 0 result pages..

This commit was merged 1.5 years ago.

And the thread was still working right before this log line:

< t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 1000 result pages..

@temichus, the problem only occurs when read_pages = 0. The thread is then left running in an endless loop of:

        while result.has_more_pages and pages <= read_pages:
            if read_pages > 0:
                pages += 1

Sending a fix now.
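To make the failure mode concrete: with read_pages = 0 the condition `pages <= read_pages` stays true while the if-branch never runs, so pages is never incremented and the loop spins forever. Below is a hedged sketch of one possible fix, using a hypothetical FakeResult stand-in; it assumes read_pages == 0 means "no page limit", which may differ from the real SCT semantics.

```python
class FakeResult:
    """Stand-in for a driver ResultSet exposing paging state."""

    def __init__(self, total_pages):
        self._remaining = total_pages

    @property
    def has_more_pages(self):
        return self._remaining > 0

    def fetch_next_page(self):
        self._remaining -= 1


def fetch_result_pages_fixed(result, read_pages):
    """One possible fix (assumption: read_pages == 0 means 'no page
    limit'): the counter and the paging cursor both advance on every
    iteration, so the loop always terminates."""
    pages = 0
    while result.has_more_pages and (read_pages == 0 or pages < read_pages):
        result.fetch_next_page()
        pages += 1
    return pages


print(fetch_result_pages_fixed(FakeResult(4), read_pages=0))  # prints 4
print(fetch_result_pages_fixed(FakeResult(4), read_pages=2))  # prints 2
```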