scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

`disrupt_hot_reloading_internode_certificate` not seeing the reload message #7882

Closed juliayakovlev closed 3 months ago

juliayakovlev commented 3 months ago

Packages

Scylla version: 2023.1.10-20240706.21cffccc1ccd with build-id 87fcdaf894b5dfb25d26f12d7d4aed71b84bfd44

Kernel Version: 5.15.0-1067-azure

Issue description

The issue https://github.com/scylladb/scylla-cluster-tests/issues/5354 is back.

Although the certificate was reloaded and the corresponding message appears in the log of every node, the check does not find it.

node1

< t:2024-07-07 05:20:17,675 f:db_log_reader.py l:123  c:sdcm.db_log_reader   p:DEBUG > 2024-07-07T05:20:17.613+00:00 longevity-tls-1tb-7d-2023-1-db-node-eastus-1     !INFO | scylla[12239]:  [shard  7] messaging_service - Reloaded {/etc/scylla/ssl_conf/db.crt}

node2

< t:2024-07-07 05:20:18,739 f:db_log_reader.py l:123  c:sdcm.db_log_reader   p:DEBUG > 2024-07-07T05:20:18.646+00:00 longevity-tls-1tb-7d-2023-1-db-node-eastus-2     !INFO | scylla[9656]:  [shard 10] messaging_service - Reloaded {/etc/scylla/ssl_conf/db.crt}

node3

< t:2024-07-07 05:20:19,368 f:db_log_reader.py l:123  c:sdcm.db_log_reader   p:DEBUG > 2024-07-07T05:20:19.300+00:00 longevity-tls-1tb-7d-2023-1-db-node-eastus-3     !INFO | scylla[9479]:  [shard  1] messaging_service - Reloaded {/etc/scylla/ssl_conf/db.crt}

node4

< t:2024-07-07 05:20:19,794 f:db_log_reader.py l:123  c:sdcm.db_log_reader   p:DEBUG > 2024-07-07T05:20:19.700+00:00 longevity-tls-1tb-7d-2023-1-db-node-eastus-4     !INFO | scylla[9570]:  [shard  6] messaging_service - Reloaded {/etc/scylla/ssl_conf/db.crt}

The first failure is reported later, after all four nodes have logged the reload message:

< t:2024-07-07 05:20:20,214 f:decorators.py   l:72   c:sdcm.utils.decorators p:DEBUG > 'check_ssl_reload_log': failed with 'LogContentNotFound('Reload SSL message not found in node log')', retrying [#0]
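
To help triage, the log excerpts above can be checked offline with a small script. The following is a minimal sketch, not the actual `check_ssl_reload_log` implementation in SCT: the regular expression is modelled only on the four lines quoted in this issue, and the names `RELOAD_RE` and `find_reloads` are hypothetical.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern, derived only from the log lines quoted in this issue.
RELOAD_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}\+00:00)\s+"
    r"(?P<node>\S+)\s+!INFO \| scylla\[\d+\]:\s+\[shard\s+\d+\] "
    r"messaging_service - Reloaded \{(?P<path>[^}]+)\}"
)


def find_reloads(lines, since):
    """Return (node, timestamp, path) for reload messages logged at or after `since`."""
    hits = []
    for line in lines:
        match = RELOAD_RE.search(line)
        if match is None:
            continue
        timestamp = datetime.fromisoformat(match["ts"])
        if timestamp >= since:
            hits.append((match["node"], timestamp, match["path"]))
    return hits


if __name__ == "__main__":
    # One of the lines quoted above, without the SCT log prefix.
    sample = [
        "2024-07-07T05:20:17.613+00:00 longevity-tls-1tb-7d-2023-1-db-node-eastus-1"
        "     !INFO | scylla[12239]:  [shard  7] messaging_service - "
        "Reloaded {/etc/scylla/ssl_conf/db.crt}"
    ]
    # `since` is an assumed cutoff; the real check may use a different reference point.
    cutoff = datetime(2024, 7, 7, 5, 20, 0, tzinfo=timezone.utc)
    for node, timestamp, path in find_reloads(sample, cutoff):
        print(node, timestamp.isoformat(), path)
```

Running it against the extracted node logs with different `since` values may show whether the messages fall outside the window the check inspects, though the actual cause here is addressed by the revert referenced in the comments.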

Installation details

Cluster size: 4 nodes (Standard_L16s_v3)

Scylla Nodes used in this run:

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-2023.1.10-x86_64-2024-07-07T02-51-37 (azure: undefined_region)

Test: longevity-1tb-5days-azure-test
Test id: 2aaff180-b24e-4a14-a991-5c0edfe618ec
Test name: enterprise-2023.1/longevity/longevity-1tb-5days-azure-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 2aaff180-b24e-4a14-a991-5c0edfe618ec`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=2aaff180-b24e-4a14-a991-5c0edfe618ec)
- Show all stored logs command: `$ hydra investigate show-logs 2aaff180-b24e-4a14-a991-5c0edfe618ec`

Logs:

- **db-cluster-2aaff180.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/2aaff180-b24e-4a14-a991-5c0edfe618ec/20240707_160727/db-cluster-2aaff180.tar.gz
- **sct-runner-events-2aaff180.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/2aaff180-b24e-4a14-a991-5c0edfe618ec/20240707_160727/sct-runner-events-2aaff180.tar.gz
- **sct-2aaff180.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/2aaff180-b24e-4a14-a991-5c0edfe618ec/20240707_160727/sct-2aaff180.log.tar.gz
- **loader-set-2aaff180.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/2aaff180-b24e-4a14-a991-5c0edfe618ec/20240707_160727/loader-set-2aaff180.tar.gz
- **monitor-set-2aaff180.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/2aaff180-b24e-4a14-a991-5c0edfe618ec/20240707_160727/monitor-set-2aaff180.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2023.1/job/longevity/job/longevity-1tb-5days-azure-test/17/)
[Argus](https://argus.scylladb.com/test/bf1550bc-0943-4e5a-ab22-03fcc781d580/runs?additionalRuns[]=2aaff180-b24e-4a14-a991-5c0edfe618ec)

dimakr commented 3 months ago

This issue should be fixed by this revert https://github.com/scylladb/scylla-cluster-tests/pull/7887 and its backports.