scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

scylla-bench raises invalid peer error when bootstrapping 3 nodes in parallel #7705

Closed soyacz closed 5 days ago

soyacz commented 2 months ago

Packages

Scylla version: 6.1.0~dev-20240619.7567b87e72cf with build-id f0a41eece67ec7a8d64b047e743f0396d9f23463

Kernel Version: 5.15.0-1063-aws

Issue description

For example, in the AWSCluster class, add_nodes creates the instances and appends them to the cluster object right after creation. The nodes are therefore available to stress commands before they are bootstrapped, and this causes problems on scylla-bench startup:

2024-06-21 07:23:35.600 <2024-06-21 07:23:35.000>: (ScyllaBenchLogEvent Severity.CRITICAL) period_type=one-time event_id=0206b9d5-f081-47cd-a64e-fe93405253a8 during_nemesis=GrowShrinkCluster: type=ParseDistributionError regex=missing parameter|unexpected parameter|unsupported|invalid line_number=1 node=Node longevity-5gb-1h-GrowShrinkClusterN-loader-node-a5cec54d-1 [54.246.45.116 | 10.4.0.204]
2024/06/21 07:23:35 Found invalid peer '[HostInfo hostname="" connectAddress="10.4.1.101" peer="10.4.1.101" rpc_address="<nil>" broadcast_address="<nil>" preferred_ip="<nil>" connect_addr="10.4.1.101" connect_addr_source="connect_address" port=9042 data_centre="" rack="" host_id="51e2b480-57b0-4bd3-a371-9ed355c6523c" version="v0.0.0" state=UP num_tokens=0]' Likely due to a gossip or snitch issue, this host will be ignored

Impact

This fails the test.

How frequently does it reproduce?

Possibly often for an individual GrowShrink nemesis test, where the nemesis is run during prepare.

Installation details

Cluster size: 3 nodes (i4i.large)

Scylla Nodes used in this run:

OS / Image: ami-038debe0a5d79391a (aws: undefined_region)

Test: longevity-5gb-1h-GrowShrinkClusterNemesis-aws-test

Test id: a5cec54d-9a8b-43ea-95f5-5645bc32791b

Test name: scylla-staging/lukasz/longevity-5gb-1h-GrowShrinkClusterNemesis-aws-test

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor a5cec54d-9a8b-43ea-95f5-5645bc32791b`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=a5cec54d-9a8b-43ea-95f5-5645bc32791b)
- Show all stored logs command: `$ hydra investigate show-logs a5cec54d-9a8b-43ea-95f5-5645bc32791b`

## Logs:

- **db-cluster-a5cec54d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a5cec54d-9a8b-43ea-95f5-5645bc32791b/20240621_073135/db-cluster-a5cec54d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a5cec54d-9a8b-43ea-95f5-5645bc32791b/20240621_073135/db-cluster-a5cec54d.tar.gz)
- **sct-runner-events-a5cec54d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a5cec54d-9a8b-43ea-95f5-5645bc32791b/20240621_073135/sct-runner-events-a5cec54d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a5cec54d-9a8b-43ea-95f5-5645bc32791b/20240621_073135/sct-runner-events-a5cec54d.tar.gz)
- **sct-a5cec54d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a5cec54d-9a8b-43ea-95f5-5645bc32791b/20240621_073135/sct-a5cec54d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a5cec54d-9a8b-43ea-95f5-5645bc32791b/20240621_073135/sct-a5cec54d.log.tar.gz)
- **loader-set-a5cec54d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a5cec54d-9a8b-43ea-95f5-5645bc32791b/20240621_073135/loader-set-a5cec54d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a5cec54d-9a8b-43ea-95f5-5645bc32791b/20240621_073135/loader-set-a5cec54d.tar.gz)
- **monitor-set-a5cec54d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a5cec54d-9a8b-43ea-95f5-5645bc32791b/20240621_073135/monitor-set-a5cec54d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a5cec54d-9a8b-43ea-95f5-5645bc32791b/20240621_073135/monitor-set-a5cec54d.tar.gz)
- **parallel-timelines-report-a5cec54d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a5cec54d-9a8b-43ea-95f5-5645bc32791b/20240621_073135/parallel-timelines-report-a5cec54d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a5cec54d-9a8b-43ea-95f5-5645bc32791b/20240621_073135/parallel-timelines-report-a5cec54d.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/lukasz/job/longevity-5gb-1h-GrowShrinkClusterNemesis-aws-test/12/)

[Argus](https://argus.scylladb.com/test/a2af091e-3cbd-4068-b825-35d511f15d17/runs?additionalRuns[]=a5cec54d-9a8b-43ea-95f5-5645bc32791b)

soyacz commented 2 months ago

Actually, it doesn't depend on the stress command. I retested by first running s-b and then growing the cluster by 3 nodes, and the errors still appear in the s-b output:

2024/06/21 08:41:11 Found invalid peer '[HostInfo hostname="" connectAddress="10.4.3.48" peer="10.4.3.48" rpc_address="<nil>" broadcast_address="<nil>" preferred_ip="<nil>" connect_addr="10.4.3.48" connect_addr_source="connect_address" port=9042 data_centre="" rack="" host_id="5adb5164-ce98-4dca-b88f-2da728d78f42" version="v0.0.0" state=UP num_tokens=0]' Likely due to a gossip or snitch issue, this host will be ignored

s-b continues to work, but the test ends with a critical stress event (as we treat "invalid" as a no-go).

@roydahan @fruch is this an s-b/gocql driver issue, or do we just need to ignore this error?

fruch commented 2 months ago

It's not the first time we've seen this print in the Go driver.

But I think this should be reported against the driver to get it figured out.

Topology changes are a problematic area for drivers (at least from my python driver knowledge), and doing that in parallel was never tested with the drivers...

fruch commented 2 months ago

Also exactly the reason why I didn't want to enable parallel bootstrap by default

vponomaryov commented 2 months ago

@soyacz the scylla-bench-v0.1.20 version uses a pretty old gocql driver version -> v1.12.1-0.20240207140227-3c32c6cd75e5 (2024 Feb 7). It is a dev version which became v1.13.0. The latest version is now v1.14.1.

So, I think it would be more correct to make S-B use the latest driver version, retest tablets with it, and, if the problem persists, file a bug against scylladb/gocql.

soyacz commented 2 months ago

> @soyacz the scylla-bench-v0.1.20 version uses a pretty old gocql driver version -> v1.12.1-0.20240207140227-3c32c6cd75e5 (2024 Feb 7). It is a dev version which became v1.13.0. The latest version is now v1.14.1.
>
> So, I think it would be more correct to make S-B use the latest driver version, retest tablets with it, and, if the problem persists, file a bug against scylladb/gocql.

Good idea. What is the process? Who's going to do it?

soyacz commented 2 months ago

> Also exactly the reason why I didn't want to enable parallel bootstrap by default

But this is how we find bugs...

vponomaryov commented 2 months ago

> > @soyacz the scylla-bench-v0.1.20 version uses a pretty old gocql driver version -> v1.12.1-0.20240207140227-3c32c6cd75e5 (2024 Feb 7). It is a dev version which became v1.13.0. The latest version is now v1.14.1. So, I think it would be more correct to make S-B use the latest driver version, retest tablets with it, and, if the problem persists, file a bug against scylladb/gocql.

> Good idea. What is the process? Who's going to do it?

I will do it.

Upd: https://github.com/scylladb/scylla-bench/pull/141

Upd2: SCT PR:

Docker image:

fruch commented 2 months ago

> > Also exactly the reason why I didn't want to enable parallel bootstrap by default

> but this is how we find bugs...

Yes, but it's also how one generates more work than one has planned.

The ownership for parallel bootstrap should be clear, not something done on a volunteering basis

soyacz commented 2 months ago

> > > Also exactly the reason why I didn't want to enable parallel bootstrap by default

> > but this is how we find bugs...

> Yes, but it's also how one generates more work than one has planned.

I think this one has big potential for finding bugs and will generate work. But I agree, it should be more planned.

> The ownership for parallel bootstrap should be clear, not something done on a volunteering basis

This came to me as part of the elasticity test. I also saw https://github.com/scylladb/scylla-cluster-tests/pull/7441, with which I fundamentally disagree.

To conclude: in my opinion, after merging 'parallel operations' support in SCT, the proper team should take responsibility for testing this feature by enabling it, running it, and raising issues like this one.

fruch commented 2 months ago

We should limit the regex to a more specific error coming from s-b.
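A minimal sketch of what narrowing could look like, in Python. The broad pattern is copied from the `regex=` field of the ParseDistributionError event in the issue description. The narrowed pattern is only an assumption for illustration (it excludes "invalid" when followed by " peer"); it is not the pattern SCT ended up using, and the last sample line is hypothetical.

```python
import re

# Pattern as shown in the ParseDistributionError event in the issue description.
BROAD = re.compile(r"missing parameter|unexpected parameter|unsupported|invalid")

# Hypothetical narrowed pattern (an assumption, not SCT's actual fix):
# "invalid" no longer matches when it is followed by " peer".
NARROW = re.compile(r"missing parameter|unexpected parameter|unsupported|invalid(?! peer)")

# Abbreviated copy of the gocql message reported above ("..." marks omitted fields).
PEER_LINE = (
    "2024/06/21 07:23:35 Found invalid peer '[HostInfo hostname=\"\" ... num_tokens=0]' "
    "Likely due to a gossip or snitch issue, this host will be ignored"
)

# Hypothetical s-b parameter error, used only to show the narrowed pattern still fires.
PARAM_LINE = "invalid distribution specification"

print(bool(BROAD.search(PEER_LINE)))    # broad pattern flags the driver message
print(bool(NARROW.search(PEER_LINE)))   # narrowed pattern does not
print(bool(NARROW.search(PARAM_LINE)))  # narrowed pattern still flags parameter errors
```

A negative lookahead keeps the change small; an alternative would be to list the exact s-b error strings, which requires knowing them from the scylla-bench source.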

mykaul commented 2 months ago

@fee-mendes - isn't that the same issue you saw in your tablets scale out test?

fee-mendes commented 2 months ago

> @fee-mendes - isn't that the same issue you saw in your tablets scale out test?

Yes, all the discussion in https://github.com/scylladb/scylladb/issues/19107#issuecomment-2186639040 is related.

soyacz commented 2 months ago

@mykaul @fee-mendes we investigated the case with @fruch and found that these errors are transient and that the driver recovers with the next peer refresh (1s). Its severity is possibly low.

fee-mendes commented 2 months ago

Yeah, it will be transient until the driver reads a valid peer refresh. The callout in the linked issue, that vNodes bootstrap can extend this state, can become a problem and may require its own set of testing. :/

kbr-scylla commented 2 months ago

Judging from the message: Found invalid peer ... this host will be ignored

Here, unlike in https://github.com/scylladb/scylladb/issues/19107, we ignore only this one peer, but the rest of system.peers is processed as usual.

In that case there should be no observable impact on anything (except that the test framework considers this a critical error and decides to fail the test). That's because the rows that the driver considers "invalid" and ignores in 6.0, would not exist in the first place in 5.4. So -- they didn't exist before, and now they are transparent (just log a warning).

Then the severity is not only low. It is 0. Whether we use tablets or vnodes.

fruch commented 2 months ago

> Judging from the message: Found invalid peer ... this host will be ignored
>
> Here, unlike in https://github.com/scylladb/scylladb/issues/19107, we ignore only this one peer, but the rest of system.peers is processed as usual.
>
> In that case there should be no observable impact on anything (except that the test framework considers this a critical error and decides to fail the test). That's because the rows that the driver considers "invalid" and ignores in 6.0, would not exist in the first place in 5.4. So -- they didn't exist before, and now they are transparent (just log a warning).
>
> Then the severity is not only low. It is 0. Whether we use tablets or vnodes.

That's basically what @soyacz said above. As you can see, the issue is in the SCT repo, and we should fix it there so that this message is treated not as a critical error but as the warning that it is.
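One way to picture that change is an ordered rule list in which the known-benign driver message is checked before the broad pattern. This is only a sketch in Python: the `Severity` enum, `RULES`, and `classify` are hypothetical stand-ins written for this example and are not SCT's event classes.

```python
import re
from enum import Enum


class Severity(Enum):
    # Stand-in severity levels for this sketch only.
    NORMAL = 1
    WARNING = 2
    CRITICAL = 4


# First match wins, so the specific benign message must come before the broad pattern.
RULES = [
    (re.compile(r"Found invalid peer .* this host will be ignored"), Severity.WARNING),
    (re.compile(r"missing parameter|unexpected parameter|unsupported|invalid"), Severity.CRITICAL),
]


def classify(line: str) -> Severity:
    """Return the severity of the first rule whose pattern matches the line."""
    for pattern, severity in RULES:
        if pattern.search(line):
            return severity
    return Severity.NORMAL


# Abbreviated copy of the gocql message from this issue ("..." marks omitted fields).
PEER_LINE = (
    "2024/06/21 07:23:35 Found invalid peer '[HostInfo hostname=\"\" ... num_tokens=0]' "
    "Likely due to a gossip or snitch issue, this host will be ignored"
)

print(classify(PEER_LINE).name)
```

With this ordering the peer message is still surfaced, so a driver regression that makes it persistent would remain visible in the event log without failing the test.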

kbr-scylla commented 2 months ago

Scylla issue: https://github.com/scylladb/scylladb/issues/19507