scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/
Apache License 2.0

schema disagreement error attempting to insert data after the Scylla upgrade #1150

Closed vponomaryov closed 5 days ago

vponomaryov commented 1 year ago

Issue description


Impact

User cannot perform some queries.

How frequently does it reproduce?

It reproduced in 2 out of 2 runs.

Installation details

Kernel Version: 5.15.0-1020-gke
Scylla version (or git commit hash): 5.0.5-20221009.5a97a1060 with build-id 5009658b834aaf68970135bfc84f964b66ea4dee
Relocatable Package: http://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.1/scylla-x86_64-package-5.1.2.0.20221225.4c0f7ea09893.tar.gz
Operator Image: scylladb/scylla-operator:1.8.0-rc.0
Operator Helm Version: 1.8.0-rc.0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 3 nodes (n1-standard-8)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: N/A (k8s-gke: us-east1-b)

Test: upgrade-major-scylla-k8s-gke
Test id: 207bdbdc-673c-4c52-ac37-44faddabe464
Test name: scylla-operator/operator-1.8/upgrade/upgrade-major-scylla-k8s-gke
Test config file(s):

<details> <summary>

Running a Scylla upgrade from 5.0.5-0.20221009.5a97a1060 with build-id 5009658b834aaf68970135bfc84f964b66ea4dee to 5.1.2-0.20221225.4c0f7ea09893 with build-id 4817fe236d57eca203f35b1dbb4bfe43cab72590 on the K8S backend (GKE), we faced the following problem:

Logs with error:

> Executing CQL 'INSERT INTO ks_no_range_ghost_test.users (KEY, password) VALUES ('user1', 'ch@ngem3a')' ... 
> Retrying request after UE. Attempt #0                                                                
> [control connection] Schemas mismatched, trying again                                                

... 48 more attempts, one every 200 ms ...

> [control connection] Schemas mismatched, trying again                                                
> Node 10.108.5.7:9042 is reporting a schema disagreement: {UUID('8af28221-bae0-35a1-bd3c-7bb3a7caf720'): [<DefaultEndPoint: 10.112.2.191:9042>, <DefaultEndPoint: 10.108.5.7:9042>], UUID('6e637294-2c1e-3fc9-a573-a83a5fc50e8f'): [<DefaultEndPoint: 10.112.9.194:9042>]}
> Skipping schema refresh due to lack of schema agreement
> [control connection] Waiting for schema agreement
> Retrying request after UE. Attempt #1
> [control connection] Schemas mismatched, trying again
> Retrying request after UE. Attempt #2
> Retrying request after UE. Attempt #3
> Retrying request after UE. Attempt #4
> t:2023-01-04 17:37:51,821 f:fill_db_data.py l:3255 c:sdcm.fill_db_data p:ERROR > INSERT INTO ks_no_range_ghost_test.users (KEY, password) VALUES ('user1', 'ch@ngem3a')
> Traceback (most recent call last):                                                                   
>   File "/home/ubuntu/scylla-cluster-tests/sdcm/fill_db_data.py", line 3252, in _run_db_queries       
>     res = session.execute(item['queries'][i])                                                        
>   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1552, in execute_verbose       
>     return execute_orig(*args, **kwargs)                                                             
>   File "cassandra/cluster.py", line 2699, in cassandra.cluster.Session.execute                       
>   File "cassandra/cluster.py", line 5006, in cassandra.cluster.ResponseFuture.result                 
> cassandra.Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level for cl QUORUM. Requires 1, alive 0" info={'consistency': 'QUORUM', 'required_replicas': 1, 'alive_replicas': 0}
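For context, the driver treats the schema as agreed only when every live endpoint reports the same schema version UUID, i.e. when the mapping it prints in the disagreement message collapses to a single key. A minimal sketch of that check over the mapping from the log above (the helper name is hypothetical, not driver API):

```python
from uuid import UUID

def has_schema_agreement(versions):
    """versions maps each schema version UUID to the endpoints reporting it;
    the cluster agrees only when exactly one version is present."""
    return len(versions) == 1

# The disagreement reported above: two schema versions across three endpoints,
# because the upgraded node had not yet pulled the latest schema.
disagreement = {
    UUID('8af28221-bae0-35a1-bd3c-7bb3a7caf720'): ['10.112.2.191:9042', '10.108.5.7:9042'],
    UUID('6e637294-2c1e-3fc9-a573-a83a5fc50e8f'): ['10.112.9.194:9042'],
}
print(has_schema_agreement(disagreement))  # False
```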

We run many commands, and the very same one failed at the same place in 2 different test runs.

The second test run used enterprise Scylla, upgrading from 2021.1.17-0.20221221.5318a7fec with build-id d4378bd13d179b4bbcde7bdc82b92d8cc71c52d8 to 2022.1.3-0.20220922.539a55e35 with build-id d1fb2faafd95058a04aad30b675ff7d2b930278d.

</summary>

Logs:

Jenkins job URL </details>

fruch commented 1 year ago

node-1 (the one being upgraded)

INFO  2023-01-04 17:37:22,424 [shard 0] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:9042 (unencrypted, non-shard-aware)
INFO  2023-01-04 17:37:22,424 [shard 0] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:19042 (unencrypted, shard-aware)

node-2, update the schema:

INFO  2023-01-04 17:37:31,453 [shard 0] schema_tables - Schema version changed to 8af28221-bae0-35a1-bd3c-7bb3a7caf720

node-1, notice the other nodes 2min after, and get the new schema from them:

INFO  2023-01-04 17:38:26,725 [shard 0] gossip - InetAddress 10.112.2.191 is now UP, status = NORMAL
INFO  2023-01-04 17:38:26,726 [shard 0] gossip - InetAddress 10.112.8.121 is now UP, status = NORMAL
INFO  2023-01-04 17:38:26,727 [shard 0] storage_service - Node 10.112.2.191 state jump to normal
INFO  2023-01-04 17:38:26,731 [shard 0] storage_service - Node 10.112.8.121 state jump to normal
...
INFO  2023-01-04 17:39:26,726 [shard 0] migration_manager - Requesting schema pull from 10.112.2.191:0
INFO  2023-01-04 17:39:26,726 [shard 0] migration_manager - Pulling schema from 10.112.2.191:0
INFO  2023-01-04 17:39:26,726 [shard 0] migration_manager - Requesting schema pull from 10.112.8.121:0
INFO  2023-01-04 17:39:26,726 [shard 0] migration_manager - Pulling schema from 10.112.8.121:0
INFO  2023-01-04 17:39:26,833 [shard 0] schema_tables - Altering keyspace_fill_db_data.table_options_test id=6e04e400-8c50-11ed-8fbc-394aebb27b6e version=9373d136-8b14-33a9-9d8b-191e567e7e6b
INFO  2023-01-04 17:39:26,834 [shard 0] schema_tables - Altering keyspace_fill_db_data.table_options_test_scylla_cdc_log id=6e04e402-8c50-11ed-8fbc-394aebb27b6e version=e1098738-72f1-347f-805c-454472f91653
...
INFO  2023-01-04 17:39:26,862 [shard 0] schema_tables - Schema version changed to 8af28221-bae0-35a1-bd3c-7bb3a7caf720
INFO  2023-01-04 17:39:26,863 [shard 0] migration_manager - Schema merge with 10.112.2.191:0 completed
INFO  2023-01-04 17:39:27,078 [shard 0] schema_tables - Schema version changed to 8af28221-bae0-35a1-bd3c-7bb3a7caf720

@vponomaryov, I think this might be a k8s-related issue, and we'll need @scylladb/team-operator to take a closer look here.

vponomaryov commented 1 year ago

@fruch Since the https://github.com/orgs/scylladb/teams/team-operator team doesn't have members yet, I need to mention people explicitly: @tnozicka, @zimnx, @rzetelskik. Please take a look at it.

DoronArazii commented 1 year ago

@fruch why is it marked as master/triage?

fruch commented 1 year ago

> @fruch why is it marked as master/triage?

It was a suspected core issue, seems like it's not the case.

zimnx commented 1 year ago

> It was a suspected core issue, seems like it's not the case.

Why do you think it's k8s related?

What's the condition you wait for before you issue an insert?

fruch commented 1 year ago

> It was a suspected core issue, seems like it's not the case.

> Why do you think it's k8s related?

> What's the condition you wait for before you issue an insert?

We are waiting like this:

    def wait_till_scylla_is_upgraded_on_all_nodes(self, target_version: str) -> None:
        def _is_cluster_upgraded() -> bool:
            for node in self.db_cluster.nodes:
                node.forget_scylla_version()
                if node.scylla_version != target_version or not node.db_up:
                    return False
            return True
        wait.wait_for(
            func=_is_cluster_upgraded,
            step=30,
            text="Waiting until all nodes in the cluster are upgraded",
            timeout=900,
            throw_exc=True,
        )

That is, we check that the version is what we expect and that the CQL port is open.

What else should we wait for before using the cluster?

zimnx commented 1 year ago

In my view, you should look at ScyllaCluster.Status.Conditions - Available=True,Progressing=False,Degraded=False.

Not keeping quorum throughout rollouts is a known issue on k8s - https://github.com/scylladb/scylla-operator/issues/1077
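Following that suggestion, readiness could be gated on those three conditions together. A minimal sketch over the standard Kubernetes condition shape (a list of entries with `type` and `status` fields); the helper name is hypothetical:

```python
def scylla_cluster_settled(conditions):
    """conditions: the ScyllaCluster.Status.Conditions list, each entry a
    dict with at least 'type' and 'status' fields. Settled means
    Available=True, Progressing=False, Degraded=False."""
    wanted = {"Available": "True", "Progressing": "False", "Degraded": "False"}
    seen = {c["type"]: c["status"] for c in conditions}
    return all(seen.get(t) == s for t, s in wanted.items())

# Mid-rollout: the cluster is available but still progressing, so not settled.
print(scylla_cluster_settled([
    {"type": "Available", "status": "True"},
    {"type": "Progressing", "status": "True"},
    {"type": "Degraded", "status": "False"},
]))  # False
```

On the command line, the same gate could presumably be expressed per condition with `kubectl wait`, e.g. `kubectl wait --for=condition=Progressing=False scyllacluster/<name>`.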

fruch commented 1 year ago

> In my view, you should look at ScyllaCluster.Status.Conditions - Available=True,Progressing=False,Degraded=False.

We will look at checking this status as well.

> Not keeping quorum throughout rollouts is a known issue on k8s - scylladb/scylla-operator#1077

@mykaul if it's agreed that this is an operator issue, can you help us move it there?

@zimnx it seems there are some strong arguments about the suggested solution for https://github.com/scylladb/scylla-operator/issues/1077; is it still moving forward?

rzetelskik commented 1 year ago

@fruch #1077 is waiting for input and reviews from the rest of the team in #1108

scylla-operator-bot[bot] commented 2 months ago

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 30d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Offer to help out

/lifecycle stale

scylla-operator-bot[bot] commented 1 month ago


/lifecycle rotten

scylla-operator-bot[bot] commented 5 days ago


/close not-planned

scylla-operator-bot[bot] commented 5 days ago

@scylla-operator-bot[bot]: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/scylladb/scylla-operator/issues/1150#issuecomment-2366721329):

> The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
>
> This bot triages un-triaged issues according to the following rules:
>
> - After 30d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.