Open k0machi opened 1 year ago
@k0machi can you share the Argus link? Generally, the c-s keyspace setting should not matter, as the keyspace already exists and should not be recreated by c-s. I need to see the logs to see precisely at what stage it happened and investigate a bit more. Also, I see we specify the datacenter in the c-s command - in case of CL=ALL I'm not sure we should.
guys, we need to make some progress with this one
I don't think the error in the keyspace setting in the c-s command is important: c-s does not create the keyspace if it already exists. This nemesis has worked for some time and we didn't find issues with it.
Write happens during the alteration via the context manager, read happens outside the context manager
This is intended; we read after dropping the added keyspace. That happens with QUORUM and there's no issue with it. The problem is with reading using CL=ALL.
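For reference, a minimal sketch of the flow being discussed; the helper names below are hypothetical stand-ins for the actual sdcm/nemesis code, shown only to pin down where each read and write sits relative to the temporary replication change:

from contextlib import contextmanager

# All names here are illustrative, not the real SCT helpers.
def run_stress(op, cl, keyspace="keyspace_new_dc"):
    # stands in for launching a cassandra-stress thread like the commands quoted below
    print(f"cassandra-stress {op} no-warmup cl={cl} -schema 'keyspace={keyspace} ...'")

@contextmanager
def temporary_replication_strategy_setter(keyspace, replication):
    print(f"ALTER KEYSPACE {keyspace} WITH replication = {replication}")  # add the new DC
    try:
        yield
    finally:
        print(f"ALTER KEYSPACE {keyspace} back to its original replication")  # roll back

with temporary_replication_strategy_setter("keyspace_new_dc",
                                           {"us-east": 3, "us-east_nemesis_dc": 1}):
    run_stress("write", cl="ALL")   # write during the alteration
    run_stress("read", cl="ALL")    # verification read at CL=ALL - the step reported as failing
run_stress("read", cl="QUORUM")     # read outside the context manager - no issue observed here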
What I see is very poor performance when reading data - c-s shows 3-10 ops, sometimes timing out and eventually leading to an error. What's different in this case is that we added a new DC with only one rack, while the first DC has 3 racks. DB nodes show reader concurrency semaphore errors:
longevity-cdc-100gb-4h-master-db-node-571bfeee-4 !INFO | scylla[5705]: [shard 13] reader_concurrency_semaphore - (rate limiting dropped 7916 similar messages) Semaphore _read_concurrency_sem with 1/100 count and 16536/169911255 memory resources: timed out, dumping permit diagnostics:
< t:2023-07-23 05:47:06,338 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > permits count memory table/operation/state
< t:2023-07-23 05:47:06,338 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 1 1 16K cdc_test.test_table_postimage/data-query/active/need_cpu
< t:2023-07-23 05:47:06,338 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 496 0 0B cdc_test.test_table/data-query/waiting_for_admission
< t:2023-07-23 05:47:06,338 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 1910 0 0B cdc_test.test_table_postimage/data-query/waiting_for_admission
< t:2023-07-23 05:47:06,338 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 1382 0 0B cdc_test.test_table_preimage/mutation-query/waiting_for_admission
< t:2023-07-23 05:47:06,338 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2008 0 0B cdc_test.test_table_preimage_postimage/data-query/waiting_for_admission
< t:2023-07-23 05:47:06,338 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 1920 0 0B cdc_test.test_table_preimage/data-query/waiting_for_admission
< t:2023-07-23 05:47:06,338 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 1526 0 0B cdc_test.test_table_preimage_postimage/mutation-query/waiting_for_admission
< t:2023-07-23 05:47:06,339 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 1425 0 0B cdc_test.test_table_postimage/mutation-query/waiting_for_admission
< t:2023-07-23 05:47:06,339 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG >
< t:2023-07-23 05:47:06,339 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 10668 1 16K total
< t:2023-07-23 05:47:06,339 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG >
< t:2023-07-23 05:47:06,339 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > Stats:
< t:2023-07-23 05:47:06,339 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > permit_based_evictions: 0
< t:2023-07-23 05:47:06,339 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > time_based_evictions: 0
< t:2023-07-23 05:47:06,339 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > inactive_reads: 0
< t:2023-07-23 05:47:06,339 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > total_successful_reads: 7733930
< t:2023-07-23 05:47:06,339 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > total_failed_reads: 29761
< t:2023-07-23 05:47:06,339 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > total_reads_shed_due_to_overload: 0
< t:2023-07-23 05:47:06,339 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > total_reads_killed_due_to_kill_limit: 0
< t:2023-07-23 05:47:06,340 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > reads_admitted: 7734072
< t:2023-07-23 05:47:06,340 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > reads_enqueued_for_admission: 2725102
< t:2023-07-23 05:47:06,340 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > reads_enqueued_for_memory: 0
< t:2023-07-23 05:47:06,340 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > reads_admitted_immediately: 5049309
< t:2023-07-23 05:47:06,340 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > reads_queued_because_ready_list: 1467987
< t:2023-07-23 05:47:06,340 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > reads_queued_because_need_cpu_permits: 1257115
< t:2023-07-23 05:47:06,340 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > reads_queued_because_memory_resources: 0
< t:2023-07-23 05:47:06,340 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > reads_queued_because_count_resources: 0
< t:2023-07-23 05:47:06,340 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > reads_queued_with_eviction: 0
< t:2023-07-23 05:47:06,340 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > total_permits: 7774411
< t:2023-07-23 05:47:06,340 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > current_permits: 10668
< t:2023-07-23 05:47:06,340 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > need_cpu_permits: 1
< t:2023-07-23 05:47:06,340 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > awaits_permits: 0
< t:2023-07-23 05:47:06,340 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > disk_reads: 0
< t:2023-07-23 05:47:06,340 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > sstables_read: 0
@k0machi can you create a reproducer that would run this case, with this nemesis in the list, to see if it's happening constantly; if it doesn't, please raise it on scylla core.
Will do
@soyacz
Similar failure with 2023..1.2: https://argus.scylladb.com/test/ce70a8fd-1720-4757-b7aa-4fd8dd7d1655/runs?additionalRuns[]=070a0dcc-2d4b-4e7a-8492-81a2578cd7e5
c-s read with cl=ALL starts during_nemesis=AddRemoveDc:
< t:2023-10-02 14:21:18,386 f:stress_thread.py l:287 c:sdcm.stress_thread p:INFO > cassandra-stress read no-warmup cl=ALL n=10000 -schema 'keyspace=keyspace_new_dc compression=LZ4Compressor' -mode cql3 native compression=lz4 -rate threads=5 -pop seq=1..10000 -log interval=5 -node 10.12.0.227,10.12.2.241,10.12.5.0,10.12.6.232,10.12.9.129,10.12.11.243,10.12.0.123 -errors skip-unsupported-columns
< t:2023-10-02 14:21:18,388 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2023-10-02 14:21:18.387: (CassandraStressEvent Severity.NORMAL) period_type=begin event_id=c145fd5a-2596-46a2-8cb2-239515d5cab4: node=Node longevity-cdc-100gb-4h-2023-1-loader-node-070a0dcc-2 [44.212.35.142 | 10.12.7.39] (seed: False)
Failed 2 minutes later:
2023-10-02 14:23:10.170: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=c145fd5a-2596-46a2-8cb2-239515d5cab4 during_nemesis=AddRemoveDc: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=168 node=Node longevity-cdc-100gb-4h-2023-1-loader-node-070a0dcc-2 [44.212.35.142 | 10.12.7.39] (seed: False)
java.io.IOException: Operation x10 on key(s) [34343135393331343731]: Error executing: (ReadTimeoutException): Cassandra timeout during read query at consistency ALL (4 responses were required but only 3 replica responded). In case this was generated during read repair, the consistency level is not representative of the actual consistency.
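The "4 responses were required" count follows directly from the keyspace replication (NetworkTopologyStrategy with us-east=3 and us-east_nemesis_dc=1, i.e. 4 replicas in total): CL=ALL needs every replica, so one slow or unresponsive replica - most likely the single one in the new DC - is enough to fail the read. A quick illustration of the required response counts:

# Required replica responses for keyspace_new_dc
# (NetworkTopologyStrategy: us-east=3, us-east_nemesis_dc=1).
replication = {"us-east": 3, "us-east_nemesis_dc": 1}
total_rf = sum(replication.values())      # 4 replicas in total

required = {
    "ONE": 1,
    "QUORUM": total_rf // 2 + 1,          # 4 // 2 + 1 = 3 -> why QUORUM reads pass
    "ALL": total_rf,                      # 4 -> every replica in both DCs must answer
}
for cl, n in required.items():
    print(f"{cl:>6}: {n} of {total_rf} replicas must respond")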
But after receiving the CRITICAL error it also emitted the warning "Some contact points don't match local data center":
< t:2023-10-02 14:23:39,808 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2023-10-02 14:23:39.807: (CassandraStressEvent Severity.ERROR) period_type=end event_id=c145fd5a-2596-46a2-8cb2-239515d5cab4 during_nemesis=AddRemoveDc duration=2m21s: node=Node longevity-cdc-100gb-4h-2023-1-loader-node-070a0dcc-2 [44.212.35.142 | 10.12.7.39] (seed: False)
< t:2023-10-02 14:23:39,808 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > stress_cmd=cassandra-stress read no-warmup cl=ALL n=10000 -schema 'keyspace=keyspace_new_dc compression=LZ4Compressor' -mode cql3 native compression=lz4 -rate threads=5 -pop seq=1..10000 -log interval=5
< t:2023-10-02 14:23:39,808 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > errors:
< t:2023-10-02 14:23:39,808 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO >
< t:2023-10-02 14:23:39,808 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > Stress command completed with bad status 1: WARN 14:21:26,977 Some contact points don't match local data center. Local DC = us-east_nemesis_dc.
The final removal step of the nemesis started later:
< t:2023-10-02 14:25:17,132 f:common.py l:1726 c:utils p:DEBUG > Executing CQL 'DROP KEYSPACE IF EXISTS keyspace_new_dc' ...
There are reader_concurrency_semaphore messages on node3, with the nemesis keyspace in a waiting state: keyspace_new_dc.standard1/data-query/waiting
< t:2023-10-02 14:22:49,780 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-10-02T14:22:49+00:00 longevity-cdc-100gb-4h-2023-1-db-node-070a0dcc-3 !INFO | scylla[5740]: [shard 4] reader_concurrency_semaphore - (rate limiting dropped 33315 similar messages) Semaphore sl:default_read_concurrency_sem with 1/100 count and 18480/170875944 memory resources: timed out, dumping permit diagnostics:
< t:2023-10-02 14:22:49,781 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > permits count memory table/description/state
< t:2023-10-02 14:22:49,781 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 1 1 18K cdc_test.test_table_preimage_postimage/data-query/active/used
< t:2023-10-02 14:22:49,781 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 5 0 0B keyspace_new_dc.standard1/data-query/waiting
< t:2023-10-02 14:22:49,781 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 1517 0 0B cdc_test.test_table_postimage/mutation-query/waiting
< t:2023-10-02 14:22:49,781 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 1561 0 0B cdc_test.test_table_preimage_postimage/mutation-query/waiting
< t:2023-10-02 14:22:49,781 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 1457 0 0B cdc_test.test_table_preimage/mutation-query/waiting
< t:2023-10-02 14:22:49,781 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 1895 0 0B cdc_test.test_table_preimage/data-query/waiting
< t:2023-10-02 14:22:49,781 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 1830 0 0B cdc_test.test_table_preimage_postimage/data-query/waiting
< t:2023-10-02 14:22:49,781 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 374 0 0B cdc_test.test_table/data-query/waiting
< t:2023-10-02 14:22:49,781 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 1880 0 0B cdc_test.test_table_postimage/data-query/waiting
< t:2023-10-02 14:22:49,781 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG >
< t:2023-10-02 14:22:49,781 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 10520 1 18K total
< t:2023-10-02 14:22:49,781 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG >
@juliayakovlev
Let's open it in scylla core; it seems like it's happening only in this CDC case.
We don't have any information that suggests it's an SCT issue...
@fruch What do you think about this warning? Maybe it points to some problem?:
< t:2023-10-02 14:23:39,808 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2023-10-02 14:23:39.807: (CassandraStressEvent Severity.ERROR) period_type=end event_id=c145fd5a-2596-46a2-8cb2-239515d5cab4 during_nemesis=AddRemoveDc duration=2m21s: node=Node longevity-cdc-100gb-4h-2023-1-loader-node-070a0dcc-2 [44.212.35.142 | 10.12.7.39] (seed: False)
< t:2023-10-02 14:23:39,808 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > stress_cmd=cassandra-stress read no-warmup cl=ALL n=10000 -schema 'keyspace=keyspace_new_dc compression=LZ4Compressor' -mode cql3 native compression=lz4 -rate threads=5 -pop seq=1..10000 -log interval=5
< t:2023-10-02 14:23:39,808 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > errors:
< t:2023-10-02 14:23:39,808 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO >
< t:2023-10-02 14:23:39,808 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > Stress command completed with bad status 1: WARN 14:21:26,977 Some contact points don't match local data center. Local DC = us-east_nemesis_dc.
No, it just means we passed all of the nodes to c-s and some don't match the local DC; it shouldn't cause queries to fail and time out.
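For what it's worth, the warning is just the driver's DC-aware load balancing noticing that the -node list spans two DCs; a purely illustrative sketch of that check follows, where the node-to-DC mapping is an assumption (10.12.0.123 is presumed to be the node in the new DC):

# Illustrative only - not the driver code. The mapping is assumed from the
# -node list in the c-s command, with 10.12.0.123 presumed to be the new-DC node.
contact_points = ["10.12.0.227", "10.12.2.241", "10.12.5.0", "10.12.6.232",
                  "10.12.9.129", "10.12.11.243", "10.12.0.123"]
node_dc = {ip: ("us-east_nemesis_dc" if ip == "10.12.0.123" else "us-east")
           for ip in contact_points}

local_dc = "us-east_nemesis_dc"   # the DC the driver settled on, per the warning text
mismatched = [ip for ip in contact_points if node_dc[ip] != local_dc]
if mismatched:
    print(f"Some contact points don't match local data center. Local DC = {local_dc}")
# Non-local contact points are simply deprioritized by the policy; this alone
# doesn't explain failing or timed-out queries.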
@fruch I found that write load was run with cl=ALL and succeeded:
cassandra-stress write no-warmup cl=ALL n=10000 -schema 'keyspace=keyspace_new_dc replication(strategy=NetworkTopologyStrategy,us-east=3,us-east_nemesis_dc=1) compression=LZ4Compressor compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native compression=lz4 -rate threads=5 -pop seq=1..10000 -log interval=5 -node 10.12.0.227,10.12.2.241,10.12.5.0,10.12.6.232,10.12.9.129,10.12.11.243,10.12.0.123 -errors skip-unsupported-columns
The replication strategy was defined as strategy=NetworkTopologyStrategy,us-east=3,us-east_nemesis_dc=1
But in the read command the schema was defined without any replication strategy:
stress_cmd=cassandra-stress read no-warmup cl=ALL n=10000 -schema 'keyspace=keyspace_new_dc compression=LZ4Compressor' -mode cql3 native compression=lz4 -rate threads=5 -pop seq=1..10000 -log interval=5
I think we need to try to run the read with the correct strategy before opening an issue.
We should fix it, but it's not so critical, since the replication setting is only used when creating the keyspace; if it already exists, it doesn't change/alter it.
So the stress is still not expected to fail.
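To illustrate the point (a sketch of the intended semantics, not the actual cassandra-stress source): the replication() part of -schema only shapes the keyspace-creation statement, so an already-existing keyspace_new_dc keeps whatever replication it was created or altered with:

# Sketch only - the exact CQL c-s issues may differ, but the "create only if
# missing" semantics are what makes the omitted replication() harmless here.
def build_create_keyspace_cql(keyspace, replication):
    opts = ", ".join(f"'{k}': '{v}'" for k, v in replication.items())
    return f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = {{{opts}}};"

print(build_create_keyspace_cql("keyspace_new_dc", {
    "class": "NetworkTopologyStrategy", "us-east": "3", "us-east_nemesis_dc": "1"}))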
I ran the test with read load defined:
- with replication factor:
cassandra-stress read no-warmup cl=ALL n=10000 -schema 'keyspace=keyspace_new_dc replication(strategy=NetworkTopologyStrategy,eu-north=3,eu-north_nemesis_dc=1) compression=LZ4Compressor' -mode cql3 native compression=lz4 -rate threads=5 -pop seq=1..10000 -log interval=5
- without replication factor (as it runs now)
cassandra-stress read no-warmup cl=QUORUM n=10000 -schema 'keyspace=keyspace_new_dc compression=LZ4Compressor' -mode cql3 native compression=lz4 -rate threads=5 -pop seq=1..10000 -log interval=5
The issue that I described here is not reproduced
https://jenkins.scylladb.com/job/scylla-staging/job/yulia/job/yulis-longevity-cdc-100gb-4h-test/10/
Yes, we know it's not 100% reproducible, i.e. it didn't happen on every run of this case.
I ran with master. The problem is not reproduced.
During new datacenter creation, a list of all datacenters is created:
It is then passed to
_write_read_data_to_multi_dc_keyspace
as is, with the assumption that the new datacenter is the last element inside the write function. The specified keyspace would already be created at this point: first it is created with this:
Then it is altered via:
This creates a problem, as nodetool returns sorted output:
Which later leads to stress failure messages like this:
Write happens during the alteration via the context manager, read happens outside the context manager:
However, there's another read inside write after finishing the write at CL=ALL, which is what I think is failing:
self._verify_multi_dc_keyspace_data(consistency_level="ALL")
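To make the ordering point concrete, a tiny self-contained sketch; the first pair of DC names matches this test, the second pair is hypothetical and chosen only so that sorted order differs from creation order:

# The helper assumes the newly added DC is the last element of the list.
def pick_new_dc(datacenters):
    return datacenters[-1]

creation_order = ["us-east", "us-east_nemesis_dc"]     # order the DCs were added
print(pick_new_dc(sorted(creation_order)))             # "us-east_nemesis_dc" - still right,
                                                       # but only because of how these names sort

hypothetical = ["us-east", "ap-south_nemesis_dc"]      # hypothetical names, added DC last
print(pick_new_dc(sorted(hypothetical)))               # "us-east" - the wrong DC is picked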