Closed by yarongilor 1 year ago
The issue is reproduced when running only a single decommission and using a smaller DB instance type, so there are fewer SMP shards and fewer Stream shards.
Installation details
Kernel version: 5.11.6-1.el7.elrepo.x86_64
Scylla version (or git commit hash): 4.5.dev-0.20210319.abab1d906c with build-id c60f1de6f6a94a2b9a6219edd82727f66aaf8c83
Cluster size: 6 nodes (i3.large)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0c8a8490cf80a53a0 (aws: eu-west-1)
Test: longevity-alternator-streams-with-nemesis
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Issue description
====================================
PUT ISSUE DESCRIPTION HERE
====================================
Restore Monitor Stack command:
$ hydra investigate show-monitor a7453f7c-eddb-4777-9cfa-c7931bbcc2f5
Show all stored logs command: $ hydra investigate show-logs a7453f7c-eddb-4777-9cfa-c7931bbcc2f5
Test id:
a7453f7c-eddb-4777-9cfa-c7931bbcc2f5
Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/a7453f7c-eddb-4777-9cfa-c7931bbcc2f5/20210321_101815/grafana-screenshot-alternator-20210321_102430-alternator-streams-nemesis-stream-w-monitor-node-a7453f7c-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/a7453f7c-eddb-4777-9cfa-c7931bbcc2f5/20210321_101815/grafana-screenshot-longevity-alternator-streams-with-nemesis-scylla-per-server-metrics-nemesis-20210321_102155-alternator-streams-nemesis-stream-w-monitor-node-a7453f7c-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/a7453f7c-eddb-4777-9cfa-c7931bbcc2f5/20210321_101815/grafana-screenshot-overview-20210321_101815-alternator-streams-nemesis-stream-w-monitor-node-a7453f7c-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/a7453f7c-eddb-4777-9cfa-c7931bbcc2f5/20210321_150530/db-cluster-a7453f7c.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/a7453f7c-eddb-4777-9cfa-c7931bbcc2f5/20210321_150530/loader-set-a7453f7c.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/a7453f7c-eddb-4777-9cfa-c7931bbcc2f5/20210321_150530/monitor-set-a7453f7c.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/a7453f7c-eddb-4777-9cfa-c7931bbcc2f5/20210321_150530/sct-runner-a7453f7c.zip
@yarongilor can you share the logs from hydra-kcl? I.e., so we can see what it's stuck on, and confirm the suspected issue is what we are seeing?
Did you reproduce this with a smaller number of vnodes?
Did you reproduce this with a smaller number of vnodes?
It was reproduced with the DB instance type changed from 'i3.4xlarge' to 'i3.large'.
I need to rerun in order to get the new KCL debug logs, so if this setting is good enough, I'll rerun again.
Please run with fewer vnodes; otherwise it will still be hard.
The parameter to change is num_tokens, which by default is set to 256. Please change it to 16 and let's see what you get.
@yarongilor while reproducing on a smaller node is very good (it will be cheaper to run the test, and to leave live nodes for debugging), @slivne's question was on vnodes, i.e., the number of tokens per node. This is the "num_tokens" configuration parameter of Scylla. Its default is (IIRC) 256, and it can be set much lower (even 1).
One of the theories was that there is a huge number of new "shards" (in the DynamoDB Streams sense, i.e., partitions in the CDC log) and that the KCL is taking a long time to process them instead of doing useful work (so it's a sort of livelock: KCL is working but not making any progress). The number of "shards" (CDC partitions) is proportional to the number of vnodes (num_tokens), so reducing it from 256 to (say) 1 can lower this amount of work, and maybe we'll see KCL making progress then.
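If it helps to check that theory directly, here is a minimal sketch (my own illustration, assuming the AWS Java SDK v1 DynamoDB Streams client against a placeholder Alternator endpoint) that counts how many stream "shards" DescribeStream reports; running it before and after changing num_tokens should show whether the shard count really scales with the vnode count:

```java
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreams;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreamsClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;

public class CountShards {
    public static void main(String[] args) {
        // Placeholder endpoint: replace with an Alternator node, e.g. http://<node>:8080.
        // Credentials come from the default provider chain (any values work when
        // Alternator authorization is not enforced).
        AmazonDynamoDBStreams streams = AmazonDynamoDBStreamsClientBuilder.standard()
                .withEndpointConfiguration(new EndpointConfiguration("http://<alternator-node>:8080", "eu-west-1"))
                .build();

        // Assumes the test table has exactly one stream enabled.
        String arn = streams.listStreams(new ListStreamsRequest().withTableName("usertable_streams"))
                .getStreams().get(0).getStreamArn();

        int shards = 0;
        String lastShardId = null;
        do {
            // DescribeStream is paginated; keep following LastEvaluatedShardId.
            StreamDescription desc = streams.describeStream(new DescribeStreamRequest()
                    .withStreamArn(arn)
                    .withExclusiveStartShardId(lastShardId))
                    .getStreamDescription();
            shards += desc.getShards().size();
            lastShardId = desc.getLastEvaluatedShardId();
        } while (lastShardId != null);

        System.out.println("stream shards: " + shards);
    }
}
```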
The test was rerun using the minimum of 1 num_tokens and the extended KCL debug configuration. For some reason it looks like no data is transferred to the target table. KCL log: https://drive.google.com/file/d/1RBaWDSyNIImpGfRRGSBXZsPFxjg4UEF8/view?usp=sharing Test details:
Restore Monitor Stack command: $ hydra investigate show-monitor 1ad97d55-5bd1-4217-860f-878a8860ae40
Show all stored logs command: $ hydra investigate show-logs 1ad97d55-5bd1-4217-860f-878a8860ae40
Test id: 1ad97d55-5bd1-4217-860f-878a8860ae40
Logs:
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/1ad97d55-5bd1-4217-860f-878a8860ae40/20210423_084122/db-cluster-1ad97d55.zip
kubernetes - https://cloudius-jenkins-test.s3.amazonaws.com/1ad97d55-5bd1-4217-860f-878a8860ae40/20210423_084122/kubernetes-1ad97d55.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/1ad97d55-5bd1-4217-860f-878a8860ae40/20210423_084122/loader-set-1ad97d55.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/1ad97d55-5bd1-4217-860f-878a8860ae40/20210423_084122/monitor-set-1ad97d55.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/1ad97d55-5bd1-4217-860f-878a8860ae40/20210423_084122/sct-runner-1ad97d55.zip
@yarongilor a few questions:
The nice thing about the KCL log is that it includes all the DynamoDB API requests and responses (in a hard-to-use way, but we take what we get...). You can extract them with: zgrep -i "sending request: post" logfile.gz
I see that in the beginning, from time 21:27:11 until 21:27:12, there are a bunch of PutItem, UpdateItem and GetItem operations; I'm guessing that this is KCL setting up its own metadata. Then at 21:27:12 we start to see GetShardIterator (with the TRIM_HORIZON option) and GetRecords - but as you noticed, not a single PutItem or UpdateItem is called after all these GetRecords start. It turns out that every GetRecords call returns an empty array [], e.g.,
2021-04-22 21:27:12.996 [ForkJoinPool-1-worker-25] DEBUG org.apache.http.wire - http-outgoing-25 << "{"Records":[],"NextShardIterator":"I7cb297b0-a3b1-11eb-938d-69696ee54350:00000000-0000-0000-0000-000000000000:H178fb6d5066:0fddc45f919a0139eeed7cb02c000031"}"
So there was indeed nothing to be written to the copy table.
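For reference, the same check can be done outside the KCL with a few SDK calls. This is only a sketch (assuming the AWS Java SDK v1 DynamoDB Streams client; the endpoint is a placeholder, while the stream ARN and shard id are the ones from the log above): it requests a TRIM_HORIZON iterator for one shard and prints whether GetRecords returns anything, i.e. the same GetShardIterator/GetRecords sequence the KCL log shows.

```java
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreams;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreamsClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;

public class CheckShard {
    public static void main(String[] args) {
        // Placeholder endpoint: replace with an Alternator node, e.g. http://<node>:8080.
        AmazonDynamoDBStreams streams = AmazonDynamoDBStreamsClientBuilder.standard()
                .withEndpointConfiguration(new EndpointConfiguration("http://<alternator-node>:8080", "eu-west-1"))
                .build();

        // Stream ARN and shard id taken from the KCL log above.
        String streamArn = "S7cb297b0-a3b1-11eb-938d-69696ee54350";
        String shardId = "H178fb6d5066:0fddc45f919a0139eeed7cb02c000031";

        // Same call sequence the KCL issues: GetShardIterator(TRIM_HORIZON), then GetRecords.
        String iterator = streams.getShardIterator(new GetShardIteratorRequest()
                .withStreamArn(streamArn)
                .withShardId(shardId)
                .withShardIteratorType(ShardIteratorType.TRIM_HORIZON))
                .getShardIterator();

        GetRecordsResult result = streams.getRecords(new GetRecordsRequest().withShardIterator(iterator));
        System.out.println("records=" + result.getRecords().size()
                + " next=" + result.getNextShardIterator());
    }
}
```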
Are you sure the original table really has data?
@elcallio did you ever test CDC with just a single token per node? Maybe we have bugs in this case, completely unrelated to the Alternator Streams we were looking for?
~/Downloads/logs/sct-runner-1ad97d55$ grep -i 'decommission\|ycsb' events.log
2021-04-22 21:29:08.022: (YcsbStressEvent Severity.NORMAL): type=start node=Node alternator-streams-nemesis-stream-w-loader-node-1ad97d55-2 [63.33.189.19 | 10.0.1.255] (seed: False)
stress_cmd=bin/ycsb load dynamodb -P workloads/workloadc -threads 10 -p recordcount=78125000 -p fieldcount=1 -p fieldlength=128 -p dataintegrity=true -p insertstart=0 -p insertcount=39062500 -p table=usertable_streams -s -P /tmp/dynamodb.properties -p maxexecutiontime=45600
2021-04-22 21:29:19.735: (YcsbStressEvent Severity.NORMAL): type=start node=Node alternator-streams-nemesis-stream-w-loader-node-1ad97d55-3 [34.246.193.196 | 10.0.0.86] (seed: False)
stress_cmd=bin/ycsb load dynamodb -P workloads/workloadc -threads 10 -p recordcount=78125000 -p fieldcount=1 -p fieldlength=128 -p dataintegrity=true -p insertstart=39062500 -p insertcount=39062500 -p table=usertable_streams -s -P /tmp/dynamodb.properties -p maxexecutiontime=45600
2021-04-22 21:44:26.404: (ClusterHealthValidatorEvent Severity.WARNING): type=NodesNemesis node=None message=There are more then expected nodes running nemesis: 10.0.3.238 (non-seed): DecommissionMonkey; 10.0.2.58 (non-seed): Compare tables size by cf-stats
2021-04-22 21:44:26.454: (DisruptionEvent Severity.NORMAL): type=Decommission subtype=start target_node=Node alternator-streams-nemesis-stream-w-db-node-1ad97d55-2 [3.251.79.138 | 10.0.3.238] (seed: False) duration=None
~/Downloads/logs/sct-runner-1ad97d55$ grep -i 'decommission' events.log | grep type=
2021-04-22 21:44:26.404: (ClusterHealthValidatorEvent Severity.WARNING): type=NodesNemesis node=None message=There are more then expected nodes running nemesis: 10.0.3.238 (non-seed): DecommissionMonkey; 10.0.2.58 (non-seed): Compare tables size by cf-stats
2021-04-22 21:44:26.454: (DisruptionEvent Severity.NORMAL): type=Decommission subtype=start target_node=Node alternator-streams-nemesis-stream-w-db-node-1ad97d55-2 [3.251.79.138 | 10.0.3.238] (seed: False) duration=None
2021-04-22 22:09:59.701: (DisruptionEvent Severity.NORMAL): type=Decommission subtype=end target_node=Node alternator-streams-nemesis-stream-w-db-node-1ad97d55-2 [3.251.79.138 | 10.0.3.238] (seed: False) duration=1532
2021-04-23 08:11:42.177: (DisruptionEvent Severity.NORMAL): type=Decommission subtype=start target_node=Node alternator-streams-nemesis-stream-w-db-node-1ad97d55-7 [34.243.251.19 | 10.0.2.214] (seed: False) duration=None
2021-04-23 08:37:07.327: (DisruptionEvent Severity.NORMAL): type=Decommission subtype=end target_node=Node alternator-streams-nemesis-stream-w-db-node-1ad97d55-7 [34.243.251.19 | 10.0.2.214] (seed: False) duration=1524
yarongilor@yaron-pc:~/Downloads/logs/sct-runner-1ad97d55$ head /tmp/kcl-l0-c0-aba64515-5bfc-495a-ad44-3d4aabf7bd50.log.backup2
Starting a Gradle Daemon, 1 incompatible and 1 stopped Daemons could not be reused, use --status for details
Task :compileJava UP-TO-DATE
Task :processResources UP-TO-DATE
Task :classes UP-TO-DATE
SLF4J: Class path contains multiple SLF4J bindings.
Task :run
21:27:05,294 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]
SLF4J: Found binding in [jar:file:/root/.gradle/caches/modules-2/files-2.1/ch.qos.logback/logback-classic/1.2.3/7c4f3c474fb2c041d8028740440937705ebb473a/logback-classic-1.2.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/.gradle/caches/modules-2/files-2.1/org.slf4j/slf4j-log4j12/1.7.5/6edffc576ce104ec769d954618764f39f0f0f10d/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
yarongilor@yaron-pc:~/Downloads/logs/sct-runner-1ad97d55$ tail /tmp/kcl-l0-c0-aba64515-5bfc-495a-ad44-3d4aabf7bd50.log.backup2
2021-04-22 21:28:20.137 [ForkJoinPool-1-worker-25] DEBUG c.a.s.k.c.lib.worker.ProcessTask - Kinesis didn't return any records for shard H178fb6d5066:50700000000000007be604722c000051
2021-04-22 21:28:20.137 [ForkJoinPool-1-worker-29] DEBUG org.apache.http.headers - http-outgoing-15 >> amz-sdk-invocation-id: 71a489bc-94bc-518a-0434-50f101fbd67c
2021-04-22 21:28:20.137 [ForkJoinPool-1-worker-29] DEBUG org.apache.http.headers - http-outgoing-15 >> amz-sdk-retry: 0/0/500
2021-04-22 21:28:20.137 [ForkJoinPool-1-worker-12] DEBUG org.apache.http.headers - http-outgoing-24 << Server: Seastar httpd
2021-04-22 21:28:20.137 [ForkJoinPool-1-worker-4] DEBUG com.amazonaws.request - Received successful response: 200, AWS Request ID: null
2021-04-22 21:28:20.137 [ForkJoinPool-1-worker-12] DEBUG o.a.h.impl.execchain.MainClientExec - Connection can be kept alive for 60000 MILLISECONDS
2021-04-22 21:28:20.137 [ForkJoinPool-1-worker-4] DEBUG com.amazonaws.requestId - x-amzn-RequestId: not available
2021-04-22 21:28:20.137 [ForkJoinPool-1-worker-4] DEBUG com.amazonaws.requestId - AWS Request ID: not available
2021-04-22 21:28:20.137 [ForkJoinPool-1-worker-16] DEBUG org.apache.http.wire - http-outgoing-31 << "Server: Seastar httpd[\r][\n]"
2021-04-22 21:28:20.137 [ForkJoinPool-1-worker-4] DEBUG c.a.s.k.c.lib.worker.ProcessTask - Kinesis didn't return any records for shard H178fb6d5066:0fddc45f919a0139eeed7cb02c000031
3. It makes sense that the KCL thread started 2 minutes before the YCSB stress:
~/Downloads/logs/sct-runner-1ad97d55$ grep YcsbStressEvent events.log | grep type=start
2021-04-22 21:29:08.022: (YcsbStressEvent Severity.NORMAL): type=start node=Node alternator-streams-nemesis-stream-w-loader-node-1ad97d55-2 [63.33.189.19 | 10.0.1.255] (seed: False)
2021-04-22 21:29:19.735: (YcsbStressEvent Severity.NORMAL): type=start node=Node alternator-streams-nemesis-stream-w-loader-node-1ad97d55-3 [34.246.193.196 | 10.0.0.86] (seed: False)
In any case, there are SCT info events that show the size of the source table increasing, like:
(InfoEvent Severity.NORMAL): message=[31661.12052989006/32400] == CompareTablesSizesThread: dst table/src table number of partitions: 0/2942534 ==
@nyh, if it sounds helpful, I can rerun the test with the default num_tokens and compare the results.
Note that the streamdesc returned in the log @yarongilor posted separately is:
{
"StreamDescription":{
"StreamStatus":"ENABLED",
"StreamArn":"S7cb297b0-a3b1-11eb-938d-69696ee54350",
"StreamViewType":"NEW_IMAGE",
"TableName":"usertable_streams",
"KeySchema":[
{
"AttributeName":"p",
"KeyType":"HASH"
}
],
"Shards":[
{
"ShardId":"H178fb6d5066:0c294a6425030839049d71a3c0000021",
"SequenceNumberRange":{
"StartingSequenceNumber":"2552737690387355673456638568324694144"
}
},
{
"ShardId":"H178fb6d5066:0c30000000000000f553cf24c8000021",
"SequenceNumberRange":{
"StartingSequenceNumber":"2552737690387355673456638568324694144"
}
},
{
"ShardId":"H178fb6d5066:0fddc45f919a0139eeed7cb02c000031",
"SequenceNumberRange":{
"StartingSequenceNumber":"2552737690387355673456638568324694144"
}
},
{
"ShardId":"H178fb6d5066:0fe000000000000069236a5ebc000031",
"SequenceNumberRange":{
"StartingSequenceNumber":"2552737690387355673456638568324694144"
}
},
{
"ShardId":"H178fb6d5066:415fa92159361d62c1caace5b8000041",
"SequenceNumberRange":{
"StartingSequenceNumber":"2552737690387355673456638568324694144"
}
},
{
"ShardId":"H178fb6d5066:41600000000000004902bb4ba0000041",
"SequenceNumberRange":{
"StartingSequenceNumber":"2552737690387355673456638568324694144"
}
},
{
"ShardId":"H178fb6d5066:506e7c26c97b3e912ef1ad70f0000051",
"SequenceNumberRange":{
"StartingSequenceNumber":"2552737690387355673456638568324694144"
}
},
{
"ShardId":"H178fb6d5066:50700000000000007be604722c000051",
"SequenceNumberRange":{
"StartingSequenceNumber":"2552737690387355673456638568324694144"
}
},
{
"ShardId":"H178fb6d5066:51be4755ff85087f6e510c7d80000001",
"SequenceNumberRange":{
"StartingSequenceNumber":"2552737690387355673456638568324694144"
}
},
{
"ShardId":"H178fb6d5066:51c0000000000000fea7723b34000001",
"SequenceNumberRange":{
"StartingSequenceNumber":"2552737690387355673456638568324694144"
}
},
{
"ShardId":"H178fb6d5066:991eb91afb9aae3a2cbda52a98000011",
"SequenceNumberRange":{
"StartingSequenceNumber":"2552737690387355673456638568324694144"
}
},
{
"ShardId":"H178fb6d5066:992000000000000001575622b4000011",
"SequenceNumberRange":{
"StartingSequenceNumber":"2552737690387355673456638568324694144"
}
}
]
}
}
This means that:
* We have a very small number of vnodes in this cluster
* There was no topology change inside the TTL of CDC (24h)
* There is no data in the shards above (seemingly, since we get none from reading)
At least the two last points seem wrong/strange(?)
Note: The table can also be younger than the last topology change. Is this the data table being processed (usertable_streams)?
@yarongilor it seems to me that you gave us a KCL log spanning time 21:27:05 through 21:28:20, but the writer only started at 21:29:08, so obviously there will be no data in the table and KCL will not read any data; this log is not interesting and not surprising...
I think what we wanted is a KCL log covering the problem that you had, namely that after returning some data, KCL simply stops returning new data. If you want to limit KCL's log to one minute, pick a minute where KCL seems "hung" and not returning any new data - what we hoped to see was what it was doing at that time.
Everything could have been simpler if this test didn't take 10 hours. Did you try to make it much shorter? Also, does the test failure (KCL hang) depend on having a lot of node operations one after another, or does doing a node operation just once cause the problem? Even if the answer to the last question is no (one topology change doesn't break the client), it would be good to know - it reduces the seriousness of the bug (typical users will not be doing a lot of topology changes in sequence).
* We have a very small number of vnodes in this cluster
Yes, this was a 1-vnode (per node) test. However, @yarongilor, please note that the whole point of the 1-vnode test was to check whether it makes the bug go away or whether the bug remains. Well, you forgot to tell us what the conclusion was - is the bug dependent on having many vnodes, or does it also exist in the 1-vnode case? You don't need any KCL logs to answer that - it's just the question of whether your KCL workload manages to duplicate the table as your test expects, or not.
* There was no topology change inside the TTL of CDC (24h)
Yes, @yarongilor said that this KCL log is from before any data was inserted to the table, and 10 minutes before (!) any topology change. So this KCL log is not interesting.
* There is no data in the shards above (seemingly, since we get none from reading)
At least the two last points seem wrong/strange(?)
It means the KCL log is from the wrong time - before any data existed in the table and before any topology changes :-(
Just to add some more useless info: I re-ran testing of hydra-kcl (i.e. the alternator streams replication process) with an initial two-node cluster, but paused data generation to add a third node, then added a fourth node while actually streaming. Other than some race warnings (expected, apparently) from the shard lease coordinator in kinesis, it worked flawlessly (if ever so slow). I have yet to see that the integration has any issue with CDC generations, even when it splits the data into new shard sets.
Updating with this follow-up test, according to the lighter minimal test Core asked for (#8012 reproduction).
In this test it seems KCL continued working properly after a single decommission and add-node operation.
The two screenshots below show a scenario of:
Therefore, the question now is what the next step is. Is it interesting to proceed by testing with the default number of vnodes?
Installation details
Kernel version: 5.4.0-1035-aws
Scylla version (or git commit hash): 4.5.rc2-0.20210519.5651a20ba
Cluster size: 6 nodes (i3.xlarge)
Scylla running with shards number (live nodes):
alternator-streams-nemesis-stream-w-db-node-05d76290-1 (35.176.133.203 | 10.0.0.219): 4 shards
alternator-streams-nemesis-stream-w-db-node-05d76290-2 (18.132.47.2 | 10.0.3.28): 4 shards
alternator-streams-nemesis-stream-w-db-node-05d76290-3 (35.178.144.251 | 10.0.2.91): 4 shards
alternator-streams-nemesis-stream-w-db-node-05d76290-4 (18.133.238.233 | 10.0.2.240): 4 shards
alternator-streams-nemesis-stream-w-db-node-05d76290-6 (18.130.232.82 | 10.0.1.25): 4 shards
alternator-streams-nemesis-stream-w-db-node-05d76290-7 (3.10.116.23 | 10.0.0.10): 4 shards
Scylla running with shards number (terminated nodes):
alternator-streams-nemesis-stream-w-db-node-05d76290-5 (18.135.6.164 | 10.0.3.217): 4 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0866807c5fe93b793
(aws: eu-west-2)
Test: longevity-alternator-streams-with-nemesis
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Issue description
====================================
PUT ISSUE DESCRIPTION HERE
====================================
Restore Monitor Stack command: $ hydra investigate show-monitor 05d76290-5647-4b1e-941a-d099543aa9ce
Show all stored logs command: $ hydra investigate show-logs 05d76290-5647-4b1e-941a-d099543aa9ce
Test id: 05d76290-5647-4b1e-941a-d099543aa9ce
Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/05d76290-5647-4b1e-941a-d099543aa9ce/20210523_124344/grafana-screenshot-alternator-20210523_124512-alternator-streams-nemesis-stream-w-monitor-node-05d76290-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/05d76290-5647-4b1e-941a-d099543aa9ce/20210523_124344/grafana-screenshot-longevity-alternator-streams-with-nemesis-scylla-per-server-metrics-nemesis-20210523_124502-alternator-streams-nemesis-stream-w-monitor-node-05d76290-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/05d76290-5647-4b1e-941a-d099543aa9ce/20210523_124344/grafana-screenshot-overview-20210523_124344-alternator-streams-nemesis-stream-w-monitor-node-05d76290-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/05d76290-5647-4b1e-941a-d099543aa9ce/20210523_124723/db-cluster-05d76290.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/05d76290-5647-4b1e-941a-d099543aa9ce/20210523_124723/loader-set-05d76290.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/05d76290-5647-4b1e-941a-d099543aa9ce/20210523_124723/monitor-set-05d76290.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/05d76290-5647-4b1e-941a-d099543aa9ce/20210523_124723/sct-runner-05d76290.zip
Using a "normal" vnode set should work just fine, given a reasonable interval between topology changes. I think it would be good to run this with a standard node set and a topology change every 15-30 minutes or something. To ensure "slightly" stressed environment works.
The issue is reproduced with the default number of vnodes; see the screenshot. The live monitor is at: http://18.133.175.74:3000/d/alternator-4-5/alternator?orgId=1&refresh=30s&from=now-1h&to=now
| alternator-streams-nemesis-stream-s-monitor-node-71db9528-1 | eu-west-2a | running | 71db9528-8aec-4712-9628-35e6732947a6 | yarongilor | Tue May 25 08:23:24 2021 |
If I read this correctly, it is stuck doing DescribeStream, so what is the frequency of topology changes?
AFAIU, there is only one topology change and that's it.
The live log would suggest something was happening? But not any longer. Did the cluster die? Also, how many shards/node did the cluster have (i.e. how many vnodes total?) I am not great at reading this from prometheus...
Monitor says "down"...
Right, it was set to destroy the nodes at test end. The monitor and logs should now be available in the details below. @elcallio, if needed, the test can be re-run, keeping the nodes.
Installation details
Kernel version: 5.4.0-1035-aws
Scylla version (or git commit hash): 4.5.rc2-0.20210524.b81919dbe with build-id f22c2ac3d227678bb260ee4447b5cdc3f47c6935
Cluster size: 6 nodes (i3.4xlarge)
Scylla running with shards number (live nodes):
alternator-streams-nemesis-stream-s-db-node-71db9528-1 (3.8.117.225 | 10.0.2.236): 14 shards
alternator-streams-nemesis-stream-s-db-node-71db9528-2 (3.8.156.55 | 10.0.1.43): 14 shards
alternator-streams-nemesis-stream-s-db-node-71db9528-3 (35.178.36.120 | 10.0.2.143): 14 shards
alternator-streams-nemesis-stream-s-db-node-71db9528-4 (3.9.117.255 | 10.0.1.167): 14 shards
alternator-streams-nemesis-stream-s-db-node-71db9528-5 (18.133.233.190 | 10.0.3.184): 14 shards
alternator-streams-nemesis-stream-s-db-node-71db9528-7 (3.8.177.149 | 10.0.1.14): 14 shards
Scylla running with shards number (terminated nodes):
alternator-streams-nemesis-stream-s-db-node-71db9528-6 (18.134.151.180 | 10.0.1.49): 14 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0bf97b5ecab8e99e7
(aws: eu-west-2)
Test: longevity-alternator-streams-with-nemesis
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Issue description
====================================
PUT ISSUE DESCRIPTION HERE
====================================
Restore Monitor Stack command: $ hydra investigate show-monitor 71db9528-8aec-4712-9628-35e6732947a6
Show all stored logs command: $ hydra investigate show-logs 71db9528-8aec-4712-9628-35e6732947a6
Test id: 71db9528-8aec-4712-9628-35e6732947a6
Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/71db9528-8aec-4712-9628-35e6732947a6/20210525_102128/grafana-screenshot-alternator-20210525_102313-alternator-streams-nemesis-stream-s-monitor-node-71db9528-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/71db9528-8aec-4712-9628-35e6732947a6/20210525_102128/grafana-screenshot-longevity-alternator-streams-with-nemesis-scylla-per-server-metrics-nemesis-20210525_102302-alternator-streams-nemesis-stream-s-monitor-node-71db9528-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/71db9528-8aec-4712-9628-35e6732947a6/20210525_102128/grafana-screenshot-overview-20210525_102128-alternator-streams-nemesis-stream-s-monitor-node-71db9528-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/71db9528-8aec-4712-9628-35e6732947a6/20210525_102559/db-cluster-71db9528.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/71db9528-8aec-4712-9628-35e6732947a6/20210525_102559/loader-set-71db9528.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/71db9528-8aec-4712-9628-35e6732947a6/20210525_102559/monitor-set-71db9528.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/71db9528-8aec-4712-9628-35e6732947a6/20210525_102559/sct-runner-71db9528.zip
It looks like the test was still doing UpdateItem calls when it got terminated? Why was it terminated? What determined that it failed? Are we sure it did not just take a looong time?
Monitor says "down"...
The nodes are terminated, but the monitor is alive and can show all test details. One unclear issue is that Grafana reports no GetRecords or GetShardIterator calls after the nemesis, yet it does report continuing writes. The writes can come from either YCSB or KCL, but they seem to continue after the YCSB stress finished as well. @elcallio, does it make sense that only UpdateItem continues, while none of the other Streams queries appear for all this time?
Yeah, I would somewhat want to look at the running cluster once it gets stuck.
One thing: I don't think the KCL in this test is running with periodic shard sync; instead it uses sync-on-end-of-shard. This is not very good for Alternator, since we have a gazillion shards, most of them empty. This would sort of explain why we seem to be doing an exponential number of shard syncs (vis-a-vis the actual shards existing). The kinesis library is very much written for situations where a stream has ~1-2 shards.
I sent a PR ages ago (probably borked by now) that changed KCL to use periodic shard sync.
The test is rerunning with periodic sync mode. It still looks like the GetRecords and GetShardIterator APIs drop to zero after the decommission nemesis; only UpdateItem continues. Live monitor: http://18.130.33.79:3000/d/alternator-4-5/alternator?orgId=1&refresh=10s&from=now-1h&to=now
| alternator-streams-nemesis-stream-s-monitor-node-da4dae8e-1 | eu-west-2a | running | da4dae8e-cf75-41f2-9d5b-fbe3343a6b44 | yarongilor | Wed May 26 13:05:42 2021 |
Is there a kinesis log available?
Is there a kinesis log available?
ssh -l ubuntu -i ~/.ssh/scylla-qa-ec2 3.8.101.183 tail -f /home/ubuntu/sct-results/latest/alternator-streams-nemesis-stream-s-loader-set-da4dae8e/alternator-streams-nemesis-stream-s-loader-node-da4dae8e-1/kcl-l0-c0-154cb444-da38-47b8-b7d6-0c8311571005.log
I just realized that the kinesis library uses a hardcoded period of 1000 ms for the sync task. I very much doubt it can finish even the ~1200 calls it requires for syncing in that time. So we might be hammering DescribeStream queries instead. Still strange that it is apparently completely starved.
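To illustrate the timing (a toy sketch, not the kinesis library's actual code): with scheduleAtFixedRate at a 1000 ms period, a sync round that needs ~1200 sequential DescribeStream round trips overruns the period many times over, so the next round starts as soon as the previous one finishes and the sync task ends up running back-to-back with no idle time:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicSyncToy {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Hardcoded 1 s period, mimicking the library's sync task schedule.
        scheduler.scheduleAtFixedRate(PeriodicSyncToy::shardSync, 0, 1000, TimeUnit.MILLISECONDS);
    }

    static void shardSync() {
        long start = System.currentTimeMillis();
        // Pretend each of ~1200 shards needs one DescribeStream round trip of a few ms.
        for (int i = 0; i < 1200; i++) {
            simulateDescribeStreamCall(5);
        }
        System.out.println("sync round took " + (System.currentTimeMillis() - start) + " ms");
        // scheduleAtFixedRate never runs rounds concurrently, but once a round overruns
        // the period, the next one starts immediately - so this task runs continuously.
    }

    static void simulateDescribeStreamCall(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```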
Unfortunately, this was not run with debug logging, so it is not possible to say which sync belongs to which iteration. But it is quite obvious that it is doing shard sync continuously, which explains the slowness. Has the test finished/succeeded in copying the data? How much has it copied, if any?
Also: the logs indicate communication failures at one point (connection failed). Did the test kill a node somewhere, not just add?
Also: the logs indicate communication failures at one point (connection failed). Did the test kill a node somewhere, not just add?
The test decommissioned one node, then added a new one.
Unfortunately, this was not run with debug logging, so it is not possible to say which sync belongs to which iteration. But it is quite obvious that it is doing shard sync continuously. Which explains the slowness. The test has not finished/succeeded in copying the data? How much has it copied, if any?
The test successfully copied about a quarter of the data, then stopped copying entirely at the point of the decommission nemesis.
ubuntu@ip-10-0-0-94:~/sct-results/latest$ grep 'c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread' sct.log
< t:2021-05-26 13:34:44,960 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 0/2986 ==
< t:2021-05-26 13:37:14,167 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 208583/288284 ==
< t:2021-05-26 13:39:51,250 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 590279/701374 ==
< t:2021-05-26 13:42:41,562 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 902205/1087922 ==
< t:2021-05-26 13:45:11,154 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1060908/1258886 ==
< t:2021-05-26 13:47:32,582 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1031642/1577025 ==
< t:2021-05-26 13:50:07,980 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1192347/2135199 ==
< t:2021-05-26 13:52:30,679 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1031642/2305248 ==
< t:2021-05-26 13:55:07,709 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1190842/2877722 ==
< t:2021-05-26 13:57:35,909 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1192347/3216292 ==
< t:2021-05-26 14:00:02,220 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1190842/3594297 ==
< t:2021-05-26 14:02:36,355 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1060908/3671122 ==
< t:2021-05-26 14:04:56,317 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1190842/4236765 ==
< t:2021-05-26 14:07:15,766 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1192347/4205490 ==
< t:2021-05-26 14:09:48,338 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1001243/3855921 ==
< t:2021-05-26 14:12:15,490 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1050424/3759807 ==
< t:2021-05-26 14:14:24,621 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1001243/3855921 ==
< t:2021-05-26 14:16:34,578 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1190842/4242057 ==
< t:2021-05-26 14:18:45,956 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1001243/3855921 ==
< t:2021-05-26 14:20:57,986 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1060908/3925000 ==
< t:2021-05-26 14:23:21,491 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1031642/4016580 ==
< t:2021-05-26 14:25:35,211 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1001243/3855921 ==
< t:2021-05-26 14:27:44,683 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1060908/3925000 ==
< t:2021-05-26 14:29:54,076 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1190842/4242057 ==
< t:2021-05-26 14:32:11,017 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1192347/4205490 ==
< t:2021-05-26 14:34:20,627 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1192347/4205490 ==
< t:2021-05-26 14:36:29,758 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1050424/3759807 ==
< t:2021-05-26 14:38:38,653 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1031642/4016580 ==
< t:2021-05-26 14:40:48,344 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1192347/4205490 ==
< t:2021-05-26 14:43:01,487 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1050424/3759807 ==
< t:2021-05-26 14:45:10,522 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1031642/4016580 ==
< t:2021-05-26 14:47:19,579 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1190842/4242057 ==
< t:2021-05-26 14:49:33,470 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1050424/3759807 ==
< t:2021-05-26 14:51:42,332 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1031642/4016580 ==
< t:2021-05-26 14:53:51,949 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1060908/3925000 ==
< t:2021-05-26 14:56:03,734 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1192347/4205490 ==
< t:2021-05-26 14:58:13,235 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1192347/4205490 ==
< t:2021-05-26 15:00:24,624 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1031642/4016580 ==
< t:2021-05-26 15:02:33,849 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1050424/3759807 ==
< t:2021-05-26 15:04:42,659 f:kcl_thread.py l:133 c:sdcm.kcl_thread p:INFO > == CompareTablesSizesThread: dst table/src table number of partitions: 1031642/4016580 ==
It looks like the node decommissioned was in fact the coordinator/endpoint node for alternator?
I notice that the KCL code does not handle write failures, i.e. when the Alternator endpoint died, it simply dropped the records being written.
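For what it's worth, here is a sketch of the kind of retry the copier could add around its writes. This is a hypothetical helper (it is not part of hydra-kcl), assuming the SDK v1 AmazonDynamoDB client the worker already uses:

```java
import com.amazonaws.AmazonClientException;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;

public class RetryingWriter {
    // Retry a PutItem a few times with linear backoff instead of silently
    // dropping the record when the coordinator node goes away mid-request.
    static void putWithRetry(AmazonDynamoDB ddb, PutItemRequest request, int attempts) {
        AmazonClientException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                ddb.putItem(request);
                return;                         // write succeeded
            } catch (AmazonClientException e) {
                last = e;                       // connection refused / 5xx from a dying endpoint, etc.
                try {
                    Thread.sleep(1000L * (i + 1));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        // Surface the failure instead of dropping the record on the floor.
        throw (last != null ? last : new AmazonClientException("putItem failed"));
    }
}
```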
It looks like the node decommissioned was in fact the coordinator/endpoint node for alternator?
The logs seem to show 2 different nodes - the KCL target node is 10.0.1.224:
centos 15330 0.0 0.2 1078608 67824 ? Ssl 13:32 0:00 docker exec -i d0e8362b5100bb818d1d4c9390b046c83228fcd1f1bcebc154b4e9987198d3f5 /bin/bash -c cd /hydra-kcl && ./gradlew run --args=' -t usertable_streams --timeout 32400 -e http://10.0.1.224:8080 '
root 15364 0.3 0.7 12126936 240656 ? Ssl 13:32 0:06 /usr/local/openjdk-8/bin/java -Dorg.gradle.appname=gradlew -classpath /hydra-kcl/gradle/wrapper/gradle-wrapper.jar org.gradle.wrapper.GradleWrapperMain run --args= -t usertable_streams --timeout 32400 -e http://10.0.1.224:8080
The decommissioned node is 10.0.0.166:
< t:2021-05-26 13:37:46,792 f:cluster.py l:2766 c:sdcm.cluster_aws p:DEBUG > Node alternator-streams-nemesis-stream-s-db-node-da4dae8e-3 [18.130.99.162 | 10.0.0.166] (seed: False): Command '/usr/bin/nodetool decommission ' duration -> 69.36686605100022 s
< t:2021-05-26 13:37:46,825 f:file_logger.py l:80 c:sdcm.sct_events.file_logger p:INFO > 2021-05-26 13:37:46.804: (NodetoolEvent Severity.NORMAL): type=decommission subtype=end node=alternator-streams-nemesis-stream-s-db-node-da4dae8e-3 duration=69.36686605100022s
Yes, but this code is AFAIK using the load-balancing request handler, so we can very well hit the down node, since there is a delay in updating liveness.
How many threads is this running with (the --threads argument to kcl)? It might be an idea to bounce this up from the default of num CPUs, since kinesis does some wacky things, and we might want to increase parallelism at the risk of contention...
And by bounce up, I mean increase to maybe a few hundred or something. Though even here, kinesis is likely to overwhelm concurrency with describestream tasks.
So this is not gonna work. Or at least not for anything beyond 1-3 cores or so. And even then, very slowly.
The heart of the problem is that kinesis is written rather terribly. It creates tasks for every shard (active or not - the ones with parents will "wait" for parent processing). Already here we can start to worry, since we typically have in the order of 10k shards or so in any given cluster.
But of course it gets worse. When executed, these "shard consumers" will attempt to read data. If a consumer gets data, all is well: it reports it and finishes, adding a new task to continue reading. But if it gets no data, it will issue a Thread.sleep to pause.
But wait, you say. That is not good when we are essentially using threads (in a workpool). We are squandering an execution resource. Possibly halting execution. Yes. Very much so. In fact, this is why KCL is so g**damn slow.
If run against a single-node cluster with one core, and setting thread pool size to something slightly larger than shard count, the test/program executes in ~2s. But once we hit thread pool limits and start sleeping, the execution slows to a crawl.
Now you ask, does this get worse with shard sync/cluster topology changes? Yes, yes it does. Very much so. Because either we run into the dreaded end-of-shard sync, which issues num-shards new syncs that also wait for each other but wholly ignore whether the retrieved data says we could skip reading from the server; or, if we run periodic sync, we run into the fact that the period is hardcoded, and a pretty short period at that. But wait, there is a parameter for this, is there not? Yes. But it causes another sleep to be issued, blocking things even more.
Bottom line: to have this work in any reliable manner, you need a thread pool that is num-shards large, plus some extra. But since we have so many shards, this is not feasible. The app, and moreover the IO (sockets), die a horrible death.
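A toy sketch of that failure mode (deliberately not KCL's actual code, just the same pattern): a fixed work pool in which every "shard consumer" that finds no data sleeps on its pool thread before re-enqueueing itself. As soon as the number of empty shards exceeds the pool size, the pool spends almost all of its time asleep:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class SleepyPoolToy {
    static final int SHARDS = 10_000;      // order of magnitude of stream shards in a cluster
    static final int POOL_SIZE = 32;       // typical worker pool size
    static final AtomicLong recordsRead = new AtomicLong();

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
        // One task per shard, like one shard consumer per stream shard.
        for (int shard = 0; shard < SHARDS; shard++) {
            final int id = shard;
            pool.submit(() -> consumeShard(pool, id));
        }
        // Run it for a while and watch how few "reads" actually happen.
    }

    static void consumeShard(ExecutorService pool, int shard) {
        boolean gotData = shard % 1000 == 0;   // pretend ~0.1% of shards have data
        if (gotData) {
            System.out.println("shard " + shard + " read data, total=" + recordsRead.incrementAndGet());
        } else {
            // The problematic pattern: blocking an execution resource with a sleep
            // instead of rescheduling the poll for later.
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        pool.submit(() -> consumeShard(pool, shard));   // re-enqueue, like a new task per iteration
    }
}
```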
One option is to fork the library and rewrite it to use futures + scheduled executors to do proper job scheduling instead (like the CDC lib).
Unfortunately, because of arbitrary and also hardcoded limitations on identifiers in Dynamo streams (streampos, I am looking at you), the option of making shards a set of stream IDs does not work. We can't represent enough data in the field.
Should really have determined this before. As usual, the culprit is that while this library/program works, it cannot scale in either cluster size or topology history.
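For contrast, a sketch of the futures + scheduled-executor shape mentioned above (again just an illustration of the scheduling model, not an actual rewrite of the library): an empty poll does not block a thread, it only schedules the next poll for later, so a handful of threads can service tens of thousands of shards.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ScheduledPollerToy {
    static final ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(Runtime.getRuntime().availableProcessors());

    public static void main(String[] args) {
        for (int shard = 0; shard < 10_000; shard++) {
            pollShard(shard);
        }
    }

    static void pollShard(int shard) {
        fetchRecordsAsync(shard).thenAccept(count -> {
            if (count > 0) {
                // Process the records, then poll again right away.
                pollShard(shard);
            } else {
                // Nothing to read: hand the delay to the timer instead of a thread.
                scheduler.schedule(() -> pollShard(shard), 1, TimeUnit.SECONDS);
            }
        });
    }

    // Stand-in for an async GetRecords call; a real implementation would use a
    // non-blocking HTTP client (as the SDK v2 / kinesis 2.x clients do).
    static CompletableFuture<Integer> fetchRecordsAsync(int shard) {
        return CompletableFuture.supplyAsync(() -> shard % 1000 == 0 ? 1 : 0, scheduler);
    }
}
```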
So, an update on looking at alternate solutions. The good news is that the DynamoDB SDK v2 has cleaned up its code: not only is it now harder to read and allocates more transient objects, but it also actually fixed its execution model and is now fully future driven. Even better, kinesis 2.x is also future based, and can indeed execute fully single-threaded should the need arise. The bad news is that kinesis 2.x does not use/support DynamoDB Streams; it uses Kinesis streams instead. https://docs.aws.amazon.com/kinesis/latest/APIReference/API_Operations.html
So while we can write a DynamoDB copier using the raw API, all the magic lease (stream position) management you get with kinesis is lost and needs to be reinvented from scratch.
The bad news is that kinesis 2.x does not use/support dynamodb streams. They instead use kinesis streams.
Wow, I didn't notice that this happened during the covid period: https://aws.amazon.com/about-aws/whats-new/2020/11/now-you-can-use-amazon-kinesis-data-streams-to-capture-item-level-changes-in-your-amazon-dynamodb-table/ I was sure that DynamoDB Streams was already almost identical to Kinesis, but I guess it wasn't identical enough? I'll open a new issue about that.
Hm... how far apart are the APIs?
In Kinesis 1.x, DynamoDB streams was not actually supported by kinesis itself, but through the project https://github.com/awslabs/dynamodb-streams-kinesis-adapter. This has not been updated to support 2.x, and according to the issues page, there are no plans to do so.
The kinesis APIs are semantically similar, but very different between versions. What is worse, as for the places where one would hook in a library/adapter, they seem to have almost intentionally made it difficult, so that you can't effectively do it without basically implementing the whole upper transport layer. If I were cynical, I would say this is an intentional strategy to make people move to using Kinesis streams with their DynamoDB, which I assume costs more.
Forking dynamodb-streams-kinesis-adapter and moving it to the 2.x code base is not a trivial task.
@roydahan, @slivne, I think we can close this issue, following @elcallio's introduction of https://github.com/elcallio/kraken-dynamo. The new lib is already tested and encounters issues.
Pushing this out - let's go back to this on the next CDC call.
This has been pushed out several times from version to version. Removing it from 5.2 to the large backlog.
@yarongilor please close it if it's not relevant anymore (per your comment from 2021).
/Cc @roydahan
Installation details
Scylla version (or git commit hash): 4.4.dev-0.20210104.d5da455d9 with build-id 56ab4a808a56049c86c5af7a0f4a680dc37f36d1
Cluster size: 6 nodes (i3.4xlarge)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-018ec229a4231314c (aws: eu-west-1)
Test: longevity-alternator-streams-with-nemesis
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Issue description
====================================
scenario:
==> result: YCSB reported verification errors as below.
====================================
Restore Monitor Stack command:
$ hydra investigate show-monitor b71161a4-182a-4447-8a35-9e9c63e89b16
Show all stored logs command: $ hydra investigate show-logs b71161a4-182a-4447-8a35-9e9c63e89b16
Test id:
b71161a4-182a-4447-8a35-9e9c63e89b16
Logs:
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/b71161a4-182a-4447-8a35-9e9c63e89b16/20210105_233028/db-cluster-b71161a4.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/b71161a4-182a-4447-8a35-9e9c63e89b16/20210105_233028/loader-set-b71161a4.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/b71161a4-182a-4447-8a35-9e9c63e89b16/20210105_233028/monitor-set-b71161a4.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/b71161a4-182a-4447-8a35-9e9c63e89b16/20210105_233028/sct-runner-b71161a4.zip
Jenkins job URL
YCSB verification error:
The YCSB insert and read commands, as found in the SCT yaml file, are:
Some temporary YCSB write connection issues were noticed during the test, probably due to decommissions, but overall it reported a successful run:
The second YCSB thread, running the second range of inserts, reported successful writes as well:
The issue was also reproduced on another run, using a ChaosMonkey of multiple nemeses instead of the Decommission nemesis: