scylladb / scylla-bench


"unable to create session: unable to discover protocol version: gocql: no response received from cassandra within timeout period" #95

Closed yarongilor closed 3 days ago

yarongilor commented 2 years ago

Installation details

Cluster size: 4 nodes (i3en.3xlarge)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: ami-07835983d717b1ea3 (aws: eu-west-1)

Test: longevity-large-partition-4days-test
Test id: e8db664c-fe93-4ab9-9121-a2b9ed32d109
Test name: scylla-5.0/longevity/longevity-large-partition-4days-test
Test config file(s):

Issue description

scylla-bench v0.1.3 fails on start with an error of:

< t:2022-04-28 13:29:26,056 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 2022/04/28 13:29:25 gocql: unable to create session: unable to discover protocol version: gocql: no response received from cassandra within timeout period
< t:2022-04-28 13:29:26,056 f:base.py         l:146  c:RemoteLibSSH2CmdRunner p:ERROR > Error executing command: "/$HOME/go/bin/scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=6 -clustering-row-count=10000000 -partition-offset=54 -clustering-row-size=2048 -concurrency=250 -rows-per-request=2000 -timeout=180s -connection-count 150 -error-at-row-limit 1000 -nodes 10.0.0.64"; Exit status: 1
< t:2022-04-28 13:29:26,057 f:base.py         l:150  c:RemoteLibSSH2CmdRunner p:DEBUG > STDERR: 2022/04/28 13:29:25 gocql: unable to create session: unable to discover protocol version: gocql: no response received from cassandra within timeout period
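
For reference, a minimal gocql sketch (not scylla-bench code; the host, protocol version and timeouts below are illustrative) of the session setup that fails here. When ProtoVersion is left unset, gocql first probes the cluster to discover the protocol version, and if that probe gets no reply within the connect timeout it fails with exactly the error above; pinning the version skips the discovery step.

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Node taken from the failing command; all values below are illustrative.
	cluster := gocql.NewCluster("10.0.0.64")
	cluster.Consistency = gocql.Quorum
	cluster.ConnectTimeout = 10 * time.Second // budget for the initial handshake / version probe
	cluster.Timeout = 180 * time.Second       // mirrors scylla-bench's -timeout=180s
	cluster.ProtoVersion = 4                  // pin the version to skip discovery entirely

	session, err := cluster.CreateSession()
	if err != nil {
		// This is where "unable to create session: unable to discover protocol
		// version: ..." surfaces when the probe times out.
		log.Fatalf("unable to create session: %v", err)
	}
	defer session.Close()
}
```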

loader info:

< t:2022-04-28 12:50:25,758 f:cluster.py      l:3055 c:sdcm.cluster         p:INFO  > Cluster longevity-large-partitions-4d-5-0-loader-set-1ec8c3b8 (AMI: ['ami-0f1099eae729b80f1'] Type: c5n.2xlarge): Init nodes
< t:2022-04-28 12:50:26,344 f:cluster_aws.py  l:352  c:sdcm.cluster         p:INFO  > Cluster longevity-large-partitions-4d-5-0-loader-set-1ec8c3b8 (AMI: ['ami-0f1099eae729b80f1'] Type: c5n.2xlarge): Found provisioned instances = [ec2.Instance(id='i-0ed9c22d68ff40f84'), ec2.Instance(id='i-00f4982101d9f4cfe'), ec2.Instance(id='i-01b2012c36471bb1b')]
< t:2022-04-28 13:08:48,112 f:cluster.py      l:3482 c:sdcm.cluster         p:DEBUG > Class instance: Cluster longevity-large-partitions-4d-5-0-loader-set-1ec8c3b8 (AMI: ['ami-0f1099eae729b80f1'] Type: c5n.2xlarge)

It seems to run well on another longevity (200k-pk) that uses a different AMI and a different instance type:

< t:2022-04-13 07:53:31,070 f:cluster.py      l:3055 c:sdcm.cluster         p:INFO  > Cluster longevity-large-partitions-200k-pks-loader-set-e9f28ce7 (AMI: ['ami-0e8fcdd01f9f0389a'] Type: c5.xlarge): Init nodes
< t:2022-04-13 07:53:31,780 f:cluster_aws.py  l:352  c:sdcm.cluster         p:INFO  > Cluster longevity-large-partitions-200k-pks-loader-set-e9f28ce7 (AMI: ['ami-0e8fcdd01f9f0389a'] Type: c5.xlarge): Found provisioned instances = [ec2.Instance(id='i-0d5eb42e89189fdb6'), ec2.Instance(id='i-09109e2d0e23e20e8'), ec2.Instance(id='i-03b14390dde643d80'), ec2.Instance(id='i-075bdcc768c199194')]
< t:2022-04-13 08:16:21,521 f:cluster.py      l:3482 c:sdcm.cluster         p:DEBUG > Class instance: Cluster longevity-large-partitions-200k-pks-loader-set-e9f28ce7 (AMI: ['ami-0e8fcdd01f9f0389a'] Type: c5.xlarge)


Logs:

Jenkins job URL

yarongilor commented 2 years ago

@fgelcer , @roydahan , it looks like it constantly fails in the large-partitions-4d test. It doesn't sound like adding retries will be useful here; perhaps only playing with the AMI/instance type would make a difference (it may also be the same AMI in different regions). Please advise.

juliayakovlev commented 2 years ago

longevity-twcs-3h-test, 2022.1.rc5 http://13.48.103.68/test_run/4f585541-af89-4da9-bd54-edf413637a9b

2 scylla-bench threads reported a lot of timeout errors while compactions were running:

2022/05/22 16:59:18 gocql: no response received from cassandra within timeout period
2022/05/22 16:59:20 Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM.

and failed to send load. The threads got stuck and didn't finish on time. One thread managed to finish.

Screenshot from 2022-05-23 17-21-21
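
For what it's worth, the two errors in that log are different failure modes, and plain gocql lets you tell them apart; a minimal sketch (illustrative only, not SCT/scylla-bench code):

```go
package example

import (
	"errors"
	"log"

	"github.com/gocql/gocql"
)

// classify tells apart the two timeouts seen in the log above:
// "no response received from cassandra within timeout period" is the driver
// giving up on the coordinator (client-side timeout), while "Operation timed
// out ... received only 1 responses from 2 CL=QUORUM" is the coordinator
// reporting a replica-side write timeout.
func classify(err error) {
	var writeTimeout *gocql.RequestErrWriteTimeout
	switch {
	case errors.Is(err, gocql.ErrTimeoutNoResponse):
		log.Println("client-side timeout: the driver got no reply in time")
	case errors.As(err, &writeTimeout):
		log.Printf("server-side write timeout: %d of %d replica acks at CL=%s",
			writeTimeout.Received, writeTimeout.BlockFor, writeTimeout.Consistency)
	default:
		log.Printf("other error: %v", err)
	}
}
```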

Installation details

Kernel Version: 5.13.0-1022-aws
Scylla version (or git commit hash): 2022.1~rc5-20220515.6a1e89fbb with build-id 5cecadda59974548befb4305363bf374631fc3e1
Cluster size: 5 nodes (i3.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0838dc54c055ad05a (aws: eu-north-1)

Test: longevity-twcs-3h-test
Test id: 4f585541-af89-4da9-bd54-edf413637a9b
Test name: enterprise-2022.1/longevity/longevity-twcs-3h-test
Test config file(s):

Logs:

Jenkins job URL

juliayakovlev commented 2 years ago

The issue is reproduced with @aleksbykov 's fix https://github.com/scylladb/scylla-cluster-tests/pull/4834

It happened after major_compaction nemesis:

Screenshot from 2022-06-01 09-32-50

Writes almost stopped and reads got more resources. Same error:

2022/05/31 12:33:58 gocql: no response received from cassandra within timeout period

Screenshot from 2022-06-01 09-33-45

The test reached its timeout, as the write s-b load got stuck and didn't finish.

Installation details

Kernel Version: 5.13.0-1022-aws
Scylla version (or git commit hash): 2022.1~rc5-20220515.6a1e89fbb with build-id 5cecadda59974548befb4305363bf374631fc3e1
Cluster size: 5 nodes (i3.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0838dc54c055ad05a (aws: eu-north-1)

Test: longevity-twcs-3h-test
Test id: 425600b6-f51b-4cf8-acc8-b2de85ce3118
Test name: enterprise-2022.1/longevity/longevity-twcs-3h-test
Test config file(s):

Logs:

Jenkins job URL

roydahan commented 2 years ago

@aleksbykov / @juliayakovlev maybe major compaction for TWCS isn't efficient? Search issues for that or ask someone.

aleksbykov commented 2 years ago

Major compaction has to compact sstables per time window, if for some reason an sstable contains several time windows. Maybe that is why s-b failed: major compaction took too many resources?

juliayakovlev commented 2 years ago

@roydahan @aleksbykov I suggest opening an issue in Scylla. In both runs the write load failed AFTER the major compaction nemesis. I didn't find a similar existing issue.

juliayakovlev commented 2 years ago

@roydahan @aleksbykov I have now found that the write load finished with status 0 during the window we thought was the problem:

< t:2022-05-22 18:51:57,373 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Time (avg):      2h49m59.979177235s
< t:2022-05-22 18:51:57,373 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Total ops:       433555346
< t:2022-05-22 18:51:57,374 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Total rows:      433555346
< t:2022-05-22 18:51:57,374 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Total errors:    7202
< t:2022-05-22 18:51:57,374 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Operations/s:    +Inf
< t:2022-05-22 18:51:57,374 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Rows/s:          +Inf
< t:2022-05-22 18:51:57,374 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > raw latency :
< t:2022-05-22 18:51:57,374 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   max:           45.197819903s 
< t:2022-05-22 18:51:57,374 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   99.9th:        21.397503ms 
< t:2022-05-22 18:51:57,374 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   99th:          5.636095ms 
< t:2022-05-22 18:51:57,374 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   95th:          4.227071ms 
< t:2022-05-22 18:51:57,374 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   90th:          3.637247ms 
< t:2022-05-22 18:51:57,374 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   median:        1.474559ms 
< t:2022-05-22 18:51:57,374 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   mean:          2.176019ms
< t:2022-05-22 18:51:57,374 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > c-o fixed latency :
< t:2022-05-22 18:51:57,375 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   max:           6m0.240381951s 
< t:2022-05-22 18:51:57,375 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   99.9th:        6m0.240381951s 
< t:2022-05-22 18:51:57,375 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   99th:          6m0.240381951s 
< t:2022-05-22 18:51:57,375 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   95th:          6m0.240381951s 
< t:2022-05-22 18:51:57,375 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   90th:          6m0.240381951s 
< t:2022-05-22 18:51:57,375 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   median:        6m0.240381951s 
< t:2022-05-22 18:51:57,375 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   mean:          18.359473709s

< t:2022-05-22 18:51:57,879 f:base.py         l:140  c:RemoteLibSSH2CmdRunner p:DEBUG > 2022/05/22 17:37:47 gocql: connection closed waiting for response
< t:2022-05-22 18:51:57,879 f:base.py         l:140  c:RemoteLibSSH2CmdRunner p:DEBUG > 2022/05/22 17:37:47 gocql: connection closed waiting for response
< t:2022-05-22 18:51:57,903 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "/$HOME/go/bin/scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=400 -clustering-row-count=10000000 -clustering-row-size=200 -concurrency=100 -rows-per-request=100 -start-timestamp=1653235307664157164 -connection-count 100 -max-rate 50000 --timeout 120s -duration=170m -error-at-row-limit 1000 -nodes 10.0.1.138" finished with status 0
< t:2022-05-22 18:51:57,905 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-05-22 18:51:57.903: (ScyllaBenchEvent Severity.NORMAL) period_type=end event_id=569bc757-1db6-4c0b-b0fc-61529e109cce duration=2h50m2s: node=Node longevity-twcs-3h-2022-1-loader-node-4f585541-1 [13.53.37.0 | 10.0.0.154] (seed: False)
< t:2022-05-22 18:51:57,905 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=400 -clustering-row-count=10000000 -clustering-row-size=200 -concurrency=100 -rows-per-request=100 -start-timestamp=SET_WRITE_TIMESTAMP -connection-count 100 -max-rate 50000 --timeout 120s -duration=170m -error-at-row-limit 1000

< t:2022-05-22 18:51:57,910 f:common.py       l:2241 c:utils                p:ERROR > Value write can't be converted to float. Exception: could not convert string to float: 'write'
< t:2022-05-22 18:51:57,916 f:common.py       l:2241 c:utils                p:ERROR > Value timeseries can't be converted to float. Exception: could not convert string to float: 'timeseries'
< t:2022-05-22 18:51:57,916 f:common.py       l:2241 c:utils                p:ERROR > Value quorum can't be converted to float. Exception: could not convert string to float: 'quorum'
< t:2022-05-22 18:51:57,917 f:common.py       l:2241 c:utils                p:ERROR > Value Fixed(200) can't be converted to float. Exception: could not convert string to float: 'Fixed(200)'
< t:2022-05-22 18:51:57,917 f:common.py       l:2241 c:utils                p:ERROR > Value true can't be converted to float. Exception: could not convert string to float: 'true'
< t:2022-05-22 18:51:57,918 f:scylla_bench_thread.py l:269  c:sdcm.scylla_bench_thread p:DEBUG > unknown result key found: `Start timestamp` with value `1653235307664157164`
< t:2022-05-22 18:51:57,918 f:scylla_bench_thread.py l:269  c:sdcm.scylla_bench_thread p:DEBUG > unknown result key found: `Write rate` with value `125`
< t:2022-05-22 18:51:57,918 f:scylla_bench_thread.py l:269  c:sdcm.scylla_bench_thread p:DEBUG > unknown result key found: `Hdr memory consumption` with value `49179200 bytes`
< t:2022-05-22 18:51:57,927 f:scylla_bench_thread.py l:269  c:sdcm.scylla_bench_thread p:DEBUG > unknown result key found: `Total errors` with value `7202`
< t:2022-05-22 18:51:57,927 f:scylla_bench_thread.py l:269  c:sdcm.scylla_bench_thread p:DEBUG > unknown result key found: `raw latency` with value ``
< t:2022-05-22 18:51:57,927 f:scylla_bench_thread.py l:269  c:sdcm.scylla_bench_thread p:DEBUG > unknown result key found: `mean` with value `2.176019ms`
< t:2022-05-22 18:51:57,928 f:scylla_bench_thread.py l:137  c:sdcm.scylla_bench_thread p:DEBUG > Wait for stress threads results

Screenshot from 2022-06-01 18-11-17

Continuing to check.

yarongilor commented 2 years ago

I think we may be missing some information, or the issue is unclear: what is the expected performance of scylla-bench? Is the timeout value correct? Is the go-driver timeout value correct?

roydahan commented 2 years ago

@aleksbykov you worked on the TWCS longevity and its command; please work with @juliayakovlev to check whether there is an issue with our s-b commands.

juliayakovlev commented 2 years ago

Reproducer of the TWCS 3h test with @aleksbykov's recommendations:

db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/ad6a5407-459d-4141-904c-ddbaf51f70f1/20220607_113342/db-cluster-ad6a5407.tar.gz

sct - https://cloudius-jenkins-test.s3.amazonaws.com/ad6a5407-459d-4141-904c-ddbaf51f70f1/20220607_113342/sct-runner-ad6a5407.tar.gz

Job: https://jenkins.scylladb.com/job/scylla-staging/job/yulia/job/repr-longevity-twcs-3h-test/3/

I see that after the write load finished, 2 read threads continued to run, but there was no read load.

Screenshot from 2022-06-08 11-29-53

No read load:
Screenshot from 2022-06-08 11-34-46

A lot of read failures:
Screenshot from 2022-06-08 11-35-50

roydahan commented 2 years ago

@juliayakovlev I don't understand this part:

I see that after the write load finished, 2 read threads continued to run, but there was no read load.

How do you see that the 2 read threads continue to run?

juliayakovlev commented 2 years ago

@juliayakovlev I don't understand this part:

I see that after the write load finished, 2 read threads continued to run, but there was no read load.

How do you see that the 2 read threads continue to run?

@roydahan What I see is that gocql timeouts are reported and the final stress events were published later:

Screenshot from 2022-06-13 07-53-30

roydahan commented 3 days ago

Not sure if this is still relevant and if it's a tool issue or driver issue.

fruch commented 3 days ago

Not sure if this is still relevant and if it's a tool issue or driver issue.

It's from before we had a retries configuration in s-b.

There is nothing to do with this issue anymore.

Closing it for now
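
For context, a rough sketch of the kind of driver-level retry configuration that change enables, assuming gocql's stock retry policies (the exact scylla-bench flags aren't shown in this thread, and the helper below is hypothetical):

```go
package example

import (
	"time"

	"github.com/gocql/gocql"
)

// newClusterWithRetries is a hypothetical helper: with a retry policy set,
// transient "no response received from cassandra within timeout period"
// errors can be retried with backoff instead of immediately failing the run.
func newClusterWithRetries(host string) *gocql.ClusterConfig {
	cluster := gocql.NewCluster(host)
	cluster.RetryPolicy = &gocql.ExponentialBackoffRetryPolicy{
		NumRetries: 5,                      // attempts per failed request
		Min:        100 * time.Millisecond, // initial backoff
		Max:        10 * time.Second,       // cap between attempts
	}
	return cluster
}
```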