scylladb / scylla-bench


panic: runtime error: invalid memory address or nil pointer dereference #134

Closed: fruch closed this issue 7 months ago

fruch commented 7 months ago

Issue description

During a large-partition test, scylla-bench (s-b) crashes on loader-1 and loader-5 with a panic like the following:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=0x68666c]

goroutine 664 [running]:
github.com/gocql/gocql.(*scyllaConnPicker).Pick(0xc0000f41e0, {0x7ebbf8?, 0xc0005cf278?}, {0x0, 0x0}, {0x0, 0x0})
    /go/pkg/mod/github.com/scylladb/gocql@v1.12.1-0.20240116102025-f614a51d47c7/scylla.go:373 +0xac
github.com/gocql/gocql.(*hostConnPool).Pick(0xc00023e000, {0x7ebbf8, 0xc0005cf278}, {0x0, 0x0}, {0x0, 0x0})
    /go/pkg/mod/github.com/scylladb/gocql@v1.12.1-0.20240116102025-f614a51d47c7/connectionpool.go:339 +0x199
github.com/gocql/gocql.(*queryExecutor).do(0xc00000e258, {0x7ec328, 0xc000018080}, {0x7edf40?, 0xc001e2ba40?}, 0xc000103040)
    /go/pkg/mod/github.com/scylladb/gocql@v1.12.1-0.20240116102025-f614a51d47c7/query_executor.go:128 +0x1d2
github.com/gocql/gocql.(*queryExecutor).executeQuery(0xc00000e258, {0x7edf40, 0xc001e2ba40})
    /go/pkg/mod/github.com/scylladb/gocql@v1.12.1-0.20240116102025-f614a51d47c7/query_executor.go:72 +0x110
github.com/gocql/gocql.(*Session).executeBatch(0xc000158000, 0xc001e2ba40)
    /go/pkg/mod/github.com/scylladb/gocql@v1.12.1-0.20240116102025-f614a51d47c7/session.go:785 +0x95
github.com/gocql/gocql.(*Session).ExecuteBatch(0x77480a?, 0x4e?)
    /go/pkg/mod/github.com/scylladb/gocql@v1.12.1-0.20240116102025-f614a51d47c7/session.go:796 +0x19
main.DoBatchedWrites.func1(0xc0005140c0)
    /go/scylla-bench-0.1.19/modes.go:429 +0x5d4
main.RunTest(0xc0005140c0, {0x7ecb98, 0xc0000d2210}, {0x7eb928, 0x9cf530}, 0xc0011a7eb0)
    /go/scylla-bench-0.1.19/modes.go:156 +0x1ac
main.DoBatchedWrites(0xc000158000, 0x0?, {0x7ecb98, 0xc0000d2210}, {0x7eb928, 0x9cf530})
    /go/scylla-bench-0.1.19/modes.go:396 +0x145
main.main.func4(0x0?, 0x0?, {0x7eb928, 0x9cf530})
    /go/scylla-bench-0.1.19/main.go:627 +0x105
main.RunConcurrently.func1(0x5)
    /go/scylla-bench-0.1.19/modes.go:85 +0x66
created by main.RunConcurrently
    /go/scylla-bench-0.1.19/modes.go:83 +0x19c
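
For context on this class of crash: the fault address (addr=0x78) is a small offset, which typically means a field was read through a nil pointer, and the zero-value arguments in the `Pick` frame point the same way. Below is a minimal, self-contained Go sketch of that mechanism, using hypothetical `picker`/`conn` types; it is not the actual gocql code:

```go
package main

import "fmt"

// Hypothetical stand-ins for the driver's picker and connection types.
type conn struct {
	pad     [0x70]byte // filler so the field below sits at a small offset
	streams int        // reading this through a nil *conn faults near addr=0x70
}

type picker struct {
	conns []*conn
}

// Pick mimics the failing call shape: if a topology change left a nil
// *conn in the pool, touching any field of it panics with
// "invalid memory address or nil pointer dereference".
func (p *picker) Pick() *conn {
	c := p.conns[0] // may be nil if the host was just removed
	if c.streams > 0 { // nil dereference happens here
		return c
	}
	return nil
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	p := &picker{conns: []*conn{nil}}
	p.Pick()
}
```

A nil pointer dereference like this is a recoverable panic, which is why the sketch can catch it; the crash in scylla-bench is unrecovered, so it kills the whole process.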

Impact

This kills the SCT test in the middle of the run.

How frequently does it reproduce?

It probably happens more often than reported; this is the first time we noticed it clearly, thanks to the new critical event raised for it.

Installation details

Kernel Version: 5.15.0-1048-gcp
Scylla version (or git commit hash): 5.5.0~dev-20240122.a48881801a74 with build-id f99679339b62abb10375a7570cd2d2f451430ce9

Cluster size: 5 nodes (n2-highmem-16)

Scylla Nodes used in this run:

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/scylla-5-5-0-dev-x86-64-2024-01-23t02-22-32 (gce: undefined_region)

Test: longevity-large-partition-200k-pks-4days-gce-test
Test id: f580b12f-1e90-4fff-899c-a40689aef17a
Test name: scylla-staging/fruch/longevity-large-partition-200k-pks-4days-gce-test
Test config file(s):

Logs and commands:
- Restore Monitor Stack command: `$ hydra investigate show-monitor f580b12f-1e90-4fff-899c-a40689aef17a`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=f580b12f-1e90-4fff-899c-a40689aef17a)
- Show all stored logs command: `$ hydra investigate show-logs f580b12f-1e90-4fff-899c-a40689aef17a`

Logs:
- **db-cluster-f580b12f.tar.gz**: https://cloudius-jenkins-test.s3.amazonaws.com/f580b12f-1e90-4fff-899c-a40689aef17a/20240123_192228/db-cluster-f580b12f.tar.gz
- **sct-runner-events-f580b12f.tar.gz**: https://cloudius-jenkins-test.s3.amazonaws.com/f580b12f-1e90-4fff-899c-a40689aef17a/20240123_192228/sct-runner-events-f580b12f.tar.gz
- **sct-f580b12f.log.tar.gz**: https://cloudius-jenkins-test.s3.amazonaws.com/f580b12f-1e90-4fff-899c-a40689aef17a/20240123_192228/sct-f580b12f.log.tar.gz
- **loader-set-f580b12f.tar.gz**: https://cloudius-jenkins-test.s3.amazonaws.com/f580b12f-1e90-4fff-899c-a40689aef17a/20240123_192228/loader-set-f580b12f.tar.gz
- **monitor-set-f580b12f.tar.gz**: https://cloudius-jenkins-test.s3.amazonaws.com/f580b12f-1e90-4fff-899c-a40689aef17a/20240123_192228/monitor-set-f580b12f.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/fruch/job/longevity-large-partition-200k-pks-4days-gce-test/6/)
[Argus](https://argus.scylladb.com/test/7539b0e9-7dfd-4ad7-ab9d-a4cde2807077/runs?additionalRuns[]=f580b12f-1e90-4fff-899c-a40689aef17a)
fruch commented 7 months ago

@vponomaryov

have you seen this kind of panic?

fruch commented 7 months ago

it doesn't look like a memory leak: [memory usage graph]

one loader got to 75% memory usage and the second to 40%

vponomaryov commented 7 months ago

> @vponomaryov
>
> have you seen this kind of panic?

With exactly this trace, no. But we already have 2 other open bugs with panics:

So scylla-bench has plenty of places with unsafe code that lead to these invalid memory address / nil pointer dereference errors. We need a Go expert here to fix this kind of issue.
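
To make the point concrete, the usual hardening for such call sites is an explicit nil/empty check with an error return instead of an unguarded dereference. A minimal sketch with hypothetical types, not actual scylla-bench or gocql code:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical connection pool used only to illustrate the guard pattern.
type conn struct{ busy int }

type pool struct{ conns []*conn }

var errNoConn = errors.New("no usable connection in pool")

// pickSafe returns an error instead of dereferencing a nil entry, which is
// what an unguarded picker does when a topology change empties the pool.
func pickSafe(p *pool) (*conn, error) {
	if p == nil || len(p.conns) == 0 {
		return nil, errNoConn
	}
	for _, c := range p.conns {
		if c != nil { // skip entries torn down concurrently
			return c, nil
		}
	}
	return nil, errNoConn
}

func main() {
	if _, err := pickSafe(&pool{conns: []*conn{nil}}); err != nil {
		fmt.Println("pick failed:", err) // graceful error instead of a panic
	}
}
```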

piodul commented 7 months ago

This looks more like a gocql issue than a scylla-bench issue.

cc: @avelanarius @sylwiaszunejko

vponomaryov commented 7 months ago

Got it reproduced using scylla-bench v0.1.19. Considering that the main difference between v0.1.18 and v0.1.19 is the gocql driver version change, the bug was probably introduced somewhere between the driver versions used by those two scylla-bench releases.

So I assume @piodul is right. @fruch, we may want to switch SCT back to v0.1.18 while this bug is not fixed, because I expect it to happen pretty often.

Installation details

Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 2024.1.0~rc4-20240117.0ba0261c79ef with build-id 05fd1aae7596802712754674280a0e6456758c11

Cluster size: 5 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-083720a5da3ad2cb3 (aws: undefined_region)

Test: vp-longevity-large-partition-asymmetric-cluster-3h-test
Test id: b7afdfc5-db92-4805-bc06-89a81f9341a1
Test name: scylla-staging/valerii/vp-longevity-large-partition-asymmetric-cluster-3h-test
Test config file(s):

Logs and commands:
- Restore Monitor Stack command: `$ hydra investigate show-monitor b7afdfc5-db92-4805-bc06-89a81f9341a1`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=b7afdfc5-db92-4805-bc06-89a81f9341a1)
- Show all stored logs command: `$ hydra investigate show-logs b7afdfc5-db92-4805-bc06-89a81f9341a1`

Logs: *No logs captured during this run.*

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/valerii/job/vp-longevity-large-partition-asymmetric-cluster-3h-test/2/)
[Argus](https://argus.scylladb.com/test/116f89d3-7b3a-4cae-b8bc-5a5cb8306acf/runs?additionalRuns[]=b7afdfc5-db92-4805-bc06-89a81f9341a1)
fruch commented 7 months ago

@vponomaryov

Yeah, it looks like a topology change can easily trigger this crash.

We'll start with the revert in SCT.

@roydahan @bhalevy, FYI, since it's the gocql driver with tablet support that seems to be causing a regression.

vponomaryov commented 7 months ago

> @vponomaryov
>
> Yeah, it looks like a topology change can easily trigger this crash.
>
> We'll start with the revert in SCT.

avelanarius commented 7 months ago

@sylwiaszunejko will look into this issue next week (after her PTO).

sylwiaszunejko commented 7 months ago

@fruch Could you recommend the easiest way to reproduce it locally?

> Yeah, it looks like a topology change can easily trigger this crash.

I tried to add/remove a node while running scylla-bench on a cluster, but it did not trigger the crash.

fruch commented 7 months ago

> @fruch Could you recommend the easiest way to reproduce it locally?

This case runs multiple commands from 5 different machines, so I would recommend similar commands. It also runs five i4i.2xlarge instances as DB nodes, so I would start the local nodes with a higher smp.

What were you using to run the local nodes?

> Yeah, it looks like a topology change can easily trigger this crash.
>
> I tried to add/remove a node while running scylla-bench on a cluster, but it did not trigger the crash.

The specific case where I've seen it is: https://github.com/scylladb/scylla-cluster-tests/blob/c815a5185ce4cc4c066cf35aa7a86b1bb8bb9b55/sdcm/nemesis.py#L1501

This is node replacement by host_id: https://opensource.docs.scylladb.com/stable/operating-scylla/procedures/cluster-management/replace-dead-node.html

vponomaryov commented 7 months ago

> @fruch Could you recommend the easiest way to reproduce it locally?
>
> Yeah, it looks like a topology change can easily trigger this crash.
>
> I tried to add/remove a node while running scylla-bench on a cluster, but it did not trigger the crash.

In the test run I mentioned above (https://github.com/scylladb/scylla-bench/issues/134#issuecomment-1908653910), scylla-bench failed right after the scylla stop operation on one of the DB nodes.

Try to increase the load.

sylwiaszunejko commented 7 months ago

@fruch I tried to run commands like this: `scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=25 -clustering-row-count=10000 -partition-offset=401 -clustering-row-size=uniform:10..1024 -concurrency=10 -connection-count=10 -consistency-level=quorum -rows-per-request=10 -timeout=30s -retry-number=30 -retry-interval=80ms,1s -iterations 0 -duration=540m`

> What were you using to run the local nodes?

I just have a script to run nodes locally using `../build/dev/scylla` with additional options.

fruch commented 7 months ago

> @fruch I tried to run commands like this: `scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=25 -clustering-row-count=10000 -partition-offset=401 -clustering-row-size=uniform:10..1024 -concurrency=10 -connection-count=10 -consistency-level=quorum -rows-per-request=10 -timeout=30s -retry-number=30 -retry-interval=80ms,1s -iterations 0 -duration=540m`
>
> What were you using to run the local nodes?
>
> I just have a script to run nodes locally using `../build/dev/scylla` with additional options.

We know we can reproduce it in SCT runs; we don't have any more data on how to reproduce it locally.

sylwiaszunejko commented 7 months ago

@fruch @vponomaryov I managed to execute an SCT run, but it failed in a different place than it should have. I am not sure if the issue just does not reproduce every time or if I set some parameters wrong (this is my first time using SCT). If you could take a look, that would be helpful; this is my run: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/sylwia.szunejko/job/longevity-large-partition-200k-pks-4days-gce-test/2/

vponomaryov commented 7 months ago

@sylwiaszunejko

About build #3 of your CI job:

You have the following failure with scylla-bench:

11:20:10  < t:2024-02-02 09:20:09,002 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > errors:
11:20:10  < t:2024-02-02 09:20:09,002 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
11:20:10  < t:2024-02-02 09:20:09,002 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stress command completed with bad status 2: fatal error: concurrent map iteration and map write
11:20:10  < t:2024-02-02 09:20:09,002 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
11:20:10  < t:2024-02-02 09:20:09,002 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > goroutine 560 [running]:
11:20:10  < t:2024-02-02 09:20:09,002 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > runtime.throw({0x76b4d

It looks like your scylla-bench image breaks in the majority of the stress commands, so the resulting load is not serious:

[screenshot: load graph, 2024-02-02]

The same is true for build #2.

As a result, I am not sure a test run with such a small load will repro the scylla-bench bug...

sylwiaszunejko commented 7 months ago

@vponomaryov Do you have any idea why it breaks? The only difference is the gocql version used, and in that gocql version I only added a few prints.

vponomaryov commented 7 months ago

> @vponomaryov Do you have any idea why it breaks? The only difference is the gocql version used, and in that gocql version I only added a few prints.

Judging by the error `fatal error: concurrent map iteration and map write`, I assume there is some unsafe code in the driver version you use.
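
For context, `fatal error: concurrent map iteration and map write` is the Go runtime's built-in detection for racing map access, and unlike a panic it cannot be recovered, so it always kills the process. A minimal, self-contained reproduction (illustrative only, not the driver code), which typically crashes almost immediately:

```go
package main

import "sync"

// One goroutine ranges over a map while another writes to it concurrently.
// The runtime detects this and aborts with
// "fatal error: concurrent map iteration and map write".
func main() {
	m := make(map[int]int)
	var wg sync.WaitGroup
	wg.Add(2)

	go func() { // writer
		defer wg.Done()
		for i := 0; i < 1_000_000; i++ {
			m[i] = i
		}
	}()

	go func() { // concurrent iterator
		defer wg.Done()
		for j := 0; j < 1_000; j++ {
			for range m {
			}
		}
	}()

	wg.Wait()
}
```

The conventional fix is to serialize access with a `sync.RWMutex` (writers take `Lock` around the store, iterating readers take `RLock` around the range loop), or to use `sync.Map` for read-mostly workloads.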

I guess the proper way to repro would be to keep the DB cluster, the loader nodes, and the monitoring node alive, then trigger scylla-bench commands manually to load the cluster to at least 50%, preferably more. Why keep them? Because even if your CI run catches the bug, how will it differ from the already available test run results where it was reproduced?

But I think the comment here is worth fixing even before trying to repro the problem: https://github.com/scylladb/gocql/pull/137#pullrequestreview-1858845206

sylwiaszunejko commented 7 months ago

I hoped to see more detailed logs, or that the backtrace would differ, but you are right; I will fix the comment first.

fruch commented 7 months ago

@sylwiaszunejko @avelanarius

are we sure it's getting fixed by https://github.com/scylladb/gocql/pull/158?

sylwiaszunejko commented 7 months ago

> @sylwiaszunejko @avelanarius
>
> are we sure it's getting fixed by scylladb/gocql#158?

I think so, we just need to bump the gocql version in scylla-bench again.
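
For reference, the module paths in the crash trace (imports of `github.com/gocql/gocql` resolved to `github.com/scylladb/gocql@v1.12.1-0.20240116102025-f614a51d47c7`) show the fork is wired in through a `go.mod` replace directive, so the bump amounts to pointing that directive at the release containing the scylladb/gocql#158 fix. A sketch of the relevant fragment; the version shown is the old, broken one from the trace, not the fixed tag:

```
// go.mod fragment (sketch): bump the replace target to the fixed driver release.
replace github.com/gocql/gocql => github.com/scylladb/gocql v1.12.1-0.20240116102025-f614a51d47c7
```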

fruch commented 7 months ago

> @sylwiaszunejko @avelanarius are we sure it's getting fixed by scylladb/gocql#158?
>
> I think so, we just need to bump the gocql version in scylla-bench again.

and release a new version (and build an image for it). @vponomaryov, can you help with that?

sylwiaszunejko commented 7 months ago

I created a PR bumping the version: https://github.com/scylladb/scylla-bench/pull/136

vponomaryov commented 7 months ago

> @sylwiaszunejko @avelanarius are we sure it's getting fixed by scylladb/gocql#158?
>
> I think so, we just need to bump the gocql version in scylla-bench again.
>
> and release a new version (and build an image for it). @vponomaryov, can you help with that?

Tag: https://github.com/scylladb/scylla-bench/releases/tag/v0.1.20
Image: https://hub.docker.com/layers/scylladb/hydra-loaders/scylla-bench-v0.1.20/images/sha256-002c1bdadfa3df8b2f0b7d8e0bc10be5fa318bd0932f712b9ea33b056db151fb?context=explore

sylwiaszunejko commented 7 months ago

Created a PR to bump the scylla-bench version in SCT: https://github.com/scylladb/scylla-cluster-tests/pull/7184