Closed: fruch closed this issue 9 months ago
@vponomaryov have you seen this kind of panic? It doesn't look like a memory leak: one loader got to 75% usage and the second to 40%.
Exactly with this trace, no. But we already have 2 other bugs with panics:
So, scylla-bench has plenty of places with unsafe coding that lead to invalid memory address or nil pointer dereference errors.
We need a Go expert here to fix this kind of issue.
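To make the failure mode concrete, here is a minimal, hypothetical Go sketch (not actual scylla-bench code) of the most common source of that panic: using a result before checking the returned error, which leaves a nil pointer behind.

```go
package main

import (
	"errors"
	"fmt"
)

type session struct{ addr string }

// connect is a stand-in for a driver call that can fail.
func connect(addr string) (*session, error) {
	if addr == "" {
		return nil, errors.New("no address")
	}
	return &session{addr: addr}, nil
}

func main() {
	s, err := connect("") // fails, so s is nil
	// Unsafe: touching s.addr before checking err would panic with
	// "invalid memory address or nil pointer dereference".
	if err != nil { // safe: check the error before using s
		fmt.Println("connect failed:", err)
		return
	}
	fmt.Println("connected to", s.addr)
}
```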
This looks more like a gocql issue than a scylla-bench issue.
cc: @avelanarius @sylwiaszunejko
Got it reproduced using scylla-bench v0.1.19. Considering that the main diff between v0.1.18 and v0.1.19 is the gocql driver version change, the bug was probably introduced somewhere between the driver versions used in those two scylla-bench releases. So, I assume that @piodul is right.
@fruch We may want to switch SCT back to v0.1.18 until this bug is fixed, because I expect it to happen pretty often.
Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 2024.1.0~rc4-20240117.0ba0261c79ef with build-id 05fd1aae7596802712754674280a0e6456758c11
Cluster size: 5 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-083720a5da3ad2cb3 (aws: undefined_region)
Test: vp-longevity-large-partition-asymmetric-cluster-3h-test
Test id: b7afdfc5-db92-4805-bc06-89a81f9341a1
Test name: scylla-staging/valerii/vp-longevity-large-partition-asymmetric-cluster-3h-test
Test config file(s):
@vponomaryov Yeah, it looks like a topology change can easily trigger this crash. I'll start with doing the revert in SCT.
@roydahan @bhalevy, FYI, since it is the gocql driver with tablet support that seems to be causing the regression.
@sylwiaszunejko will look into this issue next week (after her PTO).
@fruch Could you recommend the easiest way to reproduce it locally? You said a topology change can easily trigger this crash, but I tried to add/remove a node while scylla-bench was running against the cluster and it did not trigger it.
This case runs multiple commands from 5 different machines, so I would recommend similar commands. Also, it runs five i4i.2xlarge instances as DB nodes, so I would start the local nodes with higher smp. What were you using to run the local nodes?
The specific case where I've seen it is this nemesis: https://github.com/scylladb/scylla-cluster-tests/blob/c815a5185ce4cc4c066cf35aa7a86b1bb8bb9b55/sdcm/nemesis.py#L1501
It is node replacement by host_id: https://opensource.docs.scylladb.com/stable/operating-scylla/procedures/cluster-management/replace-dead-node.html
In the test run I mentioned above (https://github.com/scylladb/scylla-bench/issues/134#issuecomment-1908653910), scylla-bench failed right after the scylla stop operation on one of the DB nodes. Try to increase the load.
@fruch I tried to run commands like this:
scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=25 -clustering-row-count=10000 -partition-offset=401 -clustering-row-size=uniform:10..1024 -concurrency=10 -connection-count=10 -consistency-level=quorum -rows-per-request=10 -timeout=30s -retry-number=30 -retry-interval=80ms,1s -iterations 0 -duration=540m
As for running the local nodes, I just have a script that runs them using ../build/dev/scylla with additional options.
We know we can reproduce it in SCT runs; we don't have any more data on how to reproduce it locally.
@fruch @vponomaryov I managed to execute an SCT run, but it failed in a different place than it should have. I am not sure if the issue just does not reproduce every time or if I set some parameters wrong (this is my first time using SCT). If you could take a look, that would be helpful. This is my run: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/sylwia.szunejko/job/longevity-large-partition-200k-pks-4days-gce-test/2/
@sylwiaszunejko About build#3 of your CI job: you have the following failure with scylla-bench:
11:20:10 < t:2024-02-02 09:20:09,002 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > errors:
11:20:10 < t:2024-02-02 09:20:09,002 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO >
11:20:10 < t:2024-02-02 09:20:09,002 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > Stress command completed with bad status 2: fatal error: concurrent map iteration and map write
11:20:10 < t:2024-02-02 09:20:09,002 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO >
11:20:10 < t:2024-02-02 09:20:09,002 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > goroutine 560 [running]:
11:20:10 < t:2024-02-02 09:20:09,002 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > runtime.throw({0x76b4d
Looks like your scylla-bench image breaks in the majority of stress commands, so the resulting load is not serious:
The same is true for build#2.
As a result, I am not sure a test run with such a small load will repro the scylla-bench bug...
@vponomaryov Do you have any idea why it breaks? The only difference is the gocql version used, and in that gocql version I only added a few prints.
According to the error fatal error: concurrent map iteration and map write, I assume there is some unsafe coding in the driver version you use.
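For context, here is a minimal Go sketch (hypothetical, not gocql code) of this failure class: iterating a plain map from one goroutine while another writes to it triggers exactly this runtime fatal error. The usual fix is to guard the map with a sync.RWMutex and iterate over a private copy.

```go
package main

import "sync"

type hostList struct {
	mu    sync.RWMutex
	hosts map[string]string
}

// add writes under the write lock.
func (h *hostList) add(id, addr string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.hosts[id] = addr
}

// snapshot copies the map under the read lock, so callers can
// iterate the copy without racing concurrent writers.
func (h *hostList) snapshot() map[string]string {
	h.mu.RLock()
	defer h.mu.RUnlock()
	out := make(map[string]string, len(h.hosts))
	for id, addr := range h.hosts {
		out[id] = addr
	}
	return out
}

func main() {
	h := &hostList{hosts: make(map[string]string)}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			h.add("host-1", "10.0.0.1")
			// Ranging over h.hosts directly here, without the lock,
			// could crash with "concurrent map iteration and map write".
			for range h.snapshot() { // safe: iterates a private copy
			}
		}()
	}
	wg.Wait()
}
```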
I guess the proper way to repro would be to keep the DB cluster, the loader nodes, and the monitoring node, and then trigger scylla-bench commands manually to load the cluster to at least 50%, preferably more. Why keep them? Because even if your CI run catches the bug, how will it differ from the already available test run results where it was reproduced?
But I think the comment here is worth fixing even before trying to repro the problem: https://github.com/scylladb/gocql/pull/137#pullrequestreview-1858845206
I hoped to see more detailed logs or that the backtrace would differ, but you are right; I will fix the comment first.
@sylwiaszunejko @avelanarius are we sure it's getting fixed by https://github.com/scylladb/gocql/pull/158?
I think so; we just need to bump the gocql version in scylla-bench again.
And release a new version (and build an image for it). @vponomaryov, can you help with that?
I created a PR bumping the version: https://github.com/scylladb/scylla-bench/pull/136
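For reference, the bump itself is typically a one-line go.mod change. This is only a sketch with hypothetical version numbers, assuming scylla-bench's layout of pulling the ScyllaDB gocql fork through a replace directive:

```
module github.com/scylladb/scylla-bench

go 1.21

require github.com/gocql/gocql v1.7.3 // hypothetical version

// Point the upstream import path at the ScyllaDB fork that carries
// the fix; the version here is illustrative, not the actual one.
replace github.com/gocql/gocql => github.com/scylladb/gocql v1.12.0
```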
Tag: https://github.com/scylladb/scylla-bench/releases/tag/v0.1.20
Image: https://hub.docker.com/layers/scylladb/hydra-loaders/scylla-bench-v0.1.20/images/sha256-002c1bdadfa3df8b2f0b7d8e0bc10be5fa318bd0932f712b9ea33b056db151fb?context=explore
Created a PR to bump the scylla-bench version in SCT: https://github.com/scylladb/scylla-cluster-tests/pull/7184
Issue description
During the large partition test, s-b is crashing on loader-1 and loader-5, like the following:
Impact
This kills the SCT test in the middle.
How frequently does it reproduce?
It probably happens more often; this is the first time we noticed it clearly, thanks to the new critical event for it.
Installation details
Kernel Version: 5.15.0-1048-gcp
Scylla version (or git commit hash): 5.5.0~dev-20240122.a48881801a74 with build-id f99679339b62abb10375a7570cd2d2f451430ce9
Cluster size: 5 nodes (n2-highmem-16)
Scylla Nodes used in this run:
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/scylla-5-5-0-dev-x86-64-2024-01-23t02-22-32 (gce: undefined_region)
Test: longevity-large-partition-200k-pks-4days-gce-test
Test id: f580b12f-1e90-4fff-899c-a40689aef17a
Test name: scylla-staging/fruch/longevity-large-partition-200k-pks-4days-gce-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor f580b12f-1e90-4fff-899c-a40689aef17a`
- Restore monitor on AWS instance using Jenkins job: https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=f580b12f-1e90-4fff-899c-a40689aef17a
- Show all stored logs command: `$ hydra investigate show-logs f580b12f-1e90-4fff-899c-a40689aef17a`

Logs:
- db-cluster-f580b12f.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f580b12f-1e90-4fff-899c-a40689aef17a/20240123_192228/db-cluster-f580b12f.tar.gz
- sct-runner-events-f580b12f.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f580b12f-1e90-4fff-899c-a40689aef17a/20240123_192228/sct-runner-events-f580b12f.tar.gz
- sct-f580b12f.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f580b12f-1e90-4fff-899c-a40689aef17a/20240123_192228/sct-f580b12f.log.tar.gz
- loader-set-f580b12f.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f580b12f-1e90-4fff-899c-a40689aef17a/20240123_192228/loader-set-f580b12f.tar.gz
- monitor-set-f580b12f.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f580b12f-1e90-4fff-899c-a40689aef17a/20240123_192228/monitor-set-f580b12f.tar.gz

Jenkins job URL: https://jenkins.scylladb.com/job/scylla-staging/job/fruch/job/longevity-large-partition-200k-pks-4days-gce-test/6/
Argus: https://argus.scylladb.com/test/7539b0e9-7dfd-4ad7-ab9d-a4cde2807077/runs?additionalRuns[]=f580b12f-1e90-4fff-899c-a40689aef17a