AFAIK 4.6.rc2 includes Tomek's patches fixing the issues we had in the cache with index caching (fc312b3021aab6938dcf4416998343b38cd5e765), so this is a new issue.
This is blocking 4.6 from going out
Already backported to all vulnerable branches, removing "Backport candidate" label.
Installation details
Kernel version: 5.4.0-1035-aws
Scylla version (or git commit hash): 4.6.rc2-0.20220102.e8a1cfb6f with build-id 5d7b96e39c909424e8224207a162fc2c82b67214
Cluster size: 6 nodes (i3.4xlarge)
Scylla running with shards number (live nodes):
longevity-cdc-3d-400gb-4-6-db-node-760d9ed7-1 (54.229.69.26 | 10.0.2.7): 14 shards
longevity-cdc-3d-400gb-4-6-db-node-760d9ed7-2 (34.249.197.95 | 10.0.3.238): 14 shards
longevity-cdc-3d-400gb-4-6-db-node-760d9ed7-3 (54.229.235.1 | 10.0.0.222): 14 shards
longevity-cdc-3d-400gb-4-6-db-node-760d9ed7-4 (54.195.157.71 | 10.0.2.27): 14 shards
longevity-cdc-3d-400gb-4-6-db-node-760d9ed7-5 (54.194.152.213 | 10.0.0.131): 14 shards
longevity-cdc-3d-400gb-4-6-db-node-760d9ed7-6 (52.49.242.12 | 10.0.0.95): 14 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-07d92096e7ab05aae (aws: eu-west-1)
Test: longevity-cdc-3d-400gb-test
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Issue description

During job cdc-3d, the StartStopScrubCompaction nemesis ran on node 6. The nemesis starts a scrub and then stops it; before starting the scrub, it waits for all previously started compactions to finish. The cluster has 4 tables with the CDC feature enabled, with various option combinations: delta, preimage + delta, postimage + delta, and preimage + delta + postimage. Compactions were running the whole time the nemesis was waiting, but right after that a coredump was triggered on node 4 (the nemesis flow is sketched below, followed by the coredump details).
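For reference, the nemesis behavior described above boils down to roughly the following. This is a minimal Python sketch, not the actual SCT implementation; the `node.run_nodetool()` helper, the `compactionstats` output parsing, and the exact `stop SCRUB` invocation are all assumptions:

```python
import threading
import time


def start_stop_scrub_compaction(node, keyspace="keyspace1"):
    """Sketch of the StartStopScrubCompaction nemesis flow (assumed API)."""
    # Wait until all previously started compactions have finished;
    # the "pending tasks: 0" check assumes the nodetool output format.
    while "pending tasks: 0" not in node.run_nodetool("compactionstats"):
        time.sleep(30)

    # Start the scrub in the background, since `nodetool scrub` blocks
    # until the scrub completes.
    scrub = threading.Thread(target=node.run_nodetool,
                             args=(f"scrub {keyspace}",))
    scrub.start()

    # Give the scrub a moment to start, then abort it; `stop SCRUB`
    # mirrors `nodetool stop <compaction type>` and is an assumption
    # about the exact command the nemesis issues.
    time.sleep(10)
    node.run_nodetool("stop SCRUB")
    scrub.join()
```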
Coredump happened on node 4:
Decoded backtrace:
After that coredump, Scylla was restarted; it came back up and the node status was UN.
But then the following info messages appeared:
And after that, Scylla began shutting down:
And finally went down with this error:
Restore Monitor Stack command:
$ hydra investigate show-monitor 760d9ed7-10f3-4912-ba85-fbc3918db0ab
Restore monitor on AWS instance using Jenkins job
Show all stored logs command:
$ hydra investigate show-logs 760d9ed7-10f3-4912-ba85-fbc3918db0ab
Test id: 760d9ed7-10f3-4912-ba85-fbc3918db0ab
Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/760d9ed7-10f3-4912-ba85-fbc3918db0ab/20220113_155406/grafana-screenshot-longevity-cdc-3d-400gb-test-scylla-per-server-metrics-nemesis-20220113_155640-longevity-cdc-3d-400gb-4-6-monitor-node-760d9ed7-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/760d9ed7-10f3-4912-ba85-fbc3918db0ab/20220113_155406/grafana-screenshot-overview-20220113_155406-longevity-cdc-3d-400gb-4-6-monitor-node-760d9ed7-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/760d9ed7-10f3-4912-ba85-fbc3918db0ab/20220113_160435/db-cluster-760d9ed7.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/760d9ed7-10f3-4912-ba85-fbc3918db0ab/20220113_160435/loader-set-760d9ed7.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/760d9ed7-10f3-4912-ba85-fbc3918db0ab/20220113_160435/monitor-set-760d9ed7.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/760d9ed7-10f3-4912-ba85-fbc3918db0ab/20220113_160435/sct-runner-760d9ed7.tar.gz
Jenkins job URL