Open · juliayakovlev opened this issue 2 years ago
@juliayakovlev why do you expect the nemesis to fail? (I'm guessing the coredump happened while Scylla was shutting down, and once we finish the ENOSPC nemesis we start Scylla back up, so the nemesis isn't aware of the coredump taking place.)
The main SCT issue in this case is that we don't have enough room to save the coredump (it's a long-standing issue for SCT).
@fruch I am not sure that aborting on a shard and the coredump are expected. I opened an issue for that; let's see how the developers respond. If the core is expected, we need to ignore it. Right now we create an error event.
I meant that the nemesis not erroring is the expected behavior of SCT (currently).
As for Scylla, yes, every coredump is a bug (historically it wasn't treated with much severity, because it was hard to gather data about the crash).
If Scylla generated a coredump in an ENOSPC situation, it's a Scylla bug. The DB error can't be acceptable.
@amoskong are you coming back? 😃
This issue is stale because it has been open 2 years with no activity. Remove stale label or comment or this will be closed in 2 days.
This issue was closed because it has been stalled for 2 days with no activity.
@scylladb/qa-maintainers what do you think about marking a nemesis as failed if error events were raised during its execution time?
I'm not sure if it would help more or confuse more...
My opinion is to fail a nemesis in case of error events: my assumption is that a Scylla (or SCT) error is a nemesis failure. Also, looking at the Nemesis tab, it would be clearer when error events showed up. There's also the ES perspective, showing which nemeses are causing us trouble.
On the other hand, when it's parallel nemeses it would be a bit confusing.
But all in all, it might prove helpful (and maybe we won't enable it in all cases, e.g. parallel nemeses).
My approach is a bit different: I differentiate between the nemesis operation and Scylla. A failure of a nemesis is an operation that the nemesis tried and failed. A failure of Scylla during a nemesis (which may or may not be related to the nemesis) is just a symptom and isn't related specifically to the nemesis.
Having said that, if we fail the nemesis it will help us with the pass/fail statistics we keep in Elasticsearch.
That is correct, but in 90% of the cases it is connected.
So we might have a few false positives, but in those cases the test would fail anyway, and we'd need to investigate what was going on. Having one event hook that fails the nemesis won't be such a drastic change.
It can also help investigations; we already did it, in a way, in the opposite direction: we mark on the event which nemesis was running when it fired.
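To make the idea concrete, here is a minimal sketch of the policy being discussed, assuming a simple event bus. This is not SCT's actual API: names like `EventBus`, `ErrorEvent`, `NemesisReport` and `run_nemesis` are made up for illustration. The sketch stamps the running nemesis on each error event and marks the nemesis failed if any error events were raised inside its execution window.

```python
# Hypothetical sketch only, not SCT's real API. All names below are illustrative.
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ErrorEvent:
    message: str
    nemesis: Optional[str] = None           # which nemesis was running when it fired
    timestamp: float = field(default_factory=time.time)


class EventBus:
    """Collects error events and stamps the currently running nemesis on each one."""

    def __init__(self) -> None:
        self.events: List[ErrorEvent] = []
        self.current_nemesis: Optional[str] = None

    def publish_error(self, message: str) -> None:
        self.events.append(ErrorEvent(message=message, nemesis=self.current_nemesis))


@dataclass
class NemesisReport:
    name: str
    passed: bool
    errors: List[ErrorEvent]


def run_nemesis(bus: EventBus, name: str, disrupt: Callable[[], None]) -> NemesisReport:
    """Run one disruption; fail it if any error event fired during its execution window."""
    start = len(bus.events)
    bus.current_nemesis = name
    operation_failed = False
    try:
        disrupt()
    except Exception as exc:                 # the nemesis operation itself failed
        operation_failed = True
        bus.publish_error(f"{name} raised: {exc}")
    finally:
        bus.current_nemesis = None
    window_errors = bus.events[start:]
    return NemesisReport(name=name,
                         passed=not operation_failed and not window_errors,
                         errors=window_errors)


if __name__ == "__main__":
    bus = EventBus()

    def disrupt_nodetool_enospc() -> None:
        # Simulate a coredump being reported while the disk is full.
        bus.publish_error("coredump detected on node-3")

    report = run_nemesis(bus, "disrupt_nodetool_enospc", disrupt_nodetool_enospc)
    print(report.name, "passed" if report.passed else "failed", f"errors={len(report.errors)}")
```

Note the same caveat raised above: with parallel nemeses, attributing errors by time window alone is ambiguous, so the decision would have to rely on the nemesis name stamped on the event rather than on the window.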
Got a coredump during disrupt_nodetool_enospc, but the nemesis passed:
http://13.48.103.68/test/e924afae-c3b6-4e75-b4f2-05d5ead176aa/runs?additionalRuns[]=2bed6bc8-ab13-4f0c-85a9-8fb94a501ded
Installation details

Kernel Version: 5.13.0-1029-aws
Scylla version (or git commit hash): 5.0~rc8-20220612.f28542a71 with build-id 85cf87619b93155a574647ec252ce5a043c7fe77
Cluster size: 5 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-088626a20ac6084a7 (aws: eu-west-1)

Test: longevity-mv-si-4days-test
Test id: 2bed6bc8-ab13-4f0c-85a9-8fb94a501ded
Test name: scylla-5.0/longevity/longevity-mv-si-4days-test
Test config file(s): longevity-mv-si-4days.yaml

Restore Monitor Stack command:
$ hydra investigate show-monitor 2bed6bc8-ab13-4f0c-85a9-8fb94a501ded
Restore monitor on AWS instance using Jenkins job

Show all stored logs command:
$ hydra investigate show-logs 2bed6bc8-ab13-4f0c-85a9-8fb94a501ded

Logs:
No logs captured during this run.

Jenkins job URL