scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

nemesis should be able to report failures based on error events during its execution #4926

Open juliayakovlev opened 2 years ago

juliayakovlev commented 2 years ago

Got a coredump during disrupt_nodetool_enospc, but the nemesis was reported as passed

http://13.48.103.68/test/e924afae-c3b6-4e75-b4f2-05d5ead176aa/runs?additionalRuns[]=2bed6bc8-ab13-4f0c-85a9-8fb94a501ded

2022-06-22 02:28:24.745: (CoreDumpEvent Severity.ERROR) period_type=one-time event_id=90f7d36f-876d-46e3-844c-ee458659a2a7 node=Node longevity-mv-si-4d-5-0-db-node-2bed6bc8-4 [54.246.19.113 | 10.0.0.125] (seed: False)
2022-06-22 02:29:15.720: (DisruptionEvent Severity.NORMAL) period_type=end event_id=6494b0c5-5dd9-444d-a3fe-d8d0548b3519 duration=7m54s: nemesis_name=NodetoolEnospc target_node=Node longevity-mv-si-4d-5-0-db-node-2bed6bc8-4 [54.246.19.113 | 10.0.0.125] (seed: False)
Nemesis Information
Class: Sisyphus
Name: disrupt_nodetool_enospc
Status: Succeeded
Duration: 9 minutes, 34 seconds
Target Information
Name: longevity-mv-si-4d-5-0-db-node-2bed6bc8-4
Public IP: 54.246.19.113
Private IP: 10.0.0.125
State: running
Shards: 14
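
As a rough cross-check of the two event lines above, the sketch below works out whether the coredump timestamp falls inside the nemesis window. It assumes that `duration=7m54s` on the `DisruptionEvent` end line is measured back from that line's timestamp; `parse_duration` is a hypothetical helper written for this example, not an SCT function.

```python
import re
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_duration(text: str) -> timedelta:
    """Parse a duration such as '7m54s' or '1h2m3s' (hypothetical helper)."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Timestamps and duration copied from the two event lines in this report.
coredump_at = datetime.strptime("2022-06-22 02:28:24.745", TS_FORMAT)
nemesis_end = datetime.strptime("2022-06-22 02:29:15.720", TS_FORMAT)

# Assumption: the duration counts back from the end event's timestamp.
nemesis_start = nemesis_end - parse_duration("7m54s")

coredump_during_nemesis = nemesis_start <= coredump_at <= nemesis_end
print(f"nemesis window: {nemesis_start} .. {nemesis_end}")
print(f"coredump inside window: {coredump_during_nemesis}")
```

Under that assumption the window starts at 02:21:21.720, so the coredump at 02:28:24.745 lands inside it, roughly 51 seconds before the nemesis ended.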

Installation details

Kernel Version: 5.13.0-1029-aws
Scylla version (or git commit hash): 5.0~rc8-20220612.f28542a71 with build-id 85cf87619b93155a574647ec252ce5a043c7fe77
Cluster size: 5 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-088626a20ac6084a7 (aws: eu-west-1)

Test: longevity-mv-si-4days-test
Test id: 2bed6bc8-ab13-4f0c-85a9-8fb94a501ded
Test name: scylla-5.0/longevity/longevity-mv-si-4days-test
Test config file(s):

Logs:

No logs captured during this run.

Jenkins job URL

fruch commented 2 years ago

@juliayakovlev why do you expect the nemesis to fail? (I'm guessing the coredump happened while Scylla was shutting down; once we finish the ENOSPC nemesis, we start Scylla back up, so the nemesis isn't aware that a coredump took place.)

The main SCT issue in this case is that we don't have enough room to save the coredump. (It's a long-standing issue for SCT.)

juliayakovlev commented 2 years ago

> @juliayakovlev why do you expect the nemesis to fail? (I'm guessing the coredump happened while Scylla was shutting down; once we finish the ENOSPC nemesis, we start Scylla back up, so the nemesis isn't aware that a coredump took place.)
>
> The main SCT issue in this case is that we don't have enough room to save the coredump. (It's a long-standing issue for SCT.)

@fruch I am not sure that the abort on a shard and the coredump are expected. I opened an issue for that. Let's see how dev responds. If the core is expected, we need to ignore it. For now, we create an error event.

fruch commented 2 years ago

> > @juliayakovlev why do you expect the nemesis to fail? (I'm guessing the coredump happened while Scylla was shutting down; once we finish the ENOSPC nemesis, we start Scylla back up, so the nemesis isn't aware that a coredump took place.) The main SCT issue in this case is that we don't have enough room to save the coredump. (It's a long-standing issue for SCT.)
>
> @fruch I am not sure that the abort on a shard and the coredump are expected. I opened an issue for that. Let's see how dev responds. If the core is expected, we need to ignore it. For now, we create an error event.

I meant that the nemesis not reporting an error is the expected behavior of SCT (currently).

As for Scylla, yes, every coredump is a bug. (Historically that wasn't treated with much severity, because it was hard to gather data about the crash.)

amoskong commented 2 years ago

If Scylla generated a coredump in an ENOSPC situation, it's a Scylla bug. A DB error like this is not acceptable.

fruch commented 2 years ago

> If Scylla generated a coredump in an ENOSPC situation, it's a Scylla bug. A DB error like this is not acceptable.

@amoskong are you coming back? 😃

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 2 years with no activity. Remove stale label or comment or this will be closed in 2 days.

github-actions[bot] commented 1 month ago

This issue was closed because it has been stalled for 2 days with no activity.

fruch commented 1 month ago

@scylladb/qa-maintainers what do you think about marking a nemesis as failed if error events were raised during its execution?

I'm not sure whether it would help more or confuse more...

soyacz commented 1 month ago

> @scylladb/qa-maintainers what do you think about marking a nemesis as failed if error events were raised during its execution?
>
> I'm not sure whether it would help more or confuse more...

My opinion is that we should fail a nemesis when Error events occur: my assumption is that a Scylla (or SCT) error is a nemesis failure. It would also make the Nemesis tab clearer about when error events showed up. There's also the ES perspective, which shows which nemeses are causing us trouble.

fruch commented 1 month ago

> > @scylladb/qa-maintainers what do you think about marking a nemesis as failed if error events were raised during its execution? I'm not sure whether it would help more or confuse more...
>
> My opinion is that we should fail a nemesis when Error events occur: my assumption is that a Scylla (or SCT) error is a nemesis failure. It would also make the Nemesis tab clearer about when error events showed up. There's also the ES perspective, which shows which nemeses are causing us trouble.

On the other hand, with parallel nemeses it would be a bit confusing.

But all things considered, it might prove helpful (and maybe we won't enable it in all cases, e.g. parallel nemesis).

roydahan commented 1 month ago

My approach is a bit different: I differentiate between the nemesis operation and Scylla. A nemesis failure is an operation that the nemesis tried and failed. A Scylla failure during a nemesis (which may or may not be related to the nemesis) is just a symptom and isn't specifically tied to the nemesis.

Having said that, failing the nemesis would help us with the pass/fail statistics we keep in Elasticsearch.

fruch commented 1 month ago

> My approach is a bit different: I differentiate between the nemesis operation and Scylla. A nemesis failure is an operation that the nemesis tried and failed. A Scylla failure during a nemesis (which may or may not be related to the nemesis) is just a symptom and isn't specifically tied to the nemesis.

That is correct, but in 90% of the cases it is connected.

So we might get a few false positives, but in those cases the test is going to fail anyhow, and we'll need to investigate what was going on. Having one error event mark the nemesis as failed wouldn't be such a drastic change.

> Having said that, failing the nemesis would help us with the pass/fail statistics we keep in Elasticsearch.

It can also help investigations. In a way, we already did this in the opposite direction: we mark on each event which nemesis was running when it fired.
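
The proposal discussed in this thread could look roughly like the sketch below: collect events of severity ERROR or higher while a nemesis runs and, optionally, mark the nemesis as failed if any were seen. Everything here (`EventBus`, `Event`, `NemesisResult`, `track_nemesis`, the `fail_on_error_events` flag, and the numeric `Severity` values) is a hypothetical stand-in invented for illustration, not SCT's actual API. The flag reflects the suggestion above not to enable the behavior everywhere, e.g. for parallel nemesis.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical ordering, used only so severities can be compared.
    NORMAL = 1
    WARNING = 2
    ERROR = 3
    CRITICAL = 4


@dataclass
class Event:
    name: str
    severity: Severity


@dataclass
class NemesisResult:
    name: str
    status: str = "running"
    error_events: list = field(default_factory=list)


class EventBus:
    """Hypothetical in-memory stand-in for the real event stream."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def unsubscribe(self, callback):
        self._subscribers.remove(callback)

    def publish(self, event):
        for callback in list(self._subscribers):
            callback(event)


@contextmanager
def track_nemesis(bus, name, fail_on_error_events=True):
    """Record error events raised while the nemesis body is executing."""
    result = NemesisResult(name)

    def on_event(event):
        if event.severity >= Severity.ERROR:
            result.error_events.append(event)

    bus.subscribe(on_event)
    try:
        yield result
    except Exception:
        # The nemesis operation itself failed.
        result.status = "failed"
        raise
    else:
        # The operation succeeded; optionally fail on observed error events.
        if fail_on_error_events and result.error_events:
            result.status = "failed"
        else:
            result.status = "succeeded"
    finally:
        bus.unsubscribe(on_event)


bus = EventBus()

# Default behavior: an error event during the nemesis marks it as failed.
with track_nemesis(bus, "disrupt_nodetool_enospc") as strict:
    bus.publish(Event("CoreDumpEvent", Severity.ERROR))

# Opt-out (e.g. parallel nemesis): the event is recorded but does not fail it.
with track_nemesis(bus, "disrupt_nodetool_enospc", fail_on_error_events=False) as lenient:
    bus.publish(Event("CoreDumpEvent", Severity.ERROR))

print(strict.status, lenient.status)
```

Keeping `error_events` on the result even when the flag is off preserves the link between a nemesis and the events that fired during it, so the statistics and the investigation use case can both be served without forcing a failure status.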