Collect system.log on db instance eviction

scylladb / scylla-cluster-tests

Tests for Scylla Clusters

GNU Affero General Public License v3.0


Collect system.log on db instance eviction #6688

Closed: soyacz closed this 20 hours ago

soyacz commented 11 months ago

When we decommission a Scylla node, we evict its VM instance. We collect system.log from the db nodes, but because the VMs of decommissioned nodes no longer exist, we skip them. This log is valuable, though: it has proper timestamps and is complete.

The task is to collect it before the instance is evicted.
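A minimal sketch of the ordering this task asks for, assuming collection and termination can be passed in as plain callables. `evict_with_log_collection`, `collect_logs` and `terminate_instance` are hypothetical names, not existing SCT APIs. The point it illustrates is that a failed collection is logged but never prevents the eviction.

```python
import logging
from typing import Callable

log = logging.getLogger(__name__)


def evict_with_log_collection(
    node_name: str,
    collect_logs: Callable[[str], None],
    terminate_instance: Callable[[str], None],
) -> bool:
    """Collect logs from a node before its VM instance is evicted.

    Both callables are hypothetical placeholders for the real SCT
    operations. Returns True if log collection succeeded.
    """
    collected = False
    try:
        # Grab system.log while the VM still exists.
        collect_logs(node_name)
        collected = True
    except Exception:
        # Collection is best effort: log the failure and carry on.
        log.exception("log collection failed for %s", node_name)
    finally:
        # Evict the instance whether or not collection worked.
        terminate_instance(node_name)
    return collected
```

For example, `evict_with_log_collection("db-node-3", fetch, destroy)` runs `fetch` first and `destroy` second, and still calls `destroy` if `fetch` raises.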

SCT PR:

https://github.com/scylladb/scylla-cluster-tests/pull/6696

roydahan commented 11 months ago

We need to think about what to do when we terminate a node and expect it to be an almost immediate operation. If collecting the log from that instance adds a significant amount of time, it may affect the tests and probably many of the stats we collect.

fruch commented 11 months ago


Collecting from a single node shouldn't take that long; it's just a handful of SSH commands, and (as far as I understand) we are not uploading anything to S3 at that stage. We should measure it as we make this change.

Anyhow, we can do just the system.log part if the full collection takes too much time.
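One possible policy combining "measure it" with "just do the system.log part" is sketched below, assuming the collection steps can be split into two hypothetical callables (`collect_system_log` and `collect_rest`) and that a per-node time budget is configurable. None of these names exist in SCT; the sketch always collects system.log first, records how long each step took, and skips the remaining collection when system.log alone already used up the budget.

```python
import time
from typing import Callable, Dict, Optional


def collect_before_eviction(
    node_name: str,
    collect_system_log: Callable[[str], None],
    collect_rest: Callable[[str], None],
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Optional[float]]:
    """Collect system.log first, then the rest only if the budget allows.

    Returns the duration in seconds of each step; None means skipped.
    """
    timings: Dict[str, Optional[float]] = {"system_log": None, "rest": None}

    # system.log is the part we always want, so it goes first.
    start = clock()
    collect_system_log(node_name)
    timings["system_log"] = clock() - start

    # Fall back to system.log only if it already exhausted the budget.
    if timings["system_log"] < budget_s:
        rest_start = clock()
        collect_rest(node_name)
        timings["rest"] = clock() - rest_start
    return timings
```

Returning the timings makes it easy to report them alongside the test stats, so the cost of the change can be measured rather than guessed.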

fruch commented 11 months ago

Also, if we have code that assumes taking a node down is immediate, that code should be fixed (I don't think we have such delicate code).

fruch commented 10 months ago

@soyacz was trying it out in https://github.com/scylladb/scylla-cluster-tests/pull/6696

but for now we are dropping it; it's a bit more complicated than we estimated.

fruch commented 8 months ago

@soyacz

This was raised again in the context of running Scylla code coverage with SCT: we'll need to make sure we dump and save the coverage information whenever we stop or kill a node.

soyacz commented 8 months ago

Where is the coverage information stored?

fruch commented 8 months ago


We haven't had runs with it on SCT yet. We were discussing it with @eliransin, who recently pushed all of the support for it into Scylla core.

It will be wherever we point it to (on the VM itself).

roydahan commented 8 months ago

Let's try a quick implementation just for coverage collection. It's a 60 MB file, so it should be fairly quick.

(We also need to make sure we dump the metrics before any violent kill (hard reboot, kill -9, etc.), but that should be a separate task.)