Closed soyacz closed 20 hours ago
We need to think about what to do when we terminate a node and expect it to be an almost immediate operation. If collecting the log from that instance adds a significant amount of time, it may affect the tests and probably many of the stats we collect.
Collecting from a single node shouldn't take that long. It's just a handful of ssh commands, and we are not uploading to S3 at that stage (as far as I understand). We should measure it as we make this change.
Anyhow, we can collect just the system.log part if doing the full collection takes too much time.
Also, if we have code that assumes taking a node down is immediate, that code should be fixed (I don't think we have such delicate code).
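To make the "measure it" point concrete, here is a minimal sketch of timing a pre-termination collection step and bounding it with a time budget. It is not SCT code: `timed_collect`, the `collect` callable, and the `timeout_s` budget are all hypothetical names chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def timed_collect(collect, timeout_s):
    """Run collect() under a time budget.

    Returns (result, elapsed_seconds, timed_out). `collect` is a hypothetical
    callable standing in for whatever gathers logs from the node.
    """
    start = time.monotonic()
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(collect)
    try:
        result = future.result(timeout=timeout_s)
        timed_out = False
    except FutureTimeout:
        # Budget exceeded: give up waiting so termination is not held back.
        result, timed_out = None, True
    finally:
        # Do not block on the worker; it finishes in the background.
        pool.shutdown(wait=False)
    return result, time.monotonic() - start, timed_out
```

The elapsed time can be reported alongside the other stats, and a timeout can be used as the signal to fall back to collecting only system.log.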
@soyacz was trying it out in https://github.com/scylladb/scylla-cluster-tests/pull/6696
but for now we are dropping it; it's a bit more complicated than we estimated.
@soyacz
This was raised again in the context of doing scylla code coverage runs with SCT: we'll need to make sure we dump and save the coverage information whenever we stop or kill a node.
Where is the coverage information stored?
We haven't had runs with it on SCT yet. We were discussing it with @eliransin, who recently pushed all of the support for that into scylla core.
It would be wherever we point it to (on the VM itself).
Let's try to have a quick implementation only for coverage collection. It's a 60MB file, should be fairly quick.
(We also need to make sure we dump the metrics before any violent kill (hard reboot, kill -9, etc), but that should be a different task).
When we decommission a scylla node, we evict the VM instance. We collect `system.log` from db nodes, but because the VMs of decommissioned nodes no longer exist, we skip them. This log is valuable (it has proper timestamps and is complete). The task is about collecting it before instance eviction.
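
The ordering the task asks for can be sketched as below: fetch the log first, then terminate, and never let a collection failure block the termination. This is only an illustration under assumed names; `Node`, `fetch_system_log`, `terminate`, and `terminate_with_log` are hypothetical and are not the SCT API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a db node backed by a VM instance."""
    name: str
    alive: bool = True
    events: List[str] = field(default_factory=list)

    def fetch_system_log(self) -> str:
        # The log can only be read while the VM still exists.
        if not self.alive:
            raise RuntimeError(f"{self.name}: instance no longer exists")
        self.events.append("collect")
        return f"system.log from {self.name}"

    def terminate(self) -> None:
        self.events.append("terminate")
        self.alive = False


def terminate_with_log(node: Node, store: Dict[str, Optional[str]]) -> None:
    """Save the node's system.log into `store`, then terminate the node."""
    try:
        store[node.name] = node.fetch_system_log()
    except Exception:
        # A failed collection is recorded but must not prevent termination.
        store[node.name] = None
    finally:
        node.terminate()
```

Keeping the collection inside the termination path means every caller that evicts an instance gets the log saved without having to remember a separate step.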
SCT PR: