Closed tchaikov closed 4 weeks ago
@tchaikov
I would recommend giving it a ride, at least in gating, to see it's not breaking anything
running at https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/342/
I don't know how useful it would be if we don't log the scylla processes when scylla starts. we really want to identify which test that process belongs to; we already know it's scylla...
does the parent process help? also, the timestamp should help.
not much either
in this situation you have a rogue scylla process up (my guess). you have 5 test processes running at the same time, and dtest/ccm should have killed the cluster at the end of the test
each time a scylla process is about to start, there is this print:
21:04:15,154 741 ccm DEBUG cluster.py :754 | test_load_older_snapshot_and_refresh: node1: Starting scylla: args=['/jenkins/workspace/scylla-6.0/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-7hc6kw6e/test/node1/bin/scylla', '--options-file', '/jenkins/workspace/scylla-6.0/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-7hc6kw6e/test/node1/conf/scylla.yaml', '--log-to-stdout', '1', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc', '1', '--abort-on-internal-error', '1', '--api-address', '127.0.51.1', '--smp', '2', '--memory', '1024M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.51.1', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=True wait_for_binary_proto=True
you need a way to correlate a test failure when listening on a specific address (you have the process id of whoever holds it) with the test that started that scylla process, so you can investigate why cleaning up the process didn't work in that test.
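One way to do that correlation, sketched under assumptions: the "Starting scylla" debug line above already carries the test name, the node name, and the full path of the scylla binary, and that path is unique per test directory. The helper name `index_started_nodes` and the sample path below are hypothetical, and the regex is only fitted to the single log line quoted in this thread.

```python
import ast
import re

# Fitted to the ccm debug line quoted above:
#   "... | <test name>: <node>: Starting scylla: args=[...] wait_other_notice=..."
START_RE = re.compile(
    r"\|\s*(?P<test>[^:\s]+): (?P<node>[^:\s]+): "
    r"Starting scylla: args=(?P<args>\[.*?\])"
)


def index_started_nodes(log_lines):
    """Map each scylla binary path to the (test, node) that launched it."""
    started = {}
    for line in log_lines:
        match = START_RE.search(line)
        if match is None:
            continue
        # args is printed as a Python list literal, so it can be parsed safely.
        args = ast.literal_eval(match.group("args"))
        started[args[0]] = (match.group("test"), match.group("node"))
    return started


# Hypothetical, shortened log line in the same shape as the real one.
sample_log = [
    "21:04:15,154 741 ccm DEBUG cluster.py :754 | "
    "test_load_older_snapshot_and_refresh: node1: Starting scylla: "
    "args=['/tmp/dtest-example/test/node1/bin/scylla', '--smp', '2'] "
    "wait_other_notice=True wait_for_binary_proto=True",
]
index = index_started_nodes(sample_log)
```

Given the command line of the process that holds the address, looking up its first element in `index` would then point at the test whose cleanup did not work.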
@fruch i agree this is not a turn-key solution, but i think this is at least what we can do at this moment. if there is anything else i can do, i would be happy to make it happen.
I think adding a print of the scylla process when we start it would make this PR a bit more useful, and we would be able to hunt down which test started the process
v2:
- print out scylla instance's pid after launching it.
@fruch could you take another look?
I don't see this log print... is it in its own commit?
sorry, it was on another branch. now included in this PR.
yeah looks better
when we start, for instance, a scylla node, we use check_socket_available() to see if the given address can be bound; if not, an exception is raised. but we still have no idea who is listening on this address. in this change, we print out the process in this case, so that we can understand the problem better. in general, it's either an infra issue or a buggy test.
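A minimal sketch of the idea, not the code in this PR: try to bind the address, and if that fails, report which process is listening on it. The names `probe_address` and `find_listener` are hypothetical. The owner lookup assumes the third-party psutil package is installed and that the caller has enough privileges to see the owning pid; without either, the sketch only reports the bind error.

```python
import socket

try:
    import psutil  # third-party and optional in this sketch
except ImportError:
    psutil = None


def find_listener(ip, port):
    """Return (pid, cmdline) of a process listening on ip:port, or None."""
    if psutil is None:
        return None
    try:
        conns = psutil.net_connections(kind="inet")
    except psutil.Error:
        return None  # for example, insufficient privileges
    for conn in conns:
        if conn.status != psutil.CONN_LISTEN or not conn.laddr:
            continue
        # A wildcard listener also blocks a bind to a specific address.
        if conn.laddr.port != port or conn.laddr.ip not in (ip, "0.0.0.0", "::"):
            continue
        if conn.pid is None:
            return None
        try:
            return conn.pid, psutil.Process(conn.pid).cmdline()
        except psutil.Error:
            return conn.pid, []
    return None


def probe_address(ip, port):
    """Return (True, '') if ip:port can be bound, else (False, details)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, port))
            return True, ""
        except OSError as exc:
            detail = f"{ip}:{port} is not available: {exc}"
    owner = find_listener(ip, port)
    if owner is not None:
        pid, cmdline = owner
        detail += f"; held by pid {pid}: {' '.join(cmdline)}"
    return False, detail
```

The command line in the message is the useful part for debugging: for a leftover scylla node it includes the per-test directory, which helps tell a buggy test from an infra issue.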