scylladb / scylla-ccm

Cassandra Cluster Manager, modified for Scylla
Apache License 2.0

ccmlib/common: print out the process if it's listening on the given address #579

Closed tchaikov closed 4 weeks ago

tchaikov commented 4 weeks ago

when we start, for instance, a scylla node, we use check_socket_available() to see if the given address can be bound; if not, an exception is raised. but we still have no idea which process is listening on this address.

in this change, we print out the listening process in that case, so that we can understand the problem better. in general, it's either an infra issue or a buggy test.
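
As a rough illustration of the idea (not the code in this PR), a bind check that also reports the listener could look like the sketch below. The names `AddressInUseError`, `describe_listener` and `ensure_address_free` are hypothetical, and the lookup assumes the third-party psutil package is installed; without it, or without permission to inspect sockets, the owner is simply reported as unknown.

```python
import socket


class AddressInUseError(Exception):
    """Hypothetical exception type for this sketch, not ccm's own."""


def describe_listener(host, port):
    """Best-effort description of the process listening on host:port.

    Assumption: relies on the optional third-party psutil package.
    Returns None when psutil is missing, the lookup is not permitted,
    or no matching listening socket is found.
    """
    try:
        import psutil
    except ImportError:
        return None
    try:
        conns = psutil.net_connections(kind="inet")
    except (psutil.Error, OSError):
        return None
    for conn in conns:
        if conn.status != psutil.CONN_LISTEN or not conn.laddr:
            continue
        # A wildcard listener also blocks binding to a specific address.
        if conn.laddr.port == port and conn.laddr.ip in (host, "0.0.0.0", "::"):
            if conn.pid is None:
                return "unknown process (pid not visible)"
            try:
                cmdline = psutil.Process(conn.pid).cmdline()
                return "pid=%d cmdline=%r" % (conn.pid, cmdline)
            except psutil.Error:
                return "pid=%d" % conn.pid
    return None


def ensure_address_free(host, port):
    """Raise AddressInUseError, naming the holder if known, when bind fails."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        owner = describe_listener(host, port) or "unknown"
        raise AddressInUseError(
            "%s:%d is not available (%s); held by: %s" % (host, port, exc, owner)
        ) from exc
    finally:
        sock.close()
```

The lookup only runs on the failure path, so the common case pays nothing extra, and a failed lookup never hides the original bind error.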

tchaikov commented 4 weeks ago

this change was inspired by https://jenkins.scylladb.com/job/scylla-6.0/job/next/47/testReport/junit/nodetool_additional_test/TestNodetool/Build___x86___dtest_with_topology_changes___test_status/

fruch commented 4 weeks ago

@tchaikov

I would recommend giving it a run, at least in gating, to see that it's not breaking anything

tchaikov commented 4 weeks ago

I don't know how useful it would be, if we don't log the scylla processes when scylla starts

we really want to identify which test that process belongs to, we know it's scylla...

does the parent process help? also, the timestamp should help.

tchaikov commented 4 weeks ago

@tchaikov

I would recommend giving it a run, at least in gating, to see that it's not breaking anything

running at https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/342/

fruch commented 4 weeks ago

I don't know how useful it would be, if we don't log the scylla processes when scylla starts

we really want to identify which test that process belongs to, we know it's scylla...

does the parent process help? also, the timestamp should help.

not much either

in this situation you have a rogue scylla process up (my guess). you have 5 test processes running at the same time, and dtest/ccm should have killed the cluster at the end of the test

each time a scylla process is about to start, there is this print:

21:04:15,154 741     ccm                            DEBUG    cluster.py          :754  | test_load_older_snapshot_and_refresh: node1: Starting scylla: args=['/jenkins/workspace/scylla-6.0/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-7hc6kw6e/test/node1/bin/scylla', '--options-file', '/jenkins/workspace/scylla-6.0/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-7hc6kw6e/test/node1/conf/scylla.yaml', '--log-to-stdout', '1', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc', '1', '--abort-on-internal-error', '1', '--api-address', '127.0.51.1', '--smp', '2', '--memory', '1024M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.51.1', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=True wait_for_binary_proto=True

you need a way to correlate the test that failed to listen on a specific address (where you have the process id of whoever holds it) with the test that started that scylla process, so you can investigate why cleaning up the process didn't work in that test.
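
One possible way to do that correlation from the existing debug output is sketched below. It assumes the "Starting scylla" lines keep the shape shown above (test name, node name, then a Python-style `args=[...]` list); the regex and the helper name `index_starts_by_address` are hypothetical and are not part of ccm or dtest.

```python
import ast
import re

# Assumption: matches "... | <test>: <node>: Starting scylla: args=[...]".
START_RE = re.compile(
    r"\|\s*(?P<test>\w+):\s*(?P<node>\w+):\s*Starting scylla:\s*args=(?P<args>\[.*\])"
)


def index_starts_by_address(lines):
    """Map each --api-address value to the (test, node) pairs that used it."""
    owners = {}
    for line in lines:
        match = START_RE.search(line)
        if not match:
            continue
        try:
            # The args list is printed as a Python literal, so parse it as one.
            args = ast.literal_eval(match.group("args"))
        except (ValueError, SyntaxError):
            continue
        if "--api-address" not in args:
            continue
        idx = args.index("--api-address")
        if idx + 1 < len(args):
            owners.setdefault(args[idx + 1], []).append(
                (match.group("test"), match.group("node"))
            )
    return owners
```

Given the address from the failed bind, looking it up in the returned dict lists every test that started a node there, in log order. The last entry before the failure is the first suspect.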

tchaikov commented 4 weeks ago

@fruch i agree this is not a turn-key solution, but i think this is at least what we can do at this moment. if there is anything else i can do, i would be happy to make it happen.

fruch commented 4 weeks ago

@fruch i agree this is not a turn-key solution, but i think this is at least what we can do at this moment. if there is anything else i can do, i would be happy to make it happen.

I think adding a print of the scylla process when we start it would make this PR a bit more useful, and we would be able to hunt down which test started the process
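
A minimal sketch of such a print, assuming the node is launched with `subprocess.Popen`; the logger name and the helper `start_and_log` are made up for illustration and do not reflect how ccm actually starts scylla:

```python
import logging
import subprocess
import sys

# Hypothetical logger name; a real project would reuse its existing logger.
logger = logging.getLogger("ccm-sketch")


def start_and_log(args, test_name, node_name):
    """Start a process and log its pid next to the test and node names."""
    proc = subprocess.Popen(args)
    logger.debug(
        "%s: %s: started pid=%d args=%s", test_name, node_name, proc.pid, args
    )
    return proc


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    # Stand-in command: a short-lived Python process instead of scylla.
    child = start_and_log([sys.executable, "-c", "pass"], "test_example", "node1")
    child.wait()
```

With the pid in the start-up log, the pid reported for a busy address can be searched for directly in the logs of the concurrently running tests.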

tchaikov commented 4 weeks ago

v2:

@fruch could you take another look?

fruch commented 4 weeks ago

v2:

  • print out scylla instance's pid after launching it.

@fruch could you take another look?

I don't see this log print... is it in its own commit?

tchaikov commented 4 weeks ago

v2:

  • print out scylla instance's pid after launching it.

@fruch could you take another look?

I don't see this log print... is it in its own commit?

sorry, it was on another branch. now included in this PR.

fruch commented 4 weeks ago

v2:

  • print out scylla instance's pid after launching it.

@fruch could you take another look?

I don't see this log print... is it in its own commit?

sorry, it was on another branch. now included in this PR.

yeah looks better