scylladb / python-driver

ScyllaDB Python Driver, originally DataStax Python Driver for Apache Cassandra
https://python-driver.docs.scylladb.com
Apache License 2.0
70 stars 42 forks source link

The "Found an invalid row for peer" error is obscure #303

Closed bhalevy closed 5 months ago

bhalevy commented 5 months ago

I'm seeing these error in a dtest that adds 2 new nodes (in parallel) to a 3-nodes cluster.

08:56:37,665 1736154 ccm                            DEBUG    cluster.py          :762  | test_tablets_add_two_nodes_in_parallel: node4: Starting scylla: args=['/home/bhalevy/.dtest/dtest-u7u3ay78/test/node4/bin/scylla', '--options-file', '/home/bhalevy/.dtest/dtest-u7u3ay78/test/no
de4/conf/scylla.yaml', '--log-to-stdout', '1', '--api-address', '127.0.73.4', '--smp', '2', '--memory', '1024M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.73.4', '--unsafe-bypass-fsync', '1', '--kernel-page-cache'
, '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=False wait_for_binary_proto=False
08:56:37,667 1736154 ccm                            DEBUG    cluster.py          :762  | test_tablets_add_two_nodes_in_parallel: node4: Starting scylla-jmx: args=['/home/bhalevy/.dtest/dtest-u7u3ay78/test/node4/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.73.4', '-Djavax.managemen
t.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.73.4', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.73.4', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.
management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/home/bhalevy/.dtest/dtest-u7u3ay78/test/node4/bin/scylla-jmx-1.0.jar']
08:56:37,972 1736154 ccm                            DEBUG    cluster.py          :762  | test_tablets_add_two_nodes_in_parallel: node5: Starting scylla: args=['/home/bhalevy/.dtest/dtest-u7u3ay78/test/node5/bin/scylla', '--options-file', '/home/bhalevy/.dtest/dtest-u7u3ay78/test/no
de5/conf/scylla.yaml', '--log-to-stdout', '1', '--api-address', '127.0.73.5', '--smp', '2', '--memory', '1024M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.73.5', '--unsafe-bypass-fsync', '1', '--kernel-page-cache'
, '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=False wait_for_binary_proto=False
08:56:37,973 1736154 ccm                            DEBUG    cluster.py          :762  | test_tablets_add_two_nodes_in_parallel: node5: Starting scylla-jmx: args=['/home/bhalevy/.dtest/dtest-u7u3ay78/test/node5/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.73.5', '-Djavax.managemen
t.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.73.5', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.73.5', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.
management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/home/bhalevy/.dtest/dtest-u7u3ay78/test/node5/bin/scylla-jmx-1.0.jar']
08:56:43,620 1736154 cassandra.cluster              WARNING  thread.py           :58   | test_tablets_add_two_nodes_in_parallel: Found an invalid row for peer (127.0.73.5). Ignoring host.
08:56:43,620 1736154 cassandra.cluster              INFO     thread.py           :58   | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.4:9042 datacenter1> discovered
08:56:45,521 1736154 cassandra.cluster              WARNING  thread.py           :58   | test_tablets_add_two_nodes_in_parallel: Found an invalid row for peer (127.0.73.5). Ignoring host.
08:56:46,585 1736154 update_cluster_layout_tests    DEBUG    update_cluster_layout_tests.py:436  | test_tablets_add_two_nodes_in_parallel: Check that nodes started successfully
08:56:46,586 1736154 update_cluster_layout_tests    DEBUG    update_cluster_layout_tests.py:441  | test_tablets_add_two_nodes_in_parallel: Verify tablet load-balancing after 10 seconds
08:56:48,206 1736154 cassandra.cluster              INFO     thread.py           :58   | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.5:9042 datacenter1> discovered
08:56:56,590 1736154 cassandra.cluster              INFO     dtest_setup.py      :493  | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.5:9042 datacenter1> discovered
08:56:56,591 1736154 cassandra.cluster              INFO     dtest_setup.py      :493  | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.3:9042 datacenter1> discovered
08:56:56,591 1736154 cassandra.cluster              INFO     dtest_setup.py      :493  | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.2:9042 datacenter1> discovered
08:56:56,591 1736154 cassandra.cluster              INFO     dtest_setup.py      :493  | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.4:9042 datacenter1> discovered

It looks like the error heals itself eventually but it's unclear why the the row in system.peers was considered as invalid, and on which node it was read from. It would help if we printed more information to facilitate debugging in case something is wrong in the scylla server side.

avelanarius commented 5 months ago

The relevant code that checks the "validity" of a system.peers row: https://github.com/scylladb/python-driver/blob/6b82872ef0bbd1fc9cf78fb069a7d7821e3a6dde/cassandra/cluster.py#L4018-L4021 and the warning printed here (https://github.com/scylladb/python-driver/blob/6b82872ef0bbd1fc9cf78fb069a7d7821e3a6dde/cassandra/cluster.py#L3949-L3953) should print more details (entire row?) and which of the cases caused the invalidity of a the row.

avelanarius commented 5 months ago

It looks like the error heals itself eventually but it's unclear why the the row in system.peers was considered as invalid, and on which node it was read from.

I addressed "why the the row in system.peers was considered as invalid" part in #305, however not necessarily the "and on which node it was read from" part but that can be deduced from other logs.

bhalevy commented 5 months ago

It looks like the error heals itself eventually but it's unclear why the the row in system.peers was considered as invalid, and on which node it was read from.

I addressed "why the the row in system.peers was considered as invalid" part in #305, however not necessarily the "and on which node it was read from" part but that can be deduced from other logs.

thanks. hopefully it will be enough. heads up @kbr-scylla