scylladb / scylla-ccm

Cassandra Cluster Manager, modified for Scylla
Apache License 2.0
22 stars 67 forks source link

When nodetool command timeout, kill scylla-jmx with sigquit #303

Open fruch opened 3 years ago

fruch commented 3 years ago

We had few reports across the broad that nodetool commands are getting stuck.

a suggestion was raise by @elcallio to try catch and collect enough information when those things happen: https://github.com/scylladb/scylla/issues/7991#issuecomment-771468592

bhalevy commented 3 years ago

Refs scylladb/scylla#7244

bhalevy commented 3 years ago

@fruch who's working on this?

fruch commented 3 years ago

@fruch who's working on this?

Currently no one... there are more incidents of that getting stuck ?

bhalevy commented 3 years ago

Yes, this frequently happens in dtest, see scylladb/scylla#7244 We need to get to this ASAP.

fruch commented 3 years ago

I'll take a closer look (it mainly happen on those sec. indexes tests ? or on more then one test file / class ?)

bhalevy commented 3 years ago

I'll take a closer look (it mainly happen on those sec. indexes tests ? or on more then one test file / class ?)

Yes, mostly. Some recent examples: https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/dtest-release-parallel/68/testReport/secondary_indexes_test/TestSecondaryIndexes/Run_Dtest_Parallel_Cloud_Machines___FullDtest___split005___test_insert_data_after_recreating_cf/ https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/dtest-release-parallel/68/testReport/secondary_indexes_test/TestSecondaryIndexes/Run_Dtest_Parallel_Cloud_Machines___FullDtest___split004___test_insert_data_after_recreating_ks/ https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/dtest-release-parallel/68/testReport/limits_test/TestLimits/Run_Dtest_Parallel_Cloud_Machines___FullDtest___split060___max_column_value_size_test/

bhalevy commented 3 years ago

Also https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/718/testReport/cfid_test/TestCFID/cfid_test/

fruch commented 3 years ago

@elcallio seem that other places point to using SIGQUIT, (and not SIGINT), https://access.redhat.com/solutions/18178

I wasn't able to reproduce those issue, but I want to make sure I'm using the correct signal (and capturing the correct logs)

bhalevy commented 3 years ago

Unfortunately, we still don't see the stacktrace :-( https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/dtest.log

2021-02-16 13:43:55,626 23284   ccm                            ERROR    | node2: nodetool timeout, going to kill scylla-jmx with SIGQUIT

2021-02-16 13:44:00,633 23284   dtest                          DEBUG    | secondary_indexes_test.py:TestSecondaryIndexes.test_insert_data_after_recreating_cf - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_insert_data_after_recreating_cf>, (<class 'subprocess.TimeoutExpired'>, TimeoutExpired(['/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool', '-h', '127.0.60.2', '-p', '7260', 'info'], 60), <traceback object at 0x7f116ab0f3c0>))]

But https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/1613483043675_secondary_indexes_test.TestSecondaryIndexes.test_insert_data_after_recreating_cf/node2_system.log.jmx

Using config file: /jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-4i__py64/test/node2/conf/scylla.yaml
Connecting to http://127.0.60.2:10000
Starting the JMX server
JMX is enabled to receive remote connections on port: 7260
bhalevy commented 3 years ago

Interestingly, but likely unrelated, shortly after, in a following test, there's this:

2021-02-16 13:47:50,886 2315    dtest                          DEBUG    | secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_after_index_build - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_remove_node_after_index_build>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab0e500>))]

https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/1613483955639_snapshot_test.TestSchemaFileInSnapshot.test_restoring_by_schema_with_mv_use_sstableloader_multiple_tables/node1_system.log.jmx

Using config file: /jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-f_t8avbo/test/node1/conf/scylla.yaml
Error: JMX connector server communication error: service:jmx:rmi://76836c581099:7177
sun.management.AgentConfigurationError: java.rmi.server.ExportException: Port already in use: 7177; nested exception is: 
    java.net.BindException: Address already in use (Bind failed)
    at sun.management.jmxremote.ConnectorBootstrap.exportMBeanServer(ConnectorBootstrap.java:800)
    at sun.management.jmxremote.ConnectorBootstrap.startRemoteConnectorServer(ConnectorBootstrap.java:468)
    at sun.management.Agent.startAgent(Agent.java:262)
    at sun.management.Agent.startAgent(Agent.java:452)
Caused by: java.rmi.server.ExportException: Port already in use: 7177; nested exception is: 
    java.net.BindException: Address already in use (Bind failed)
    at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:346)
    at sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:254)
    at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:411)
    at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:147)
    at sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:236)
    at sun.management.jmxremote.ConnectorBootstrap$PermanentExporter.exportObject(ConnectorBootstrap.java:199)
    at javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:146)
    at javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:122)
    at javax.management.remote.rmi.RMIConnectorServer.start(RMIConnectorServer.java:404)
    at sun.management.jmxremote.ConnectorBootstrap.exportMBeanServer(ConnectorBootstrap.java:796)
    ... 3 more
Caused by: java.net.BindException: Address already in use (Bind failed)
    at java.net.PlainSocketImpl.socketBind(Native Method)
    at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
    at java.net.ServerSocket.bind(ServerSocket.java:375)
    at java.net.ServerSocket.<init>(ServerSocket.java:237)
    at java.net.ServerSocket.<init>(ServerSocket.java:128)
    at sun.rmi.transport.proxy.RMIDirectSocketFactory.createServerSocket(RMIDirectSocketFactory.java:45)
    at sun.rmi.transport.proxy.RMIMasterSocketFactory.createServerSocket(RMIMasterSocketFactory.java:345)
    at sun.rmi.transport.tcp.TCPEndpoint.newServerSocket(TCPEndpoint.java:666)
    at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:335)

Note that a different post is in use.

bhalevy commented 3 years ago

It smells like scylla-jmx might not exit cleanly and is holding on to the api port:

bhalevy@dt tmp$ egrep 'Unable to connect to Scylla API server|TimeoutExpired' dtest-745.log
2021-02-16 10:35:32,995 30817   dtest                          DEBUG    | nodetool_additional_test.py:TestNodetool.scrub_with_one_node_expect_data_loss_test - Test failed with errors: [(<nodetool_additional_test.TestNodetool testMethod=scrub_with_one_node_expect_data_loss_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.55.1 -p 7155 scrub ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab86230>))]
2021-02-16 13:44:00,633 23284   dtest                          DEBUG    | secondary_indexes_test.py:TestSecondaryIndexes.test_insert_data_after_recreating_cf - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_insert_data_after_recreating_cf>, (<class 'subprocess.TimeoutExpired'>, TimeoutExpired(['/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool', '-h', '127.0.60.2', '-p', '7260', 'info'], 60), <traceback object at 0x7f116ab0f3c0>))]
2021-02-16 13:47:50,886 2315    dtest                          DEBUG    | secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_after_index_build - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_remove_node_after_index_build>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab0e500>))]
2021-02-16 13:59:12,626 20989   dtest                          DEBUG    | snapshot_test.py:TestSchemaFileInSnapshot.test_restoring_by_schema_with_mv_use_sstableloader_multiple_tables - Test failed with errors: [(<snapshot_test.TestSchemaFileInSnapshot testMethod=test_restoring_by_schema_with_mv_use_sstableloader_multiple_tables>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab97910>))]
2021-02-16 14:13:00,291 1890    dtest                          DEBUG    | sstableloader_test.py:TestMigration_with_2_1_x.migrate_sstable_with_compact_storage_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_2_1_x testMethod=migrate_sstable_with_compact_storage_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ac84910>))]
2021-02-16 14:24:57,385 32096   dtest                          DEBUG    | sstableloader_test.py:TestMigration_with_3_0_mc_prepared.migrate_sstable_without_compression_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_3_0_mc_prepared testMethod=migrate_sstable_without_compression_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab8ca50>))]
2021-02-16 14:28:56,148 18175   dtest                          DEBUG    | sstableloader_test.py:TestMigration_with_3_0_md_prepared.migrate_sstable_with_range_tombstone_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_3_0_md_prepared testMethod=migrate_sstable_with_range_tombstone_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116abac730>))]
2021-02-16 14:34:15,017 10842   dtest                          DEBUG    | sstableloader_test.py:TestMigration_with_3_0_x_prepared.migrate_sstable_with_wrong_partitioner_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_3_0_x_prepared testMethod=migrate_sstable_with_wrong_partitioner_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab4b050>))]
2021-02-16 15:11:51,915 14222   dtest                          DEBUG    | update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_kill_new_node_while_bootstrapping_test - Test failed with errors: [(<update_cluster_layout_tests.TestUpdateClusterLayout testMethod=simple_kill_new_node_while_bootstrapping_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 status' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116a973f00>))]
2021-02-16 15:37:48,009 9598    dtest                          DEBUG    | wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_large_cell_detector_with_ttl_on_row - Test failed with errors: [(<wide_rows_test.TestWideRows_with_LeveledCompactionStrategy testMethod=test_large_cell_detector_with_ttl_on_row>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab1e050>))]
2021-02-16 15:52:13,261 22285   dtest                          DEBUG    | wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_multiple_rows_with_large_cells_detector - Test failed with errors: [(<wide_rows_test.TestWideRows_with_LeveledCompactionStrategy testMethod=test_multiple_rows_with_large_cells_detector>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f1170357640>))]
2021-02-16 15:58:03,477 6169    dtest                          DEBUG    | wide_rows_test.py:TestWideRows_with_SizeTieredCompactionStrategy.test_large_cell_in_materialized_view - Test failed with errors: [(<wide_rows_test.TestWideRows_with_SizeTieredCompactionStrategy testMethod=test_large_cell_in_materialized_view>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116a9686e0>))]
2021-02-16 16:39:23,616 9355    dtest                          DEBUG    | materialized_views_test.py:TestMaterializedViews.base_replica_repair_test - Test failed with errors: [(<materialized_views_test.TestMaterializedViews testMethod=base_replica_repair_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 repair ks t' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f11686f9cd0>))]
2021-02-16 16:54:31,954 23123   dtest                          DEBUG    | materialized_views_test.py:TestMaterializedViews.drop_mv_during_base_table_writes_test - Test failed with errors: [(<materialized_views_test.TestMaterializedViews testMethod=drop_mv_during_base_table_writes_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f117035c230>))]
2021-02-16 18:15:09,388 26671   dtest                          DEBUG    | bypass_cache_test.py:TestBypassCache.test_alter_table_caching_enable - Test failed with errors: [(<bypass_cache_test.TestBypassCache testMethod=test_alter_table_caching_enable>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116a9b6b90>))]
fruch commented 3 years ago

@bhalevy, oh great so it even made things worse ?

bhalevy commented 3 years ago

@bhalevy, oh great so it even made things worse ?

I'm not sure, it may have exposed an existing issue. Looking at scylla_node.do_stop https://github.com/scylladb/scylla-ccm/blob/bc4dcedd48441e768fce9567950d3de977f01886/ccmlib/scylla_node.py#L692-L726

It looks like we're not stopping scylla_jmx if the node is not considered running. I think that to be on the safe side will we shouldn't do this optimization and attempt to stop/kill all process regardless of node.is_running(), certainly with gently=False.

fruch commented 3 years ago

@bhalevy so 5sec wasn't enough for it to actually dump the threaddump...

fruch commented 3 years ago

@bhalevy so 5sec wasn't enough for it to actually dump the threaddump...

is those happening only on run on monster ? i.e. when all the suite is run in parallel ?

bhalevy commented 3 years ago

is those happening only on run on monster ? i.e. when all the suite is run in parallel ?

The timeouts happen also with no parallelism. See https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release-parallel/33/artifact/logs-release.005/dtest.log that's with `NOSE_PROCESSES=1'

bhalevy commented 3 years ago

The port-in-use issue could be unrelated. If the scylla process dies for any reason we will not stop scylla_jmx.

bhalevy commented 3 years ago

@bhalevy so 5sec wasn't enough for it to actually dump the threaddump...

@fruch I doubt that it's a matter of time. Dumping the stack trace should be instantaneous. Maybe it's hung so hard it doesn't even respond to the signal.

bhalevy commented 3 years ago

Cc @penberg