Open fruch opened 3 years ago
Refs scylladb/scylla#7244
@fruch who's working on this?
@fruch who's working on this?
Currently no one... there are more incidents of that getting stuck ?
Yes, this frequently happens in dtest, see scylladb/scylla#7244 We need to get to this ASAP.
I'll take a closer look (it mainly happen on those sec. indexes tests ? or on more then one test file / class ?)
I'll take a closer look (it mainly happen on those sec. indexes tests ? or on more then one test file / class ?)
Yes, mostly. Some recent examples: https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/dtest-release-parallel/68/testReport/secondary_indexes_test/TestSecondaryIndexes/Run_Dtest_Parallel_Cloud_Machines___FullDtest___split005___test_insert_data_after_recreating_cf/ https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/dtest-release-parallel/68/testReport/secondary_indexes_test/TestSecondaryIndexes/Run_Dtest_Parallel_Cloud_Machines___FullDtest___split004___test_insert_data_after_recreating_ks/ https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/dtest-release-parallel/68/testReport/limits_test/TestLimits/Run_Dtest_Parallel_Cloud_Machines___FullDtest___split060___max_column_value_size_test/
@elcallio seem that other places point to using SIGQUIT, (and not SIGINT), https://access.redhat.com/solutions/18178
I wasn't able to reproduce those issue, but I want to make sure I'm using the correct signal (and capturing the correct logs)
Unfortunately, we still don't see the stacktrace :-( https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/dtest.log
2021-02-16 13:43:55,626 23284 ccm ERROR | node2: nodetool timeout, going to kill scylla-jmx with SIGQUIT
2021-02-16 13:44:00,633 23284 dtest DEBUG | secondary_indexes_test.py:TestSecondaryIndexes.test_insert_data_after_recreating_cf - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_insert_data_after_recreating_cf>, (<class 'subprocess.TimeoutExpired'>, TimeoutExpired(['/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool', '-h', '127.0.60.2', '-p', '7260', 'info'], 60), <traceback object at 0x7f116ab0f3c0>))]
Using config file: /jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-4i__py64/test/node2/conf/scylla.yaml
Connecting to http://127.0.60.2:10000
Starting the JMX server
JMX is enabled to receive remote connections on port: 7260
Interestingly, but likely unrelated, shortly after, in a following test, there's this:
2021-02-16 13:47:50,886 2315 dtest DEBUG | secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_after_index_build - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_remove_node_after_index_build>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab0e500>))]
Using config file: /jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-f_t8avbo/test/node1/conf/scylla.yaml
Error: JMX connector server communication error: service:jmx:rmi://76836c581099:7177
sun.management.AgentConfigurationError: java.rmi.server.ExportException: Port already in use: 7177; nested exception is:
java.net.BindException: Address already in use (Bind failed)
at sun.management.jmxremote.ConnectorBootstrap.exportMBeanServer(ConnectorBootstrap.java:800)
at sun.management.jmxremote.ConnectorBootstrap.startRemoteConnectorServer(ConnectorBootstrap.java:468)
at sun.management.Agent.startAgent(Agent.java:262)
at sun.management.Agent.startAgent(Agent.java:452)
Caused by: java.rmi.server.ExportException: Port already in use: 7177; nested exception is:
java.net.BindException: Address already in use (Bind failed)
at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:346)
at sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:254)
at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:411)
at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:147)
at sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:236)
at sun.management.jmxremote.ConnectorBootstrap$PermanentExporter.exportObject(ConnectorBootstrap.java:199)
at javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:146)
at javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:122)
at javax.management.remote.rmi.RMIConnectorServer.start(RMIConnectorServer.java:404)
at sun.management.jmxremote.ConnectorBootstrap.exportMBeanServer(ConnectorBootstrap.java:796)
... 3 more
Caused by: java.net.BindException: Address already in use (Bind failed)
at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
at java.net.ServerSocket.bind(ServerSocket.java:375)
at java.net.ServerSocket.<init>(ServerSocket.java:237)
at java.net.ServerSocket.<init>(ServerSocket.java:128)
at sun.rmi.transport.proxy.RMIDirectSocketFactory.createServerSocket(RMIDirectSocketFactory.java:45)
at sun.rmi.transport.proxy.RMIMasterSocketFactory.createServerSocket(RMIMasterSocketFactory.java:345)
at sun.rmi.transport.tcp.TCPEndpoint.newServerSocket(TCPEndpoint.java:666)
at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:335)
Note that a different post is in use.
It smells like scylla-jmx might not exit cleanly and is holding on to the api port:
bhalevy@dt tmp$ egrep 'Unable to connect to Scylla API server|TimeoutExpired' dtest-745.log
2021-02-16 10:35:32,995 30817 dtest DEBUG | nodetool_additional_test.py:TestNodetool.scrub_with_one_node_expect_data_loss_test - Test failed with errors: [(<nodetool_additional_test.TestNodetool testMethod=scrub_with_one_node_expect_data_loss_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.55.1 -p 7155 scrub ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab86230>))]
2021-02-16 13:44:00,633 23284 dtest DEBUG | secondary_indexes_test.py:TestSecondaryIndexes.test_insert_data_after_recreating_cf - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_insert_data_after_recreating_cf>, (<class 'subprocess.TimeoutExpired'>, TimeoutExpired(['/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool', '-h', '127.0.60.2', '-p', '7260', 'info'], 60), <traceback object at 0x7f116ab0f3c0>))]
2021-02-16 13:47:50,886 2315 dtest DEBUG | secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_after_index_build - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_remove_node_after_index_build>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab0e500>))]
2021-02-16 13:59:12,626 20989 dtest DEBUG | snapshot_test.py:TestSchemaFileInSnapshot.test_restoring_by_schema_with_mv_use_sstableloader_multiple_tables - Test failed with errors: [(<snapshot_test.TestSchemaFileInSnapshot testMethod=test_restoring_by_schema_with_mv_use_sstableloader_multiple_tables>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab97910>))]
2021-02-16 14:13:00,291 1890 dtest DEBUG | sstableloader_test.py:TestMigration_with_2_1_x.migrate_sstable_with_compact_storage_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_2_1_x testMethod=migrate_sstable_with_compact_storage_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ac84910>))]
2021-02-16 14:24:57,385 32096 dtest DEBUG | sstableloader_test.py:TestMigration_with_3_0_mc_prepared.migrate_sstable_without_compression_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_3_0_mc_prepared testMethod=migrate_sstable_without_compression_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab8ca50>))]
2021-02-16 14:28:56,148 18175 dtest DEBUG | sstableloader_test.py:TestMigration_with_3_0_md_prepared.migrate_sstable_with_range_tombstone_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_3_0_md_prepared testMethod=migrate_sstable_with_range_tombstone_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116abac730>))]
2021-02-16 14:34:15,017 10842 dtest DEBUG | sstableloader_test.py:TestMigration_with_3_0_x_prepared.migrate_sstable_with_wrong_partitioner_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_3_0_x_prepared testMethod=migrate_sstable_with_wrong_partitioner_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab4b050>))]
2021-02-16 15:11:51,915 14222 dtest DEBUG | update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_kill_new_node_while_bootstrapping_test - Test failed with errors: [(<update_cluster_layout_tests.TestUpdateClusterLayout testMethod=simple_kill_new_node_while_bootstrapping_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 status' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116a973f00>))]
2021-02-16 15:37:48,009 9598 dtest DEBUG | wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_large_cell_detector_with_ttl_on_row - Test failed with errors: [(<wide_rows_test.TestWideRows_with_LeveledCompactionStrategy testMethod=test_large_cell_detector_with_ttl_on_row>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab1e050>))]
2021-02-16 15:52:13,261 22285 dtest DEBUG | wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_multiple_rows_with_large_cells_detector - Test failed with errors: [(<wide_rows_test.TestWideRows_with_LeveledCompactionStrategy testMethod=test_multiple_rows_with_large_cells_detector>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f1170357640>))]
2021-02-16 15:58:03,477 6169 dtest DEBUG | wide_rows_test.py:TestWideRows_with_SizeTieredCompactionStrategy.test_large_cell_in_materialized_view - Test failed with errors: [(<wide_rows_test.TestWideRows_with_SizeTieredCompactionStrategy testMethod=test_large_cell_in_materialized_view>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116a9686e0>))]
2021-02-16 16:39:23,616 9355 dtest DEBUG | materialized_views_test.py:TestMaterializedViews.base_replica_repair_test - Test failed with errors: [(<materialized_views_test.TestMaterializedViews testMethod=base_replica_repair_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 repair ks t' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f11686f9cd0>))]
2021-02-16 16:54:31,954 23123 dtest DEBUG | materialized_views_test.py:TestMaterializedViews.drop_mv_during_base_table_writes_test - Test failed with errors: [(<materialized_views_test.TestMaterializedViews testMethod=drop_mv_during_base_table_writes_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f117035c230>))]
2021-02-16 18:15:09,388 26671 dtest DEBUG | bypass_cache_test.py:TestBypassCache.test_alter_table_caching_enable - Test failed with errors: [(<bypass_cache_test.TestBypassCache testMethod=test_alter_table_caching_enable>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116a9b6b90>))]
@bhalevy, oh great so it even made things worse ?
@bhalevy, oh great so it even made things worse ?
I'm not sure, it may have exposed an existing issue.
Looking at scylla_node.do_stop
https://github.com/scylladb/scylla-ccm/blob/bc4dcedd48441e768fce9567950d3de977f01886/ccmlib/scylla_node.py#L692-L726
It looks like we're not stopping scylla_jmx if the node is not considered running.
I think that to be on the safe side will we shouldn't do this optimization and attempt to stop/kill all process regardless of node.is_running()
, certainly with gently=False
.
@bhalevy so 5sec wasn't enough for it to actually dump the threaddump...
@bhalevy so 5sec wasn't enough for it to actually dump the threaddump...
is those happening only on run on monster
? i.e. when all the suite is run in parallel ?
is those happening only on run on
monster
? i.e. when all the suite is run in parallel ?
The timeouts happen also with no parallelism. See https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release-parallel/33/artifact/logs-release.005/dtest.log that's with `NOSE_PROCESSES=1'
The port-in-use issue could be unrelated. If the scylla process dies for any reason we will not stop scylla_jmx.
@bhalevy so 5sec wasn't enough for it to actually dump the threaddump...
@fruch I doubt that it's a matter of time. Dumping the stack trace should be instantaneous. Maybe it's hung so hard it doesn't even respond to the signal.
Cc @penberg
We had few reports across the broad that nodetool commands are getting stuck.
a suggestion was raise by @elcallio to try catch and collect enough information when those things happen: https://github.com/scylladb/scylla/issues/7991#issuecomment-771468592