redpanda-data / redpanda

CI Failure (BadLogLines missing_node_rpc_client) in `RandomNodeOperationsTest.test_node_operations` #10112

Closed NyaliaLui closed 1 year ago

NyaliaLui commented 1 year ago

https://buildkite.com/redpanda/redpanda/builds/27206#0187809b-cc6a-48fa-b257-9dde43c9df11

Module: rptest.tests.random_node_operations_test
Class:  RandomNodeOperationsTest
Method: test_node_operations
Arguments:
{
  "enable_controller_snapshots": true,
  "enable_failures": true,
  "num_to_upgrade": 0
}
test_id:    rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.num_to_upgrade=0.enable_controller_snapshots=True
status:     FAIL
run time:   7 minutes 11.924 seconds

    <BadLogLines nodes=docker-rp-23(1) example="ERROR 2023-04-14 17:16:38,506 [shard 0] admin_api_server - admin_server.cc:493 - [_anonymous] exception intercepted - url: [http://docker-rp-23:9644/v1/cluster_config] http_return_status[500] reason - seastar::httpd::server_error_exception (Unexpected error: rpc::errc::missing_node_rpc_client)">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/root/tests/rptest/services/cluster.py", line 87, in wrapped
    redpanda.raise_on_bad_logs(
  File "/root/tests/rptest/services/redpanda.py", line 1877, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-23(1) example="ERROR 2023-04-14 17:16:38,506 [shard 0] admin_api_server - admin_server.cc:493 - [_anonymous] exception intercepted - url: [http://docker-rp-23:9644/v1/cluster_config] http_return_status[500] reason - seastar::httpd::server_error_exception (Unexpected error: rpc::errc::missing_node_rpc_client)">

dotnwat commented 1 year ago

https://buildkite.com/redpanda/redpanda/builds/27402#01879577-8a80-46c6-a2eb-75c3a67e35e3

dlex commented 1 year ago

on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/29289#01882851-e706-42d6-8c5d-f50a80a86b6c

andijcr commented 1 year ago

https://buildkite.com/redpanda/redpanda/builds/30951#01889fbf-4d7a-45f2-a4b7-5bd45d8d6aa6

michael-redpanda commented 1 year ago

https://buildkite.com/redpanda/redpanda/builds/31015#0188a407-f5b4-457f-92a9-b18f09e74680

michael-redpanda commented 1 year ago

https://buildkite.com/redpanda/redpanda/builds/31233#0188b78c-c490-4d04-8114-b49dcc1db720

ztlpn commented 1 year ago

This is a shutdown issue. A node was stopped shortly before it was asked to decommission itself, and the request failed:

[INFO  - 2023-06-14 02:02:11,738 - failure_injector - inject_failure - lineno:79]: injecting failure: type: 1, length: 0 seconds, node: docker-rp-2
[INFO  - 2023-06-14 02:02:11,739 - failure_injector - _terminate - lineno:209]: terminating redpanda on docker-rp-2
...
[INFO  - 2023-06-14 02:02:11,786 - node_operations - decommission - lineno:231]: executor - decommissioning node 1 (idx: 2)
[DEBUG - 2023-06-14 02:02:11,789 - admin - _request - lineno:332]: Dispatching put http://docker-rp-2:9644/v1/brokers/1/decommission
[WARNING - 2023-06-14 02:02:11,796 - admin - _request - lineno:350]: Response 500: {"message": "Unexpected error: rpc::errc::missing_node_rpc_client", "code": 500}
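
To make "shortly before" concrete, here is a small Python sketch that computes the gaps between the three relevant timestamps copied from the excerpt above. The variable names are only illustrative and are not part of rptest.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the test log lines above (comma before milliseconds).
FMT = "%Y-%m-%d %H:%M:%S,%f"

# Values copied verbatim from the log excerpt.
terminated = datetime.strptime("2023-06-14 02:02:11,739", FMT)    # _terminate on docker-rp-2
decommission = datetime.strptime("2023-06-14 02:02:11,786", FMT)  # decommission requested
response = datetime.strptime("2023-06-14 02:02:11,796", FMT)      # 500 response logged

ONE_MS = timedelta(milliseconds=1)

# Integer millisecond gaps, avoiding floating-point rounding.
terminate_to_decommission_ms = (decommission - terminated) // ONE_MS
terminate_to_response_ms = (response - terminated) // ONE_MS

print(terminate_to_decommission_ms, terminate_to_response_ms)
```

The decommission request was issued 47 ms after the terminate call was logged, and the 500 response came back 57 ms after it, so the node was most likely still in the middle of shutting down when it handled the request.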

Looks like https://github.com/redpanda-data/redpanda/pull/8847 might have caused it?

In any case, this is low severity, and the fix is to return some kind of "shutting down" error code instead of the more disturbing rpc::errc::missing_node_rpc_client.
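
A minimal Python sketch of the suggested behaviour follows. It only illustrates the idea and is not Redpanda's C++ admin server code: `RpcError`, `to_http_response`, and the `shutting_down` flag are hypothetical names, and using 503 as the "shutting down" status is an assumption, since the comment above does not name a specific code.

```python
from enum import Enum
from http import HTTPStatus


class RpcError(Enum):
    # Hypothetical stand-ins for internal error codes; only
    # missing_node_rpc_client actually appears in the logs above.
    SUCCESS = "success"
    MISSING_NODE_RPC_CLIENT = "missing_node_rpc_client"


def to_http_response(err: RpcError, shutting_down: bool) -> tuple[int, str]:
    """Map an internal error to an (HTTP status, message) pair.

    Assumption: while the node is shutting down, any failure is reported
    as 503 with a clear message instead of a generic 500.
    """
    if err is RpcError.SUCCESS:
        return HTTPStatus.OK.value, "ok"
    if shutting_down:
        return HTTPStatus.SERVICE_UNAVAILABLE.value, "node is shutting down"
    return (
        HTTPStatus.INTERNAL_SERVER_ERROR.value,
        f"Unexpected error: rpc::errc::{err.value}",
    )


# The failing case from this issue, with and without shutdown awareness.
during_shutdown = to_http_response(RpcError.MISSING_NODE_RPC_CLIENT, shutting_down=True)
normal_operation = to_http_response(RpcError.MISSING_NODE_RPC_CLIENT, shutting_down=False)
print(during_shutdown)
print(normal_operation)
```

With a mapping like this, a request that races with shutdown would no longer be reported as an unexpected 500, so there would be no reason to log it at ERROR level, which is what trips the BadLogLines check.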

travisdowns commented 1 year ago

Another instance:

FAIL test: RandomNodeOperationsTest.test_node_operations.enable_failures=True.num_to_upgrade=0.enable_controller_snapshots=True (1/31 runs) failure at 2023-06-19T14:48:04.475Z: on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31598#0188d3e8-3679-4d70-ad38-f7d478ba1ddb