redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com
9.41k stars 577 forks source link

CI Failure (key symptom) in `PartitionMoveInterruption.test_cancelling_partition_move` #20758

Open vbotbuildovich opened 2 months ago

vbotbuildovich commented 2 months ago

https://buildkite.com/redpanda/redpanda/builds/50894

Module: rptest.tests.partition_move_interruption_test
Class: PartitionMoveInterruption
Method: test_cancelling_partition_move
Arguments: {
    "recovery": "restart_recovery",
    "compacted": false,
    "unclean_abort": true,
    "replication_factor": 1
}

test_id:    PartitionMoveInterruption.test_cancelling_partition_move
status:     FAIL
run time:   16.925 seconds

ConnectionError(MaxRetryError("HTTPConnectionPool(host='docker-rp-12', port=9644): Max retries exceeded with url: /v1/security/users?redirect=1 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4fcd860880>: Failed to establish a new connection: [Errno 111] Connection refused'))"))
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/connection.py", line 159, in _new_conn
    conn = connection.create_connection(
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.10/http/client.py", line 1283, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/connection.py", line 187, in connect
    conn = self._new_conn()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/connection.py", line 171, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f4fcd860880>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/util/retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='docker-rp-12', port=9644): Max retries exceeded with url: /v1/security/users?redirect=1 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4fcd860880>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 105, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_move_interruption_test.py", line 158, in test_cancelling_partition_move
    self.start_redpanda(num_nodes=5,
  File "/root/tests/rptest/tests/end_to_end.py", line 129, in start_redpanda
    self.redpanda.start(nodes=started_nodes,
  File "/root/tests/rptest/services/redpanda.py", line 2731, in start
    self._admin.create_user(*self._superuser)
  File "/root/tests/rptest/services/admin.py", line 1128, in create_user
    self._request("POST",
  File "/root/tests/rptest/services/admin.py", line 640, in _request
    r = self._session.request(verb, url, **kwargs)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='docker-rp-12', port=9644): Max retries exceeded with url: /v1/security/users?redirect=1 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4fcd860880>: Failed to establish a new connection: [Errno 111] Connection refused'))
Test requested 7 nodes, used only 5```

JIRA Link: [CORE-5034](https://redpandadata.atlassian.net/browse/CORE-5034)
dotnwat commented 2 months ago

node crashed

ERROR 2024-06-28 20:41:19,249 [shard 0:admi] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-02363aa4813e8de7f-1/redpanda/redpanda/src/v/raft/mux_state_machine.h:344) 'it != _results.end()' last applied offset 33 is greater than returned replicate result 33. this must imply existence of a result in results map
ERROR 2024-06-28 20:41:19,249 [shard 0:admi] assert - Backtrace below:
0xb1b26 /opt/redpanda_installs/ci/lib/libseastar.so+0x221563b /opt/redpanda_installs/ci/lib/libseastar.so+0x2217128 /opt/redpanda_installs/ci/lib/libseastar.so+0x2217ac0 /opt/redpanda_installs/ci/lib/libv_v_cluster.so+0xca498f8 /opt/redpanda_installs/ci/lib/libv_v_cluster.so+0xca4860a /opt/redpanda_installs/ci/lib/libv_v_cluster.so+0xca48217 /opt/redpanda_installs/ci/lib/libv_v_application.so+0x43574dc /opt/redpanda_installs/ci/lib/libv_v_application.so+0x4357053 /opt/redpanda_installs/ci/lib/libv_v_application.so+0x4356e06 /opt/redpanda_installs/ci/lib/libseastar.so+0x1ac2792 /opt/redpanda_installs/ci/lib/libseastar.so+0x1ac8d36 /opt/redpanda_installs/ci/lib/libseastar.so+0x1acb8b5 /opt/redpanda_installs/ci/lib/libseastar.so+0x1ac9d47 /opt/redpanda_installs/ci/lib/libseastar.so+0x185ca65 /opt/redpanda_installs/ci/lib/libseastar.so+0x185adff /opt/redpanda_installs/ci/lib/libv_v_application.so+0x4a6eb8a 0x143a67 /opt/redpanda_installs/ci/lib/libc.so.6+0x2a087 /opt/redpanda_installs/ci/lib/libc.so.6+0x2a14a 0x6ca84
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>, seastar::futurize<seastar::future<void>>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>>(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&, seastar::future_state<seastar::internal::monostate>&&), void>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>, seastar::futurize<seastar::future<void>>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>>(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&, seastar::future_state<seastar::internal::monostate>&&), void>
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
mmaslankaprv commented 2 months ago

The assertion was triggered because the same base offset was assigned twice to different batches:

TRACE 2024-06-28 20:41:19,236 [shard 0:admi] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:168 - Leader append result: {time_since_append: 20, base_offset: 33, last_offset: 33, last_term: 1, byte_size: 297}

TRACE 2024-06-28 20:41:19,242 [shard 0:main] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:168 - Leader append result: {time_since_append: 3, base_offset: 33, last_offset: 33, last_term: 1, byte_size: 75}
mmaslankaprv commented 2 months ago

The problem is caused by a race condition i a storage layer

abhijat commented 1 week ago

Seen this same assert failure in https://buildkite.com/redpanda/redpanda/builds/53808 several times

docker-rp-14/redpanda.log:TRACE 2024-08-30 10:59:34,874 [shard 0:main] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:168 - Leader append result: {time_since_append: 1, base_offset: 24, last_offset: 24, last_term: 1, byte_size: 107}
docker-rp-14/redpanda.log:TRACE 2024-08-30 10:59:42,874 [shard 0:main] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:168 - Leader append result: {time_since_append: 0, base_offset: 24, last_offset: 24, last_term: 1, byte_size: 107}

In the above build a couple of tests also fail with timeout during alter topic config, but looking at the logs there the same assert failure has crashed the broker, although the timeout occurred a couple of seconds before the crash, so maybe it is a separate issue (still looking into that)

WARN  2024-08-30 10:59:41,498 [shard 0:main] kafka - config_utils.h:286 - Failed to alter topic properties of topic(s) {{kafka/panda-topic}} error_code observed: cluster::errc::timeout

followed by:

ERROR 2024-08-30 10:59:42,876 [shard 0:main] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0c0dcb9e76459e3f4-1/redpanda/redpanda/src/v/raft/mux_state_machine.h:344) 'it != _results.end()' last applied offset 24 is greater than returned replicate result 24. this must imply existence of a result in results map