@dotnwat I haven't dug into the logs, but is this ever expected under normal circumstances?
While this is "only" a BadLogLines, it's surprising to me that this test is hitting any encoding hiccups at all, as it's not overtly doing anything with mixed versions or upgrades.
Right, this can still happen with no upgrades involved. Transports start off assuming old nodes and upgrade immediately after a round trip or two. There were a few cases I fixed last week where an RPC error reply was encoded with the wrong transport version (the old version) due to an unexamined error handling path.
This is probably similar, but because the issue here is the reply being upgraded, not downgraded, it is more likely a problem with the type system tricks we are using to select between adl / serde. At least that's my hunch at this point.
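To make the negotiation described above easier to follow, here is a minimal sketch (assumed structure and names, not Redpanda's actual transport code) of the idea: the client starts out encoding for an old peer and upgrades once a round trip proves the peer speaks the newer version, so any encode path that ignores the negotiated state can stamp a message with the wrong version.

```cpp
// Hypothetical sketch of transport version negotiation -- not the real
// rpc::transport. Field and function names are invented for illustration.
#include <cstdio>

enum class transport_version { v1, v2 };

struct transport_state {
    // Start off assuming the peer is an old node.
    transport_version negotiated = transport_version::v1;
};

// Called when a well-formed reply arrives: after the first round trip or two
// the transport upgrades to the version the peer demonstrably understands.
void on_reply_received(transport_state& t, transport_version reply_version) {
    if (reply_version > t.negotiated) {
        t.negotiated = reply_version;
    }
}

// The class of bug described above: an error-handling path that encodes a
// reply without consulting the negotiated state, so it always emits v1.
transport_version encode_error_reply_buggy() {
    return transport_version::v1;
}

int main() {
    transport_state t;
    on_reply_received(t, transport_version::v2); // peer has upgraded to v2
    auto v = encode_error_reply_buggy();         // but this path still emits v1
    std::printf("negotiated=%d, error reply encoded as=%d\n",
                static_cast<int>(t.negotiated), static_cast<int>(v));
}
```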
Another one: https://buildkite.com/redpanda/redpanda/builds/13173#01823f39-c0bb-401e-b94f-e4832ec8627d
test_id: rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=True.recovery=no_recovery
status: FAIL
run time: 2 minutes 33.439 seconds
<BadLogLines nodes=docker-rp-14(1) example="ERROR 2022-07-27 11:46:46,883 [shard 1] storage - cannot continue parsing. recived size:403 bytes, expected:16314 bytes. context:parser::consume_records">
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/root/tests/rptest/services/cluster.py", line 48, in wrapped
self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
File "/root/tests/rptest/services/redpanda.py", line 1124, in raise_on_bad_logs
raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines:
Another BadLogLines example:
https://buildkite.com/redpanda/redpanda/builds/13171#01823ef7-8a0f-4c14-b1e5-942de9800afe
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-14(1) example="ERROR 2022-07-27 10:22:51,731 [shard 1] r/heartbeat - heartbeat_manager.cc:284 - cannot find consensus group:20">
I'll follow John's comment above with another r/heartbeat - heartbeat_manager.cc:284 - cannot find consensus group:5 error in this test here.
I haven't dug into these two buildkite links that you left @ajfabbri and @jcsp but the badloglines don't show the protocol violation. are we just recording failures of the test regardless of the way they present?
In general I like to have different failure modes in different tickets, but in the case of a brand new test that's failing several ways, I think it makes sense to bundle them up here.
Regarding the cannot find consensus group error logs, I looked into it a bit and I think it's worth reducing severity (at least in some cases). See https://github.com/redpanda-data/redpanda/pull/5742 for more context.
@ztlpn does this one look interesting? I'm not sure what all the partition interruption work is doing:
<BadLogLines nodes=docker-rp-14(1) example="ERROR 2022-07-27 11:46:46,883 [shard 1] storage - cannot continue parsing. recived size:403 bytes, expected:16314 bytes. context:parser::consume_records">
Note that the Protocol violation didn't occur in any of the extra reported cases in this issue.
The scenarios under which the protocol violation might occur are becoming more exotic (corruption, mismatched correlation id). It's as if the response came from a previous instance of the transport (i.e. from before a reconnect/transport reset) and the correlation ids happened to match.
Looked into this again, and it continues to be a mystery.
The leading contender for a root cause is a bug in the reconnect_transport reset logic, where a reply to a request from a previous "generation" of the reset reconnect_transport ends up being handled mistakenly.
Assuming such a bug does exist, it being the root cause of this issue would imply a pretty serious bug, because a correlation id would be matching incorrectly. This would be far more likely for two back-to-back reconnects where the correlation id is small: it's reset to 0, so we'd look for scenarios where the correlation id hadn't had time to become large, thus increasing the likelihood of a collision after a reset.
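To make the hypothesized collision concrete, here is a minimal sketch (assumed structure, not the real reconnect_transport) of how resetting the correlation counter to 0 lets a late reply from the previous "generation" get matched to a brand-new request:

```cpp
// Hypothetical sketch of correlation-id matching across a transport reset --
// the types and names here are invented for illustration only.
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>

struct client {
    uint32_t next_corr = 0;
    std::unordered_map<uint32_t, std::string> in_flight; // corr id -> rpc name

    uint32_t send(std::string rpc_name) {
        auto id = next_corr++;
        in_flight.emplace(id, std::move(rpc_name));
        return id;
    }

    // Reconnect: drop in-flight state and restart the counter at 0.
    void reset() {
        in_flight.clear();
        next_corr = 0; // <-- this is what makes a collision plausible
    }

    void on_reply(uint32_t corr) {
        if (auto it = in_flight.find(corr); it != in_flight.end()) {
            std::printf("matched corr %u to %s\n", corr, it->second.c_str());
            in_flight.erase(it);
        }
    }
};

int main() {
    client c;
    auto stale = c.send("append_entries"); // corr 0, sent before the reset
    c.reset();                             // connection bounces
    c.send("heartbeat");                   // corr 0 is handed out again
    c.on_reply(stale);                     // late reply for the *old* request
    // Prints "matched corr 0 to heartbeat": the stale reply is treated as the
    // answer to the new request, which is exactly the suspected failure mode.
}
```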
The other potential cause, though I haven't seen any evidence to support it, is some request/reply that bypasses the normal request processing code paths. The response in question is append_entries_reply. Any thoughts on whether there might be a special-case rpc scenario, @mmaslankaprv?
Another failure observed here: https://buildkite.com/redpanda/redpanda/builds/13600#01826a4d-da7d-49a9-b547-18136d155717
an interesting bit is here
INFO 2022-08-04 20:23:48,024 [shard 1] rpc - parse_utils.h:59 - rpc header missmatching checksums. expected:20480, got:1044318496 - {version:0, header_checksum:20480, compression:0, payload_size:786432, meta:131072, correlation_id:2097152, payload_checksum:0}
we are reading a header off the wire and seeing a checksum mismatch. It also has the header's version as 0, which is peculiar. A few lines later we see what looks to be the same fiber (though, gosh, these logs could be improved) stating that the header arrived from the server (the same parse_utils.h:59 rpc header parsing is used on client and server...)
DEBUG 2022-08-04 20:23:48,025 [shard 1] rpc - transport.cc:233 - could not parse header from server: {host: docker-rp-19, port: 33145}
TRACE 2022-08-04 20:23:47,958 [shard 1] raft - [follower: {id: {3}, revision: {32}}] [group_id:3, {kafka/topic-umqeaiwvmz/2}] - recovery_stm.cc:221 - Reading batches, starting from: 23721
TRACE 2022-08-04 20:23:47,958 [shard 1] storage - readers_cache.cc:101 - {kafka/topic-umqeaiwvmz/2} - trying to get reader for: {start_offset:{23721}, max_offset:{9223372036854775807}, min_bytes:1, max_bytes:524288, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2022-08-04 20:23:47,958 [shard 1] storage - readers_cache.cc:130 - {kafka/topic-umqeaiwvmz/2} - reader cache miss for: {start_offset:{23721}, max_offset:{9223372036854775807}, min_bytes:1, max_bytes:524288, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2022-08-04 20:23:47,982 [shard 1] r/heartbeat - heartbeat_manager.cc:262 - Heartbeat timeout, node: 3
ERROR 2022-08-04 20:23:48,023 [shard 1] rpc - transport.h:180 - Protocol violation: request version rpc::transport_version::v1 incompatible with reply version rpc::transport_version::v2 reply type raft::append_entries_reply
TRACE 2022-08-04 20:23:48,023 [shard 1] kafka - requests.cc:97 - [172.16.16.7:38182] processing name:fetch, key:1, version:11 for consumer-test_group-1, mem_units: 8216
TRACE 2022-08-04 20:23:48,024 [shard 1] kafka - fetch.cc:549 - handling fetch request: {replica_id=-1 max_wait_ms=500 min_bytes=1 max_bytes=52428800 isolation_level=read_uncommitted session_id=1073741824 session_epoch=223 topics={} forgotten={{name={topic-umqeaiwvmz} forgotten_partition_indexes={14, 16, 6, 2, 19}}} rack_id=}
INFO 2022-08-04 20:23:48,024 [shard 1] rpc - parse_utils.h:59 - rpc header missmatching checksums. expected:20480, got:1044318496 - {version:0, header_checksum:20480, compression:0, payload_size:786432, meta:131072, correlation_id:2097152, payload_checksum:0}
TRACE 2022-08-04 20:23:48,024 [shard 1] storage - readers_cache.cc:101 - {kafka/topic-umqeaiwvmz/9} - trying to get reader for: {start_offset:{31900}, max_offset:{32177}, min_bytes:0, max_bytes:1048576, type_filter:batch_type::raft_data, first_timestamp:nullopt}
TRACE 2022-08-04 20:23:48,025 [shard 1] storage - readers_cache.cc:130 - {kafka/topic-umqeaiwvmz/9} - reader cache miss for: {start_offset:{31900}, max_offset:{32177}, min_bytes:0, max_bytes:1048576, type_filter:batch_type::raft_data, first_timestamp:nullopt}
DEBUG 2022-08-04 20:23:48,025 [shard 1] rpc - transport.cc:233 - could not parse header from server: {host: docker-rp-19, port: 33145}
the full sequence is probably that this happens first for some reason:
ERROR 2022-08-04 20:23:48,023 [shard 1] rpc - transport.h:180 - Protocol violation: request version rpc::transport_version::v1 incompatible with reply version rpc::transport_version::v2 reply type raft::append_entries_reply
which signals the body via sctx->signal_body_parse();. That signal then causes the next header on the wire to be read, which results in the header checksum errors that appear right after the protocol violation in the log excerpt above.
so... is there really a protocol violation? or... does the wire have junk?
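For context on those checksum errors, here is a rough sketch of the header validation step (field names are taken from the log line above; the layout and checksum algorithm are purely illustrative, not the real parse_utils code). If the parser is misaligned on the byte stream, arbitrary bytes get interpreted as a header and the comparison fails with nonsense field values, which is what the log above looks like:

```cpp
// Hypothetical sketch of rpc header checksum validation -- layout and
// checksum algorithm are illustrative only, not Redpanda's actual format.
#include <cstdint>
#include <cstdio>

struct rpc_header {
    uint8_t  version;
    uint32_t header_checksum;
    uint8_t  compression;
    uint32_t payload_size;
    uint32_t meta;
    uint32_t correlation_id;
    uint64_t payload_checksum;
};

// Toy checksum over every field except the checksum itself (FNV-1a style,
// purely for illustration).
uint32_t checksum_of(const rpc_header& h) {
    uint32_t c = 2166136261u;
    auto mix = [&](uint64_t v) {
        for (int i = 0; i < 8; ++i) { c ^= uint8_t(v >> (i * 8)); c *= 16777619u; }
    };
    mix(h.version); mix(h.compression); mix(h.payload_size);
    mix(h.meta); mix(h.correlation_id); mix(h.payload_checksum);
    return c;
}

bool parse_header(const rpc_header& h) {
    auto got = checksum_of(h);
    if (h.header_checksum != got) {
        // Analogous to the "rpc header missmatching checksums" INFO line above;
        // the caller then reports "could not parse header" and gives up.
        std::printf("header checksum mismatch. expected:%u, got:%u\n",
                    h.header_checksum, got);
        return false;
    }
    return true;
}

int main() {
    // A misaligned read: whatever bytes happen to sit at the cursor get
    // reinterpreted as a header, so the fields look like garbage.
    rpc_header garbage{};
    garbage.header_checksum = 20480;
    parse_header(garbage);
}
```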
It looks like, on the server, a second or so before the trace above, the client sent a bad message. But there are others of these from other nodes and it could be unrelated. Maybe the client just disconnected?
DEBUG 2022-08-04 20:23:47,204 [shard 1] rpc - could not parse header from client: 172.16.16.24:54811
Another test triggered a violation:
FAIL test: PartitionBalancerTest.test_unavailable_nodes (1/67 runs)
failure at 2022-08-07T07:25:17.039Z: <BadLogLines nodes=docker-rp-20(1) example="ERROR 2022-08-07 06:09:28,345 [shard 0] rpc - transport.h:179 - Protocol violation: request version rpc::transport_version::v1 incompatible with reply version rpc::transport_version::v2 reply type raft::heartbeat_reply">
in job https://buildkite.com/redpanda/redpanda/builds/13738#018276bf-fdd9-45e5-92f4-f783a59bda04
@dotnwat found this issue by searching "could not parse header"
I am seeing some strange rpc behavior as well in the following test run: https://github.com/redpanda-data/redpanda/issues/5471#issuecomment-1209505474
Basically there is a problem with completing rpc requests from docker-rp-13
to docker-rp-12
in the interval 03:07:35-03:07:45. What is strange is that both nodes are online, and the problem manifests only in one direction! I.e. requests from docker-rp-12
to docker-rp-13
go through just fine.
Any ideas how this could happen?
@ztlpn
Yesterday we changed the handling of the protocol violation so that it is handled similarly to something like a crc check failure -- i.e. close down the connection and rely on the higher level to reconnect.
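Roughly, that change treats the two failure classes the same way at the framing layer. A minimal sketch of the shape (names are assumptions, not the actual patch):

```cpp
// Hypothetical sketch of the handling change described above -- names and
// structure are invented; the point is only the design choice: tear the
// connection down instead of trying to resynchronize on the same stream.
#include <cstdio>

enum class frame_error { none, checksum_mismatch, version_mismatch };

struct connection {
    bool open = true;
    void shutdown() {
        open = false;
        std::puts("connection closed; higher-level reconnect logic takes over");
    }
};

void handle_frame_error(connection& c, frame_error e) {
    switch (e) {
    case frame_error::none:
        break;
    case frame_error::checksum_mismatch:
    case frame_error::version_mismatch: // now handled like a crc failure
        c.shutdown();
        break;
    }
}

int main() {
    connection c;
    handle_frame_error(c, frame_error::version_mismatch);
}
```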
I will have a closer look. It is very odd that it only happens in one direction!
Started a new round of debug runs: https://github.com/redpanda-data/redpanda/pull/5931. Still unable to reproduce locally.
Note here for myself
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-2(2) example="ERROR 2022-08-12 04:38:07,862 [shard 1] rpc - transport.h:192 - Protocol violation: request version rpc::transport_version::v1 incompatible with reply version rpc::transport_version::v2 status rpc::status::success req_g 33 rep_g 33 req_corr 2 curr_corr 7 count 5470 ctx_count 5470 reply type raft::append_entries_reply">
Extra context in the logs strongly suggests that this isn't a concurrency bug, as it looks like no transport resets are occurring between the request being sent and the response being handled.
Moving on now to try and log all requests/responses and see if we can match them on the server side and client side.
It is still curious that the correlation IDs when this happens are small, like 2-4. This means that the request was sent out very shortly after a transport reset.
@dotnwat what is the current status of this issue?
In the last 30 days this test failed several times, but all with Unexpected files in data directory, which seems to be something common. Maybe a clean-up issue:
[nwatkins@fedora pandaresults]$ cat log.txt | grep "Unexpected files in data di" | wc -l
229
log
[nwatkins@fedora pandaresults]$ env/bin/python3 main.py dev 30d 2>&1 | tee log.txt
[nwatkins@fedora pandaresults]$ cat log.txt | grep -A1 test_cancelling_partition_move_x_core
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=1.unclean_abort=False.recovery=no_recovery.compacted=True (1/119 runs)
failure at 2022-10-11T07:31:23.471Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=True.recovery=no_recovery.compacted=False (1/119 runs)
failure at 2022-10-11T07:31:23.471Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=True.recovery=no_recovery.compacted=True (1/119 runs)
failure at 2022-10-11T07:31:23.471Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=1.unclean_abort=False.recovery=no_recovery (2/458 runs)
failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=1.unclean_abort=False.recovery=restart_recovery (2/458 runs)
failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=1.unclean_abort=True.recovery=no_recovery (2/458 runs)
failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=False.recovery=no_recovery (2/458 runs)
failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=1.unclean_abort=True.recovery=restart_recovery (2/458 runs)
failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=True.recovery=no_recovery (2/458 runs)
failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=False.recovery=restart_recovery (2/458 runs)
failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=True.recovery=restart_recovery (2/458 runs)
failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')
https://github.com/redpanda-data/redpanda/pull/6782 removed ok_to_fail for this test (https://github.com/redpanda-data/redpanda/pull/6782/files)
On August 13 we merged a change that stopped resetting the correlation ID in the RPC transport, because we were seeing unexplainable failures in which it appeared a race was occurring and the wrong response was being matched up with a request.
commit d4e122930e4170d7035162db7121ca2f12d26b29 (origin/rpc-skip-corr-reset)
Author: Noah Watkins <noahwatkins@gmail.com>
Date: Sat Aug 13 09:57:25 2022 -0700
rpc: retain correlation idx across transport resets
The rpc client transport works as an unordered stream of requests and
responses. In order to match a response to a request each request
contains a unique correlation id which the server includes the header of
a response. The client tracks inflight requests by correlation id and
matches the response using an index.
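A minimal sketch (assumed structure, not the actual diff) of what retaining the correlation index across resets buys: ids issued before and after a reset can no longer collide, so a late reply from a previous generation simply fails to match anything.

```cpp
// Hypothetical sketch of the mitigation in the commit above -- the counter
// is kept monotonically increasing across transport resets. Names and
// structure are invented for illustration.
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>

struct client {
    uint32_t next_corr = 0;
    std::unordered_map<uint32_t, std::string> in_flight;

    uint32_t send(std::string rpc_name) {
        auto id = next_corr++;
        in_flight.emplace(id, std::move(rpc_name));
        return id;
    }

    void reset() {
        in_flight.clear();
        // next_corr is intentionally NOT reset to 0 here.
    }

    bool on_reply(uint32_t corr) { return in_flight.erase(corr) > 0; }
};

int main() {
    client c;
    auto stale = c.send("append_entries"); // corr 0, from before the reset
    c.reset();
    c.send("heartbeat");                   // corr 1, never 0 again
    std::printf("stale reply matched a live request: %s\n",
                c.on_reply(stale) ? "yes" : "no"); // prints "no"
}
```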
The last report of this issue was August 19 (for a PR that had been up for weeks prior to the 19th), so it could presumably have come from a CI run much closer in time to the 13th.
This was also reported once on August 13 around the same time that the correlation ID fix went in.
@jcsp I think this can be closed based on the previous comment's analysis. wdyt?
I concur.
FAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=True.recovery=restart_recovery (1/84 runs) failure at 2022-07-23T08:22:18.320Z:
in job https://buildkite.com/redpanda/redpanda/builds/12971#018229ae-5aed-4afd-9a59-3d4331bfdc09
While this is "only" a BadLogLines, it's surprising to me that this test is hitting any encoding hiccups at all, as it's not overtly doing anything with mixed versions or upgrades.