Scarjit opened 1 year ago
The first crash was caused by:
ERROR 2023-08-17 09:37:18,539 [shard 0] assert - jM �U:135912895 @{} - [{}:{}] sending {}:{} for {}, response {}: failed to log message: fmt='Assert failure: ({}:{}) '{}' session mismatch: {}': fmt::v8::format_error (argument not found)
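The trailing `fmt::v8::format_error (argument not found)` means the assert's own log message failed to format: the format string references placeholders for which no (valid) argument was available, possibly because the arguments themselves were already corrupted. A rough Python analogue of the same failure mode, using `str.format` as a stand-in for the fmt library (purely illustrative, not Redpanda code):

```python
# The vassert message template from the log, with four placeholders.
fmt = "Assert failure: ({}:{}) '{}' session mismatch: {}"

try:
    # Supplying fewer arguments than placeholders mirrors fmt's
    # format_error ("argument not found") while logging the assert.
    msg = fmt.format("fetch_session.cc", 135)
except IndexError as e:
    print(f"failed to log message: {e}")
```

This is why the ERROR line above is doubly garbled: the assertion fired, and then the attempt to render its message failed as well.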
We will look into this and get back. Thanks for reporting, @Scarjit
This issue hasn't seen activity in 3 months. If you want to keep it open, post a comment or remove the stale
label – otherwise this will be closed in two weeks.
Probably this assertion
[nwatkins@fedora redpanda]$ git grep "session mismatch"
src/v/kafka/client/fetch_session.cc: vassert(res.data.session_id == _id, "session mismatch: {}", *this);
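For context, a minimal Python sketch of what that `vassert` enforces (names are illustrative, not Redpanda's actual types): the client remembers the fetch session id it negotiated, and every fetch response must echo the same id back.

```python
class FetchSession:
    """Illustrative client-side model of a Kafka fetch session."""

    def __init__(self, session_id: int):
        self._id = session_id

    def handle_response(self, response_session_id: int) -> None:
        # Counterpart of the quoted vassert: a response carrying a
        # different session id indicates corruption or a protocol bug.
        assert response_session_id == self._id, f"session mismatch: {self._id}"


session = FetchSession(7)
session.handle_response(7)  # matching id: fine
```

A mismatched id here is never a recoverable condition from the client's point of view, which is why the real code asserts rather than returning an error.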
Given `jM ��U:135912895 @{}`, it could be that the assertion failed due to corruption and the actual issue is completely unrelated. But some corruption could also have caused both, in which case the mismatch is valid...
Adding to the enterprise team as a first approximation, since the assertion ostensibly originates from the C++ Kafka client.
I took a look at the logs a bit more in depth; it seems there's some credence to @dotnwat's initial observations about corruption.
WARN 2023-08-17 09:37:15,660 [shard 1] kafka - group.cc:3482 - Parsing consumer:{range} data for group {mgmt-console-d439cbca276c9356-http} member {pandaproxy_client-03d39b49-30ea-463b-b423-516cce5d3978} failed: std::out_of_range (consumer metadata topic count too large 65535 > 2)
Looks like the data decoded within `decode_consumer_subscriptions` is also corrupt. The payload begins with a 32-bit signed integer that denotes the number of topic names. The exception was raised because the number of topics claimed by the payload header is large (65535) but there are only 2 bytes left in the payload, an impossible scenario.
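A minimal Python sketch of the defensive check described above (the function name and wire layout are simplified assumptions, not Redpanda's actual code): validate the claimed topic count against the bytes actually remaining before trusting it.

```python
import struct

def decode_consumer_subscriptions(buf: bytes) -> list[str]:
    # Payload starts with a 32-bit signed big-endian topic count.
    count = struct.unpack_from(">i", buf, 0)[0]
    off = 4
    remaining = len(buf) - off
    # Each topic needs at least a 2-byte length prefix, so a claimed
    # count of 65535 with only 2 bytes remaining is impossible.
    if count < 0 or count * 2 > remaining:
        raise ValueError(f"consumer metadata topic count too large {count}")
    topics = []
    for _ in range(count):
        tlen = struct.unpack_from(">h", buf, off)[0]
        off += 2
        topics.append(buf[off:off + tlen].decode())
        off += tlen
    return topics

# A well-formed payload (one topic, "foo") decodes normally:
ok = struct.pack(">i", 1) + struct.pack(">h", 3) + b"foo"
print(decode_consumer_subscriptions(ok))  # ['foo']
```

A corrupt payload such as `struct.pack(">i", 65535) + b"\x00\x00"` fails the bounds check up front instead of attempting 65535 reads past the end of the buffer, which is the shape of the `std::out_of_range` seen in the WARN line above.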
Further up there are some `unknown_member_id` logs; not entirely sure whether those can be attributed to data corruption, though.
WARN 2023-08-17 09:37:09,241 [shard 0] kafka/client - client.cc:176 - consumer_error: kafka::client::consumer_error ({mgmt-console-f00ac76ce017e831-http}, {mgmt-companion-30963329340392023066709832861487204713-4479980376504347255-http}, { error_code: unknown_member_id [25] })
Version & Environment

Redpanda version (use `rpk version`): docker.redpanda.com/redpandadata/redpanda:v23.2.5

Please also give versions of other components:
- OS (`/etc/os-release`): VERSION="22.04.2 LTS (Jammy Jellyfish)"
- `docker info`: -
- `kubectl version`:

What went wrong?
redpanda crashed a couple of times after applying a new maximum topic size using `rpk cluster config set retention_bytes 1073741824`. Maybe related: our cluster had run out of disk, so I deleted some topics and afterwards applied the new retention settings.

What should have happened instead?
No crash :)
Additional information
See redpanda.log for the actual crash and redpanda2 & redpanda3 for the next restarts.
redpanda.log redpanda2.log redpanda3.log
JIRA Link: CORE-1409