redpanda-data / redpanda


Crash after removing topics + modifying cluster config. #12870

Open Scarjit opened 1 year ago

Scarjit commented 1 year ago

Version & Environment

Redpanda version: (use rpk version): docker.redpanda.com/redpandadata/redpanda:v23.2.5

Please also give versions of other components:

What went wrong?

Redpanda crashed a couple of times after applying a new maximum topic size with rpk cluster config set retention_bytes 1073741824. Possibly related: our cluster had run out of disk, so I deleted some topics and afterwards applied the new retention setting.

What should have happened instead?

No crash :)

Additional information

See redpanda.log for the actual crash, and redpanda2.log & redpanda3.log for the subsequent restarts.

redpanda.log redpanda2.log redpanda3.log

JIRA Link: CORE-1409

andijcr commented 1 year ago

the first crash was caused by

ERROR 2023-08-17 09:37:18,539 [shard 0] assert - jM �U:135912895 @{} - [{}:{}] sending {}:{} for {}, response {}: failed to log message: fmt='Assert failure: ({}:{}) '{}' session mismatch: {}': fmt::v8::format_error (argument not found)
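For context on the "failed to log message ... fmt::v8::format_error (argument not found)" part of that line: fmt reports "argument not found" when a runtime format string references an argument that was never supplied, and the logger then appears to fall back to printing the raw format string, which is why the placeholders show up un-substituted above. A minimal standalone sketch of that failure mode (plain fmt, not Redpanda code):

```cpp
// Illustrative only: shows how fmt::format_error ("argument not found")
// arises when a runtime format string has more placeholders than arguments.
#include <fmt/format.h>
#include <iostream>
#include <string>

int main() {
    int session_id = 135912895; // arbitrary value for the example
    try {
        // Two placeholders, one argument: vformat does no compile-time
        // checking, so the second "{}" has no argument and this throws.
        std::string msg = fmt::vformat(
          "session mismatch: {} {}", fmt::make_format_args(session_id));
        std::cout << msg << '\n';
    } catch (const fmt::format_error& e) {
        // Prints "format_error: argument not found" with fmt 8.x.
        std::cout << "format_error: " << e.what() << '\n';
    }
}
```
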

piyushredpanda commented 1 year ago

We will look into this and get back. Thanks for reporting, @Scarjit

github-actions[bot] commented 9 months ago

This issue hasn't seen activity in 3 months. If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.

dotnwat commented 8 months ago

Probably this assertion

[nwatkins@fedora redpanda]$ git grep "session mismatch"
src/v/kafka/client/fetch_session.cc:    vassert(res.data.session_id == _id, "session mismatch: {}", *this);
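For background, the invariant that vassert protects comes from Kafka's incremental fetch sessions (KIP-227): once the broker assigns a fetch session id, every later fetch response for that session must echo the same id back, so a mismatch means the client's and broker's session state have diverged (or the response is corrupt). A rough illustrative sketch of the client-side check; the type and field names below are made up, not the actual Redpanda client types:

```cpp
// Illustrative sketch of a client-side fetch-session id check (KIP-227).
// Names are invented for the example; this is not the Redpanda implementation.
#include <cassert>
#include <cstdint>

struct fetch_response {
    int32_t session_id; // id echoed by the broker
    // ... partition data omitted ...
};

class fetch_session {
public:
    void apply(const fetch_response& res) {
        if (_id == 0) {
            // First (full) fetch: adopt the id the broker assigned.
            _id = res.session_id;
            return;
        }
        // Incremental fetch: the broker must echo our session id; anything
        // else means the session state is inconsistent, so asserting (rather
        // than continuing with bad state) is the safe choice.
        assert(res.session_id == _id && "session mismatch");
    }

private:
    int32_t _id{0}; // 0 = no established session yet
};
```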

dotnwat commented 8 months ago

Given jM ��U:135912895 @{}, it could be that the assertion failed due to corruption and the actual issue is completely unrelated. But corruption could also have caused both, in which case the mismatch itself is valid...

dotnwat commented 8 months ago

Adding to the enterprise team as a first approximation, since the assertion ostensibly originates from the C++ Kafka client.

graphcareful commented 8 months ago

I took a look at the logs in a bit more depth; it seems like there's some credence to @dotnwat's initial observations about corruption.

WARN  2023-08-17 09:37:15,660 [shard 1] kafka - group.cc:3482 - Parsing consumer:{range} data for group {mgmt-console-d439cbca276c9356-http} member {pandaproxy_client-03d39b49-30ea-463b-b423-516cce5d3978} failed: std::out_of_range (consumer metadata topic count too large 65535 > 2)

Looks like the data decoded within decode_consumer_subscriptions is also corrupt. The payload begins with a 32-bit signed integer that denotes the number of topic names. The exception was raised because the number of topics claimed by the payload header is large (65535) while there are only 2 bytes left in the payload, an impossible scenario.
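Roughly, the kind of guard that produces that out_of_range looks like the sketch below. The function name and buffer handling are illustrative rather than the actual Redpanda decoder, but the idea is the same: never trust a length prefix that claims more elements than the remaining payload could possibly hold.

```cpp
// Illustrative sketch of validating a 32-bit topic-count prefix against the
// bytes actually remaining in the payload (not the actual Redpanda decoder).
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

std::vector<std::string> decode_topic_names(const char* buf, size_t len) {
    if (len < sizeof(int32_t)) {
        throw std::out_of_range("payload too small for topic count");
    }
    int32_t count = 0;
    std::memcpy(&count, buf, sizeof(count)); // endianness handling omitted
    size_t remaining = len - sizeof(count);

    // Each topic name needs at least one byte, so a count larger than the
    // remaining bytes (e.g. 65535 > 2) can only come from corrupt data.
    if (count < 0 || static_cast<size_t>(count) > remaining) {
        throw std::out_of_range(
          "consumer metadata topic count too large " + std::to_string(count)
          + " > " + std::to_string(remaining));
    }

    std::vector<std::string> names;
    names.reserve(static_cast<size_t>(count));
    // ... decode `count` length-prefixed strings from buf + sizeof(count) ...
    return names;
}
```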

Further up there are some unknown_member_id logs; I'm not entirely sure whether those can be attributed to data corruption, though.

WARN  2023-08-17 09:37:09,241 [shard 0] kafka/client - client.cc:176 - consumer_error: kafka::client::consumer_error ({mgmt-console-f00ac76ce017e831-http}, {mgmt-companion-30963329340392023066709832861487204713-4479980376504347255-http}, { error_code: unknown_member_id [25] })