twmb / franz-go

franz-go contains a feature complete, pure Go library for interacting with Kafka from 0.8.0 through 3.7+. Producing, consuming, transacting, administrating, etc.

Multiple partitions stop consuming #516

Closed: suxiangdong closed this issue 10 months ago

suxiangdong commented 1 year ago

Setup: franz-go v1.10.0; AliCloud Kafka service with 3 brokers; 2 consumers; 12 partitions.

When I upgraded the disk, partitions [2, 5, 8, 11] stopped consuming while the others were normal. After I restarted the consumers, everything went back to normal. In debug mode I found entries like "map[ta-logbus-v1:map[2:{-1 23817}]]" and "map[ta-logbus-v1:map[2:{23817 e0 ce0} 5:{ 23946 e0 ce0}]]". What I can't understand is why a restart brings things back to normal.

describe group output: (screenshot attached)

Consumer 1 (partitions 0 - 5): consumer1_debug.log
Consumer 2 (partitions 6 - 11): consumer2_debug.log

twmb commented 11 months ago

The Discord link is in the readme: https://discord.gg/K4R5c8zsMS. I monitor it occasionally (usually more when there's a bug). Movement should be fine; it's pretty strongly tested in production currently (and in Redpanda CI, which actually surfaced a bug I am fixing in the next release, coming out imminently). However, moving partitions is one of the most complicated edge cases, especially for the consumer half of the code.

richardartoul commented 11 months ago

Cool, I'm in. What's the bug? Is it possible it's the one I'm seeing?

twmb commented 11 months ago

Not possible: I'm fixing that in #603, and you would see a panic if you ran into what I'm fixing.

suxiangdong commented 11 months ago

> Interesting. Want to debug this on Discord again?

Okay. When do you have time?

twmb commented 10 months ago

@suxiangdong Are you available Wednesday of this week, same time?

suxiangdong commented 10 months ago

> @suxiangdong Are you available Wednesday of this week, same time?

No problem. See you later.

twmb commented 10 months ago

Diagnosed in Discord:

The brokers get into a weird state after the disk upgrade where they take an extremely long time to reply to fetch requests. I don't know if there's something that can be done on the broker side to diagnose this. Maybe the brokers eventually recover?

I saw the problem persisting even after I restarted the consumer.

However, the reason consuming outright stops with franz-go while the brokers are slow is that the client kills connections after not receiving a response for a certain amount of time. By default, FetchRequests use a MaxWaitTime of 5s. The client has a second timeout, RequestTimeoutOverhead, which says "after this amount of time, if I do not receive a response, kill the connection". For fetches, the overhead is added on top of the fetch max wait time, meaning the connection is only killed once MaxWaitTime + RequestTimeoutOverhead (5s + 10s by default) passes with no response.
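A minimal sketch of how these two options fit together when constructing a client (the broker addresses and group name below are placeholders, and the durations are just the defaults described above):

```go
package main

import (
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	cl, err := kgo.NewClient(
		// Placeholder seed brokers and group; the topic is the one from this issue.
		kgo.SeedBrokers("broker1:9092", "broker2:9092", "broker3:9092"),
		kgo.ConsumeTopics("ta-logbus-v1"),
		kgo.ConsumerGroup("example-group"),

		// How long a broker may hold a fetch request open waiting for
		// data before it must respond (franz-go default: 5s).
		kgo.FetchMaxWait(5*time.Second),

		// Extra time the client waits for a response, on top of the
		// request's own wait time, before killing the connection
		// (franz-go default: 10s). Raising this is the workaround
		// described below for very slow brokers; the right value is
		// situational.
		kgo.RequestTimeoutOverhead(10*time.Second),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
}
```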

When I change the RequestTimeoutOverhead to be super high, I do get data eventually. Nothing explains why the broker is so slow. Also, the broker is not respecting FetchMaxWait.

We're currently assuming that the reason this works with the Confluent client is that its default timeouts are higher: the socket timeout is 60s (on top of the fetch timeout), compared to franz-go's 10s overhead. It's possible that the Confluent client simply has more time and thus does eventually recover enough.

Closing this for now but we can reopen if necessary.