Closed derekperkins closed 3 months ago
@derekperkins can you please let me know if this is still an issue for you? If so, I'll try and get this in my priority queue.
I'm going to close this for now as there's no test case so I cannot verify if it's still an issue or not and we have not received any followup information. We can re-open this at any time if we get more info. Thanks!
Overview of the Issue
I don't know what is causing this for sure, but it seems to happen more often when I have more vstreams / vreplication streams running. I see
message_manager.go:631] Context canceled, exiting vstream
in the logs, which appears to be an unrecoverable error. The stream then exits, but vttablet continues to run. This leaves a table orphaned, still accepting writes, but not issuing any messages to subscribers. This has caused us outages where we believed everything to be healthy, but were in reality not processing data for certain tables on certain shards.
What I would like to see happen is for this to recover, or if that isn't reasonable, to shut down the tablet completely so it can come back up again healthy.
When context is canceled, the vstream returns
io.EOF
https://github.com/vitessio/vitess/blob/c9b6d608f700d975105fb400557daa7aa9c30386/go/vt/vttablet/tabletserver/messager/message_manager.go#L658
This appears to race with this line, making me believe that this select should be removed. https://github.com/vitessio/vitess/blob/c9b6d608f700d975105fb400557daa7aa9c30386/go/vt/vttablet/tabletserver/messager/message_manager.go#L631-L632
I see this often (always?) with #8909, so I don't know if these errors are somehow causing the deadlock or being caused by the same root error.
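To make the requested behavior concrete, here is a minimal sketch of a supervising loop. It is not the Vitess implementation: `streamFunc`, `supervise`, `simulate`, and `errStreamDead` are hypothetical names made up for illustration. It assumes the stream returns io.EOF when its context is canceled, as described above, and treats an exit as intentional only if the parent context itself was canceled. Any other exit is retried, and after a bounded number of attempts an error is surfaced so the caller can mark the tablet unhealthy or shut it down.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
)

// streamFunc is a hypothetical stand-in for one run of a vstream.
// It is assumed to block until the stream ends and then return an error.
type streamFunc func(ctx context.Context) error

// errStreamDead is a hypothetical sentinel: the stream never stayed up,
// so the caller should report unhealthy or exit instead of running silently.
var errStreamDead = errors.New("stream kept exiting; report unhealthy or restart")

// supervise restarts the stream whenever it exits while the parent context
// is still live. It returns the number of attempts made and the final error.
func supervise(parent context.Context, stream streamFunc, maxAttempts int) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := stream(parent)
		// Only a canceled parent context counts as an intentional shutdown.
		if parent.Err() != nil {
			return attempt, parent.Err()
		}
		// The stream ended on its own (for example with io.EOF): restart it.
		fmt.Printf("stream exited unexpectedly (attempt %d): %v; restarting\n", attempt, err)
	}
	return maxAttempts, errStreamDead
}

// simulate runs supervise against a fake stream that returns io.EOF on
// every call and requests a real shutdown once it has failed `failures` times.
func simulate(failures, maxAttempts int) (int, error) {
	parent, cancel := context.WithCancel(context.Background())
	defer cancel()

	calls := 0
	stream := func(ctx context.Context) error {
		calls++
		if calls > failures {
			cancel() // simulate an intentional shutdown of the tablet
		}
		return io.EOF
	}
	return supervise(parent, stream, maxAttempts)
}

func main() {
	// Two spurious exits, then an intentional shutdown on the third run.
	attempts, err := simulate(2, 5)
	fmt.Println(attempts, err)

	// The stream never recovers within the allowed attempts.
	attempts, err = simulate(10, 3)
	fmt.Println(attempts, err)
}
```

The design choice here is that the decision to stop is made in one place, by checking the parent context after the stream returns, rather than by a separate select on the context that can race with the stream's own io.EOF return. A real fix would also need backoff between restarts, which this sketch omits.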
Operating system and Environment details
vttablet: 11.0.0
GKE: 1.20.9
Log Fragments
Full logs here: https://gist.github.com/derekperkins/299519b0bd9da3618645de908a37ca0c