vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

messaging: vstream errors out with EOF, leaving orphaned subscribers #8910

Closed derekperkins closed 3 months ago

derekperkins commented 3 years ago

Overview of the Issue

I don't know for sure what is causing this, but it seems to happen more often when more vstreams / vreplication streams are running. The logs show "message_manager.go:631] Context canceled, exiting vstream", which appears to be an unrecoverable error: the stream exits, but vttablet keeps running. This leaves a table orphaned: it still accepts writes but no longer issues any messages to subscribers. This has caused outages in which we believed everything was healthy but were in reality not processing data for certain tables on certain shards.

What I would like to see is for the stream to recover or, if that isn't reasonable, for the tablet to shut down completely so that it can come back up healthy.

When the context is canceled, the vstream returns io.EOF: https://github.com/vitessio/vitess/blob/c9b6d608f700d975105fb400557daa7aa9c30386/go/vt/vttablet/tabletserver/messager/message_manager.go#L658

This appears to race with the select at the lines linked here, which makes me believe that the select should be removed: https://github.com/vitessio/vitess/blob/c9b6d608f700d975105fb400557daa7aa9c30386/go/vt/vttablet/tabletserver/messager/message_manager.go#L631-L632

I see this often (always?) with #8909, so I don't know if these errors are somehow causing the deadlock or being caused by the same root error.

Operating system and Environment details

vttablet: 11.0.0
GKE: 1.20.9

Log Fragments

Full logs here: https://gist.github.com/derekperkins/299519b0bd9da3618645de908a37ca0c

I0928 23:57:12.052015 1897126 snapshot_conn.go:79] Locking table searches__requester_dataforseo__msgs for copying
I0928 23:57:12.052827 1897126 snapshot_conn.go:72] Tables unlocked: searches__requester_dataforseo__msgs
I0928 23:57:13.093149 1897126 uvstreamer.go:363] Stream() called
I0928 23:57:13.094009 1897126 uvstreamer.go:298] sendEventsForCurrentPos
E0928 23:57:13.094072 1897126 vstreamer.go:907] stream (at source tablet) error @ 081b0660-e829-11ea-80ee-9659fb9bd5cc:1-123920462,08abfae8-6741-11e9-bfae-0a580a303502:1-246462,28cd3b90-7176-11e9-8dde-b60ada6337c4:1-9210479,2ad85940-3011-11ea-86c9-aa44e2131609:1-15541278,2f8f7002-6974-11ea-b000-9261aac98f09:1-196,481b8d62-3011-11ea-a67b-5a04efbaf0d0:1-38068128,4e6824b2-4c6b-11e9-a1e4-5a5fdafc1a75:1-7837863,59abbfb1-716e-11e9-936e-ea2d18dba9f1:1-2124907,5de51511-69ab-11ea-9ccb-76e8377684cb:1-19141164,690d5050-90fe-11e9-8e7e-5a656828eb85:1-42338093,865ca09b-696e-11ea-99be-9aecb41b6617:1-69603,950a9561-51c8-11e9-aab6-c2bcd856a282:1-1452601,c9f24748-6740-11e9-8d1b-0a580a302e02:1-13,ce6241e3-698b-11ea-a58b-962095927369:1-52615,f0b07a5d-596f-11e9-a303-0a580a300808:1-5334026,f271d3c6-69ee-11ea-ac2a-5ec6753a4b21:1-211925176: EOF
I0928 23:57:13.094093 1897126 message_manager.go:631] Context canceled, exiting vstream
I0928 23:57:13.096892 1897126 uvstreamer.go:363] Stream() called
I0928 23:57:13.097522 1897126 uvstreamer.go:298] sendEventsForCurrentPos
E0928 23:57:13.097559 1897126 vstreamer.go:907] stream (at source tablet) error @ 081b0660-e829-11ea-80ee-9659fb9bd5cc:1-123920462,08abfae8-6741-11e9-bfae-0a580a303502:1-246462,28cd3b90-7176-11e9-8dde-b60ada6337c4:1-9210479,2ad85940-3011-11ea-86c9-aa44e2131609:1-15541278,2f8f7002-6974-11ea-b000-9261aac98f09:1-196,481b8d62-3011-11ea-a67b-5a04efbaf0d0:1-38068128,4e6824b2-4c6b-11e9-a1e4-5a5fdafc1a75:1-7837863,59abbfb1-716e-11e9-936e-ea2d18dba9f1:1-2124907,5de51511-69ab-11ea-9ccb-76e8377684cb:1-19141164,690d5050-90fe-11e9-8e7e-5a656828eb85:1-42338093,865ca09b-696e-11ea-99be-9aecb41b6617:1-69603,950a9561-51c8-11e9-aab6-c2bcd856a282:1-1452601,c9f24748-6740-11e9-8d1b-0a580a302e02:1-13,ce6241e3-698b-11ea-a58b-962095927369:1-52615,f0b07a5d-596f-11e9-a303-0a580a300808:1-5334026,f271d3c6-69ee-11ea-ac2a-5ec6753a4b21:1-211925176: EOF
I0928 23:57:13.097564 1897126 message_manager.go:631] Context canceled, exiting vstream

mattlord commented 1 year ago

@derekperkins can you please let me know if this is still an issue for you? If so, I'll try and get this in my priority queue.

mattlord commented 3 months ago

I'm going to close this for now: there's no test case, so I cannot verify whether it's still an issue, and we have not received any follow-up information. We can re-open this at any time if we get more info. Thanks!