RW Source Manager - Failed Sink Job (Inconsistent State between Durable & In-Memory Cache)

npadbidri commented 3 months ago

Describe the bug

As per discussions with Tianxiao Shen, the Source Manager component in RisingWave had stopped scheduling the job due to inconsistent state between the RW Database (durable state) and RW REDIS (in-memory cache). We were therefore asked to perform any or all of the following 3 options:

Fire RECOVER command
Restart RW Cluster
Drop and Recreate RW Sink

Given that we would have RW Sinks numbering in the thousands, this anomaly would be catastrophic in Production. Can you please fix this bug, and also let me know whether we could have an alerting mechanism for such scenarios?

Error message/log

No errors were reported in our RW Cloud Portal, and we did not get any exceptions programmatically either. This is most concerning because in a Production scenario, where we have about 2000 Sinks running, ALL of them could fail silently without our being alerted.

This primarily means loss of revenue!

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

RisingWave Cloud

The version of RisingWave

v1.10.0-rc.1-patch-us-west-2-11-type-mismatch-patch

Additional context

The issue was reported on Friday 09-August-2024 around 1:30 PM (Indian Standard Time).
The Source was working fine, as we were able to query it; however, the Sink was not reflecting the data.

fuyufjh commented 3 months ago

Sorry for the inconvenience! Let us take a look.

cc. @xxchan Can you elaborate on

inconsistent state between the RW Database (durable state) and RW REDIS (in-memory cache)

I didn't get what RW REDIS (in-memory cache) is. Does it mean a Redis Sink?

xxchan commented 3 months ago

The cause is that source_fragments in the source manager was outdated and no longer consistent with the table fragments info.

https://risingwave-labs.slack.com/archives/C0606NNR74P/p1723192165241219?thread_ts=1723191534.743279&cid=C0606NNR74P
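
To illustrate the kind of inconsistency described above, here is a minimal sketch of comparing a cached source-to-fragments mapping against the authoritative one. It is not RisingWave's actual code (which is written in Rust); the `Drift` type, the `diff_source_fragments` function, and the plain-dict representation of both views are assumptions made up for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Drift:
    """Differences between a cached view and the durable view (hypothetical type)."""

    # Fragments present in the durable view but absent from the cache.
    missing_in_cache: dict[int, set[int]] = field(default_factory=dict)
    # Fragments still in the cache but no longer in the durable view.
    stale_in_cache: dict[int, set[int]] = field(default_factory=dict)

    def is_consistent(self) -> bool:
        return not self.missing_in_cache and not self.stale_in_cache


def diff_source_fragments(
    cached: dict[int, set[int]],
    durable: dict[int, set[int]],
) -> Drift:
    """Compare two source_id -> fragment_ids mappings (hypothetical helper).

    `cached` stands in for an in-memory mapping such as source_fragments,
    `durable` for the mapping derived from the table fragments info.
    """
    drift = Drift()
    for source_id in cached.keys() | durable.keys():
        cached_ids = cached.get(source_id, set())
        durable_ids = durable.get(source_id, set())
        missing = durable_ids - cached_ids
        stale = cached_ids - durable_ids
        if missing:
            drift.missing_in_cache[source_id] = missing
        if stale:
            drift.stale_in_cache[source_id] = stale
    return drift


# Example: the cache never learned about fragment 20 of source 2,
# so nothing would be scheduled for it until the cache is rebuilt.
example_drift = diff_source_fragments(
    cached={1: {10, 11}},
    durable={1: {10, 11}, 2: {20}},
)
```

A periodic check of this shape that raises an alert whenever `is_consistent()` returns `False` is one possible way to surface this class of silent failure; whether it fits the real source manager is an open question.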
