linkedin / brooklin

An extensible distributed system for reliable nearline data streaming at scale
BSD 2-Clause "Simplified" License
922 stars 137 forks source link

Only move datastream to stopped state if it is still in STOPPING state #974

Closed hshukla closed 10 months ago

hshukla commented 11 months ago

When the datastream stop and resume call happens, the datastream will transition to READY -> STOPPING -> STOPPED state, transitioning from STOPPING TO STOPPED is an async process. The LEADER_DO_ASSIGNMENT has a logic on fetching all the datastream on STOPPING state and work on the clean up whenever it gets called, does not matter on the event type, as it is unaware of the type/operation on which it got called.

Sometimes, this behavior is causing the more work and stopping other data streams twice(as there would be one more call later on for stopping datastream).

While this behavior was fine before, recently, with restart improvement work, as mentioned above, we now perform these clean up operation in an async fashion, which leads concurrent processing. The issue happens when the resume comes after stop and this behavior marking datastream to STOPPED state instead of marking it to READY, as the transition happens like READY -> STOPPING -> STOPPED -> READY -> againSTOPPED

Example

When multiple datastreams stop requests come together, each will be put on the event queue separately. The leader coordinator then de-ques those events. Here lets say, Stop comes for datastream_A at 0th second -> event will be in event queue; eventId=ABC Stop comes for datastream_B at 10th second -> event will be in event queue; eventId=XYZ

now if leader coordiator de-ques the eventId=ABC at 11th second, it will perform the stop work for both datastreams; datastream_A and datastream_B and marks both of them as STOPPED(lets say at 30th second)

during this time the leader coordinator de-ques the eventId=XYZ (at 25th second), and notices that it is still in STOPPING state and starts the clean up process, this is happening async manner.

meanwhile, restart request for datastream_B comes at lets say 31th(since it was marked STOPPED at 30th second) and the receiving host marks it as READY (at 35th second)

now, on the leader host, the thread which was processing eventId=XYZ finishes the work and marks the datatream_B to STOPPED state again. It looks like we do check the status of datastream before marking it to STOPPED, but that status comes from cache, and cache sometimes/ usually (as these requests are happening pretty close to each other) is stale.

Solution

In order to speed up the fix for this regression, the quick light weight fix is to introduce ZK lookup for datastream status and mark it to STOPPED only if it is in STOPPING state, this will also invalid the cache and update the stale value in cache. This solution will still be useful along with the long term/perm fix described following.

For long term, the real issue here is, we are doing more work or extra work than we should be, we should introduce the in-memory data structure for STOP events that being processed and make sure the same task is being assigned to only one time. In the case of failure, we should call LEADER_DO_ASSIGMENT again. This will make sure that the problem does not occur at the first place. This change is tricky and would be mindful while being made, so the plan is to observe the system with this PR change first.