numaproj / numaflow

Kubernetes-native platform to run massively parallel data/streaming jobs
https://numaflow.numaproj.io/
Apache License 2.0

Certain scenarios cause Buffer cleanup job to run at the same time as the Creation job - race condition #1956

Open juliev0 opened 1 month ago

juliev0 commented 1 month ago

Describe the bug

~~While the ISB Batch Jobs for creating and cleaning ISB buckets and buffers are Owned by the Pipeline, the Pods themselves aren't.~~ (Struck through after clarification from @whynowy in the comment below.)

As a result, if you delete a Pipeline and then immediately re-create the same Pipeline, the "clean" Job Pod from the first Pipeline can still be running while the "create" Job Pod from the second one runs. The cleanup can then remove the buckets and buffers that were just created for the second Pipeline, leaving its Pods stuck in the Pending state, waiting indefinitely for those buckets/buffers to be created.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.

For quick help and support, join our slack channel.

whynowy commented 1 month ago

I understand what the issue is, but the root cause is not the Pod owner references; a Pod's owner reference should always point to the Job.

Current situation:

  1. The creation Job has an owner reference to the Pipeline;
  2. The cleanup Job does NOT have the owner reference. This is intentional: the Pipeline deletion finalizer does not wait for the cleanup Job to finish, so if the reference were in place, the Job object would be garbage-collected before it completes the buffer cleanup (see the sketch below).
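
For illustration, here is a minimal Go sketch of that ownership layout using controller-runtime's `SetControllerReference`. This is not Numaflow's actual reconciler code; the `buildCreationJob`/`buildCleanupJob` helpers are hypothetical names used only to show the difference between the two Jobs:

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// buildCreationJob wires the creation Job to the Pipeline, so the Job (and the
// Pod it owns) is garbage-collected together with the Pipeline.
// "pipeline" stands in for the real Pipeline custom resource object.
func buildCreationJob(pipeline client.Object, job *batchv1.Job, scheme *runtime.Scheme) error {
	return controllerutil.SetControllerReference(pipeline, job, scheme)
}

// buildCleanupJob deliberately sets no owner reference: the deletion finalizer
// does not wait for this Job, and an owner reference would let garbage
// collection delete the Job before it finishes the buffer cleanup.
func buildCleanupJob(job *batchv1.Job) *batchv1.Job {
	job.OwnerReferences = nil
	return job
}
```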

To solve this problem, there are 2 possible solutions:

  1. In the Pipeline deletion finalizer, wait until the cleanup Job has completed. Not recommended, since it would increase the time needed for Pipeline deletion;
  2. During normal (non-deletion) Pipeline reconciliation, check whether an uncompleted cleanup Job still exists before creating the buffers/buckets (a sketch of this option follows below).
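
A hedged sketch of option 2, assuming cleanup Jobs can be selected by a label. The `numaflow.numaproj.io/isb-cleanup` label key and the `waitForCleanupJobs` helper are hypothetical, not Numaflow's actual API:

```go
package example

import (
	"context"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForCleanupJobs reports whether a cleanup Job from a previous deletion or
// topology change is still running, so the reconciler can requeue instead of
// creating buffers/buckets that the cleanup would immediately remove.
func waitForCleanupJobs(ctx context.Context, c client.Client, namespace, pipelineName string) (ctrl.Result, bool, error) {
	var jobs batchv1.JobList
	if err := c.List(ctx, &jobs,
		client.InNamespace(namespace),
		client.MatchingLabels{"numaflow.numaproj.io/isb-cleanup": pipelineName}, // hypothetical label
	); err != nil {
		return ctrl.Result{}, false, err
	}
	for _, job := range jobs.Items {
		// Simplified completion check: a cleanup Job that has neither succeeded
		// nor failed is treated as still running, so hold off and requeue.
		if job.Status.Succeeded == 0 && job.Status.Failed == 0 {
			return ctrl.Result{RequeueAfter: 10 * time.Second}, true, nil
		}
	}
	return ctrl.Result{}, false, nil
}
```

The non-deletion reconcile path would call something like this before creating the ISB buffers/buckets and return early while a cleanup Job is still running.
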
juliev0 commented 1 month ago

Thanks. It makes sense that the Job would own the Pod. I've renamed the issue to describe the actual bug in that case.

juliev0 commented 3 weeks ago

Found an additional scenario that causes the same problem:

Change the topology of a Pipeline from A->B->C->D to A->C->D, and then sometime in the next minute change it back. The bucket associated with Vertex B can get removed by the cleanup Job from the first modification if that Job is still running while the creation Job from the second modification is re-creating the vertex.