Open juliev0 opened 1 month ago
I understand what the issue is, but the root cause is not the pod owner references; a pod's owner reference should always point to the Job.
Current situation: the Job object will be cleaned up before it finishes the buffer cleanup.
To solve this problem, there are 2 possible solutions:
Thanks. Makes sense Job would own the Pod. Renamed to describe actual bug in that case.
Found an additional scenario that causes the same problem:
Change the topology of a Pipeline from A->B->C->D to A->C->D and then sometime in the next minute change it back. The bucket associated with Vertex B could get cleaned up by the first modification if the Deletion Job is still running while the Creation Job is creating the vertex.
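To make the topology scenario concrete, here is a small sketch that works out which vertices are exposed to the race: those removed by the first edit and re-added by the second. The function name `vertices_at_risk` and the list-of-names representation are assumptions for illustration only, not how Numaflow models a Pipeline.

```python
def vertices_at_risk(before, after, restored):
    """Return vertices removed by the first edit and re-added by the second.

    A still-running Deletion Job from the first edit could remove the
    buckets that the Creation Job from the second edit is setting up
    for exactly these vertices.
    """
    removed = set(before) - set(after)      # cleaned up by the first edit
    readded = set(restored) - set(after)    # created again by the second edit
    return sorted(removed & readded)


original = ["A", "B", "C", "D"]   # A->B->C->D
modified = ["A", "C", "D"]        # A->C->D

# Changing the topology and then changing it back exposes Vertex B.
at_risk = vertices_at_risk(original, modified, original)
print(at_risk)
```

Vertices that exist in both topologies (A, C, D here) are not touched by either Job in this model, so only B is reported.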
Describe the bug
While the ISB Batch Jobs for creating and cleaning ISB buckets and buffers are owned by the Pipeline, the Pods themselves aren't. (Striking through this part after clarification from @whynowy in a comment below.)
The result is that if you delete a Pipeline and then re-create the same Pipeline immediately afterward, the "clean" Job Pod from the first one can be running at the same time as the "create" Job Pod from the second one. This can cause the second Pipeline to have its buckets and buffers removed, leaving the Pipeline Pods stuck in Pending state, waiting indefinitely for those buckets/buffers to be created.
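The interleaving described above can be reproduced with a toy model. This is a deterministic simulation, not Numaflow code: the shared `store` set stands in for the ISB service, and the bucket names and the functions `create_job`, `clean_job`, and `run_interleaved` are hypothetical.

```python
# Toy model of the race. The "ISB service" is a set of bucket names shared
# by every Job that targets the same Pipeline name.

def create_job(store, buckets):
    """Create buckets one at a time, yielding after each step."""
    for name in buckets:
        store.add(name)
        yield f"create {name}"


def clean_job(store, buckets):
    """Remove buckets one at a time, yielding after each step."""
    for name in buckets:
        store.discard(name)
        yield f"clean {name}"


def run_interleaved(first, second):
    """Alternate single steps of two jobs until both are exhausted."""
    log = []
    jobs = [first, second]
    while jobs:
        for job in list(jobs):
            try:
                log.append(next(job))
            except StopIteration:
                jobs.remove(job)
    return log


# Hypothetical bucket names for a Pipeline that is deleted and re-created.
buckets = ["my-pipeline-a-b", "my-pipeline-b-c"]
store = set(buckets)  # state left behind by the first Pipeline

# The new Pipeline's "create" Job starts while the old "clean" Job still runs.
log = run_interleaved(create_job(store, buckets), clean_job(store, buckets))

# Buckets the new Pipeline expects but that no longer exist.
missing = sorted(set(buckets) - store)
print(log)
print(missing)
```

In this interleaving every bucket the second Pipeline creates is removed right afterward by the first Pipeline's clean Job, so `missing` lists all of them. That corresponds to the Pods waiting in Pending state for buckets that are never re-created.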
Message from the maintainers:
Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.