numaproj / numaflow

Kubernetes-native platform to run massively parallel data/streaming jobs
https://numaflow.numaproj.io/
Apache License 2.0

No "Create Job" executed - Possible Race Condition #2083

Closed: juliev0 closed this issue 1 month ago

juliev0 commented 1 month ago

Describe the bug
Normally, Numaplane's e2e test passes, but this captures an instance in which it did not.

The sequence of events which Numaplane did:

  1. Create ISBService
  2. Create Pipeline
  3. After the Pipeline is running, update it in a trivial way that doesn't require pausing
  4. After the Pipeline is updated, pause it in preparation for an update that does require pausing
  5. Update the Pipeline's topology from in->out to in->cat->out and keep its desiredPhase=Paused
  6. Once the topology change is reconciled, update desiredPhase=Running

Normally this works fine: step 5 causes a Creation Job and a Deletion Job:

{"level":"info","ts":"2024-09-23T20:08:28.842186617Z","logger":"numaflow.controller-manager","caller":"pipeline/controller.go:337","msg":"Created a job successfully for ISB creating","namespace":"numaplane-system","pipeline":"test-pipeline-rollout","buffers":["numaplane-system-test-pipeline-rollout-cat-0"],"buckets":["numaplane-system-test-pipeline-rollout-cat-out","numaplane-system-test-pipeline-rollout-in-cat"],"servingStreams":[]}
{"level":"info","ts":"2024-09-23T20:08:28.850466284Z","logger":"numaflow.controller-manager","caller":"pipeline/controller.go:356","msg":"Created ISB Svc deleting job successfully","namespace":"numaplane-system","pipeline":"test-pipeline-rollout","buffers":[],"buckets":["numaplane-system-test-pipeline-rollout-in-out"]}

(above is extracted from a good run)
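(To make those log lines easier to read, here is a minimal sketch of which buffers and buckets the in->out to in->cat->out change implies. The helper functions and the naming pattern are assumptions inferred from the names visible in the log, not Numaflow's actual implementation.)

```go
package main

import "fmt"

func main() {
	ns, pl := "numaplane-system", "test-pipeline-rollout"

	// Naming pattern as it appears in the controller log (an assumption for
	// illustration, not taken from Numaflow's source):
	//   buffer: <namespace>-<pipeline>-<vertex>-<partition>
	//   bucket: <namespace>-<pipeline>-<from>-<to>
	buffer := func(vertex string) string { return fmt.Sprintf("%s-%s-%s-0", ns, pl, vertex) }
	bucket := func(from, to string) string { return fmt.Sprintf("%s-%s-%s-%s", ns, pl, from, to) }

	// Adding the cat vertex introduces one buffer and two buckets to create...
	fmt.Println("buffer to create: ", buffer("cat"))
	fmt.Println("buckets to create:", bucket("in", "cat"), bucket("cat", "out"))

	// ...and removing the direct in->out edge leaves one bucket to delete.
	fmt.Println("bucket to delete: ", bucket("in", "out"))
}
```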

However, in this run the Creation Job was not executed, which left the Daemon Pods unable to get past the isbsvc-validate init container's check for buffers and buckets.

What I suspect
In the log I see that the in Vertex was successfully updated and the cat Vertex was created, but when the out Vertex was supposed to be updated, a Resource Version conflict occurred here. This may have happened prior to the Creation Job being created. This should cause the Numaflow Controller to return and then re-reconcile idempotently.
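For context, here is a minimal sketch of the conflict-and-requeue pattern being described; it is illustrative only, not the actual pipeline controller code. apierrors.IsConflict is the standard way to recognize a Resource Version conflict, and in a controller-runtime reconciler the error is typically returned as-is so the request is requeued and reconciled again from fresh state.

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// Simulate the error the controller would get when its update of the "out"
	// Vertex races with another writer: a ResourceVersion conflict.
	gr := schema.GroupResource{Group: "numaflow.numaproj.io", Resource: "vertices"}
	err := apierrors.NewConflict(gr, "test-pipeline-rollout-out",
		fmt.Errorf("the object has been modified; please apply your changes to the latest version and try again"))

	// In a Reconcile loop, the usual pattern is to return this error unchanged;
	// the controller then requeues and re-reconciles against the current state.
	if apierrors.IsConflict(err) {
		fmt.Println("conflict detected, aborting this reconciliation:", err)
	}
}
```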

However, note that the Creation Job depends on newBuffers and newBuckets, which in turn are derived from these Vertex values. In this bug, the in and cat Vertices were already updated successfully on the previous reconciliation, so the current state of the Vertices no longer reflects the new buffers and new buckets which need to be added.
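A self-contained sketch of that failure mode (my own model of the reasoning above, not the controller's actual code): if newBuffers/newBuckets are computed by diffing the desired set against whatever the existing Vertices already carry, a reconciliation that partially applies the Vertex updates before hitting a conflict leaves nothing for the next reconciliation to create.

```go
package main

import "fmt"

// diff returns the elements of desired that are not already present in existing.
func diff(desired, existing []string) []string {
	seen := map[string]bool{}
	for _, e := range existing {
		seen[e] = true
	}
	var out []string
	for _, d := range desired {
		if !seen[d] {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	// Buffers implied by the new in->cat->out topology.
	desired := []string{"in", "cat", "out"}

	// Reconciliation #1: the Vertices still reflect the old in->out topology,
	// so the diff correctly says the cat buffer must be created...
	fmt.Println("reconcile #1 newBuffers:", diff(desired, []string{"in", "out"}))

	// ...but the out Vertex update hits a ResourceVersion conflict after in was
	// updated and cat was created, and we return before the Creation Job exists.

	// Reconciliation #2: the current Vertices (updated in, new cat, old out)
	// already account for every desired buffer, so the diff is empty and no
	// Creation Job is ever submitted.
	fmt.Println("reconcile #2 newBuffers:", diff(desired, []string{"in", "cat", "out"}))
}
```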

I will add logs.

Environment (please complete the following information):


Message from the maintainers:

Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.

For quick help and support, join our slack channel.

juliev0 commented 1 month ago

Here is the numaflow controller log. I added comments in it marking when Numaplane paused the pipeline and where it updated the Pipeline's spec (scroll down to ~2024-09-23T04:51:07.9).

numaflow-controller.log

whynowy commented 1 month ago

@KeranYang - do you want to take a look?

juliev0 commented 1 month ago

> @KeranYang - do you want to take a look?

Let me know if you need more information. I've collected the Numaplane log if we need any other timestamp information. I also have the Numaflow log from a good run, for comparison.

KeranYang commented 1 month ago

@whynowy sure, I will take a look. @juliev0, I will look into it and we can sync up if I need more information. Thanks!

KeranYang commented 1 month ago

@juliev0, can we consistently reproduce this issue?

juliev0 commented 1 month ago

No, this is not easily reproducible at all; most of the time it succeeds. But I don't want any occasional intermittent failures. For each intermittent CI failure I see, our CI now stores:

  1. the numaflow log
  2. the numaplane log
  3. as of today, a "watch" on the Pipeline resource as its spec/status changes, with corresponding timestamps (see the sketch below).

Given that these race conditions are not easily reproducible, our plan will likely need to be studying those artifacts to diagnose what happened.
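For reference, one way such a "watch" with timestamps can be captured is with client-go's dynamic client. The sketch below is illustrative only and not necessarily what our CI actually runs; the Pipeline GVR and the status.phase field come from the Numaflow Pipeline CRD, everything else is an assumption.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (assumes running outside the cluster).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// GVR of the Numaflow Pipeline custom resource.
	gvr := schema.GroupVersionResource{Group: "numaflow.numaproj.io", Version: "v1alpha1", Resource: "pipelines"}

	w, err := dc.Resource(gvr).Namespace("numaplane-system").Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	// Log every event with a wall-clock timestamp so spec/status changes can be
	// lined up against the numaflow and numaplane controller logs.
	for ev := range w.ResultChan() {
		u, ok := ev.Object.(*unstructured.Unstructured)
		if !ok {
			continue
		}
		phase, _, _ := unstructured.NestedString(u.Object, "status", "phase")
		fmt.Printf("%s %-8s %s generation=%d phase=%s\n",
			time.Now().Format(time.RFC3339Nano), ev.Type, u.GetName(), u.GetGeneration(), phase)
	}
}
```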

Have you had a chance to look at my description under "What I suspect"? I believe there is an issue in which the Vertex gets updated, we fail reconciliation due to a Resource Version conflict error, and then re-reconcile, but at this point our Vertices are no longer the same as they were before, which means our Jobs are incorrect.

juliev0 commented 1 month ago

So that we have it for later, I'm attaching the numaplane log:

numaplane-controller.log

I should've linked to the original GitHub Actions failure in this issue so we would have the test log as well. That one is not as easy for me to find.

KeranYang commented 1 month ago

@juliev0 Yes, please share the link to the action failure, just so I can download the numaflow controller logs. Thanks!

juliev0 commented 1 month ago

> @juliev0 Yes, please share the link to the action failure, just so I can download the numaflow controller logs. Thanks!

Unfortunately, I can't easily find this. But I have attached the numaflow controller log itself.

I think I can now see exactly what happened here: during the 2nd reconciliation, the "in" vertex represents the new state, and the "cat" vertex now exists and represents the new state. The "out" vertex failed to get updated, so it still represents the old state.

2nd reconciliation state:

old buffers: in, cat, out
new buffers: in, cat, out
Therefore, no new buffers to create.

old buckets:
  according to "in" vertex: in-cat
  according to "cat" vertex: in-cat, cat-out
  according to "out" vertex: in-out
Buckets according to the "pipeline" definition should be: in-cat, cat-out
Therefore, we only need to delete in-out.

This is exactly what we see in the log: there was one Deletion Job for in-out and no Creation Job at all.
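A quick check of that arithmetic, mirroring the walkthrough above rather than Numaflow's actual code: take the union of the buckets each current Vertex reports, compare it against the buckets the pipeline definition wants, and the only difference is in-out on the delete side.

```go
package main

import "fmt"

func main() {
	// Buckets as each Vertex sees them during the 2nd reconciliation.
	perVertex := map[string][]string{
		"in":  {"in-cat"},            // updated on the previous reconciliation
		"cat": {"in-cat", "cat-out"}, // created on the previous reconciliation
		"out": {"in-out"},            // update failed, still the old topology
	}

	// Union of the per-Vertex views = the "old" buckets the controller works from.
	old := map[string]bool{}
	for _, bs := range perVertex {
		for _, b := range bs {
			old[b] = true
		}
	}

	// Buckets the pipeline definition (in->cat->out) actually wants.
	desired := map[string]bool{"in-cat": true, "cat-out": true}

	var toCreate, toDelete []string
	for b := range desired {
		if !old[b] {
			toCreate = append(toCreate, b)
		}
	}
	for b := range old {
		if !desired[b] {
			toDelete = append(toDelete, b)
		}
	}

	fmt.Println("buckets to create:", toCreate) // [] -> no Creation Job
	fmt.Println("buckets to delete:", toDelete) // [in-out] -> one Deletion Job
}
```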