Here is the numaflow controller log. I added comments in there for when Numaplane paused the pipeline, and then where it updated the Pipeline's spec (scroll down to ~2024-09-23T04:51:07.9)
@KeranYang - do you want to take a look?
Let me know if you need more information. I've collected the Numaplane log if we need any other timestamp information. I also have the Numaflow log from a comparative good run.
@whynowy sure, I will take a look. @juliev0, I will look into it and we can sync up if I need more information. Thanks!
@juliev0, can we consistently reproduce this issue?
No, this is not easily reproducible at all. Most of the time it is successful. But I don't want there to be any occasional intermittent failures. For any occasional intermittent CI issue I see, our CI is now storing the `Pipeline` resource as its spec/status changes, with corresponding timelines. Given that these race conditions are not easily reproducible, our plan will likely need to be studying those artifacts to diagnose what happened.
Have you had a chance to look at my description under **What I suspect**? I believe there is an issue in which the Vertex gets updated, reconciliation fails due to a Resource Version conflict error, and we then re-reconcile, but at this point our Vertex is no longer the same as it was before, which means our Jobs are incorrect.
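To illustrate the shape of the failure mode I mean, here is a rough, self-contained sketch. Everything in it is simulated with my own names (`applyVertex`, the hard-coded conflict on "out"); it is not the actual Numaflow controller code.

```go
// Rough sketch of the suspected failure mode: Vertices are applied one at a
// time, and a resource-version conflict on any of them aborts the pass, while
// the Vertices applied earlier already persist in their new state.
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// applyVertex pretends to create/update one Vertex; "out" fails with a
// resource-version conflict, mimicking what the attached log shows.
func applyVertex(name string) error {
	if name == "out" {
		return errConflict
	}
	return nil
}

func main() {
	var applied []string
	for _, v := range []string{"in", "cat", "out"} {
		if err := applyVertex(v); err != nil {
			// The reconciliation returns here and is retried later, but the
			// retry starts from a different "existing" state than this pass did.
			fmt.Printf("conflict on %q, requeue; already applied: %v\n", v, applied)
			return
		}
		applied = append(applied, v)
	}
	fmt.Println("all vertices applied:", applied)
}
```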
So that we have it for later, I'm attaching the Numaplane log:
I should've linked to the original GitHub Actions failure in this issue so we would have the test log as well. This one is not as easy for me to find.
@juliev0 Yes, please share the link to the action failure, just so I can download the numaflow controller logs. Thanks!
Unfortunately, I can't easily find this. But I have attached the numaflow controller log itself.
I think I can see exactly what happened here: during the 2nd reconciliation, the "in" vertex represents the new state, the "cat" vertex now exists and represents the new state, and the "out" vertex failed to get updated, so it still represents the old state.
2nd reconciliation state:

Old buffers: in, cat, out
New buffers: in, cat, out
Therefore, no new buffers to create.

Old buckets:
- according to the "in" vertex: in-cat
- according to the "cat" vertex: in-cat, cat-out
- according to the "out" vertex: in-out

Buckets according to the "pipeline" definition should be: in-cat, cat-out
Therefore, we only need to delete in-out.
This is exactly what we see in the log: there was one Deletion Job for in-out and no Creation Job at all.
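For anyone following along, here is a small self-contained simulation of that diff (the names are mine, not the controller's actual functions): "existing" buckets are derived from the edges the live Vertex objects describe, "desired" comes from the Pipeline definition, and the result reproduces the single Deletion Job and the missing Creation Job.

```go
// Simulates the bucket diff at the 2nd reconciliation described above.
package main

import "fmt"

func main() {
	// Edges implied by the live Vertex objects at the 2nd reconciliation.
	edgesByVertex := map[string][]string{
		"in":  {"in-cat"},            // already updated to the new state
		"cat": {"in-cat", "cat-out"}, // created new
		"out": {"in-out"},            // still the old state (its update hit the conflict)
	}
	desired := map[string]bool{"in-cat": true, "cat-out": true}

	existing := map[string]bool{}
	for _, edges := range edgesByVertex {
		for _, e := range edges {
			existing[e] = true
		}
	}

	var toCreate, toDelete []string
	for b := range desired {
		if !existing[b] {
			toCreate = append(toCreate, b)
		}
	}
	for b := range existing {
		if !desired[b] {
			toDelete = append(toDelete, b)
		}
	}
	fmt.Println("buckets to create:", toCreate) // []       -> no Creation Job
	fmt.Println("buckets to delete:", toDelete) // [in-out] -> one Deletion Job
}
```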
**Describe the bug**
Normally, Numaplane's e2e test passes, but this issue captures an instance in which it did not.
The sequence of events which Numaplane performed:

- Paused the Pipeline (`desiredPhase=Paused`)
- Updated the Pipeline from `in->out` to `in->cat->out` and kept its `desiredPhase=Paused`
- Set `desiredPhase=Running`
Normally, this works fine, and reconciling the change produces both a Creation Job and a Deletion Job:
(above is extracted from a good run)
However, in this run the Creation Job was never executed, which left the Daemon Pods unable to get past the `isbsvc-validate` init container's check for buffers and buckets.

**What I suspect**
In the log I see that the `in` Vertex was successfully updated and the `cat` Vertex was created, but when the `out` Vertex was supposed to be updated, a Resource Version conflict occurred here. This may have happened prior to the Creation Job being created. That should cause the Numaflow controller to return and then re-reconcile idempotently. However, note that the Creation Job depends on `newBuffers` and `newBuckets`, which are derived from these Vertex values. In this bug, the `in` and `cat` Vertices were updated successfully on the previous reconciliation, so the current state of the Vertices no longer reflects the new buffers and new buckets which need to be added. I will add logs.
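A compact way to see that dependency is to feed the same diff with the Vertex state from each pass. This is illustrative code only; `bucketsToCreate` is my stand-in for however the controller derives what the Creation Job should make (desired buckets minus the buckets implied by the existing Vertex objects).

```go
// Contrasts the two reconciliation passes described above.
package main

import "fmt"

func bucketsToCreate(desired, existingFromVertices []string) []string {
	have := map[string]bool{}
	for _, b := range existingFromVertices {
		have[b] = true
	}
	var create []string
	for _, b := range desired {
		if !have[b] {
			create = append(create, b)
		}
	}
	return create
}

func main() {
	desired := []string{"in-cat", "cat-out"}

	// Pass 1: every Vertex still describes the old topology. The full diff is
	// known here, but the pass aborts on the conflict before a Creation Job exists.
	fmt.Println(bucketsToCreate(desired, []string{"in-out"}))
	// -> [in-cat cat-out]

	// Pass 2: "in" and "cat" already describe the new topology; only "out" is
	// stale. The diff is now empty, so no Creation Job is ever made.
	fmt.Println(bucketsToCreate(desired, []string{"in-cat", "cat-out", "in-out"}))
	// -> []
}
```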
**Environment (please complete the following information):**

**Message from the maintainers:**
Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.