kptdev / kpt

Automate Kubernetes Configuration Editing
https://kpt.dev
Apache License 2.0

live apply stuck #3395

Open bharathkkb opened 2 years ago

bharathkkb commented 2 years ago

Expected behavior

live apply proceeds after explicit depends-on target is reconciled

Actual behavior

live apply stalls after explicit depends-on target is reconciled

Information

Scenario 1

PubSubTopic resource has a dependency on Project via a depends-on annotation. LoggingLogSink is dependent on PubSubTopic resource via pubSubTopicRef in KCC but has no depends-on annotation.
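For illustration only: a depends-on annotation value names the object to wait for. The sketch below parses such a value, assuming the format documented for the config.kubernetes.io/depends-on annotation (group/namespaces/namespace/kind/name for namespaced objects, group/kind/name for cluster-scoped ones). The ObjRef type, the parse_depends_on helper, and the example value (including the config-control namespace for the Project) are hypothetical and not taken from the package linked in this issue.

```python
from typing import List, NamedTuple, Optional

# Annotation key used for explicit apply ordering.
DEPENDS_ON_ANNOTATION = "config.kubernetes.io/depends-on"


class ObjRef(NamedTuple):
    """Hypothetical container for one parsed dependency reference."""
    group: str
    kind: str
    name: str
    namespace: Optional[str]


def parse_depends_on(value: str) -> List[ObjRef]:
    """Parse a comma-separated depends-on value into object references."""
    refs = []
    for item in value.split(","):
        parts = item.strip().split("/")
        if len(parts) == 5 and parts[1] == "namespaces":
            # Namespaced: <group>/namespaces/<namespace>/<kind>/<name>
            group, _, namespace, kind, name = parts
            refs.append(ObjRef(group, kind, name, namespace))
        elif len(parts) == 3:
            # Cluster-scoped: <group>/<kind>/<name>
            group, kind, name = parts
            refs.append(ObjRef(group, kind, name, None))
        else:
            raise ValueError(f"unrecognized depends-on reference: {item!r}")
    return refs


# Assumed annotation on the PubSubTopic in Scenario 1 (value is illustrative).
example = "resourcemanager.cnrm.cloud.google.com/namespaces/config-control/Project/PROJECT_ID"
print(parse_depends_on(example))
```

The LoggingLogSink carries no such annotation in Scenario 1, so an applier that reads only this annotation has no edge from the sink to the topic.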

When we run kpt live apply, this is the output:

installing inventory ResourceGroup CRD.
inventory update started
inventory update finished
apply phase started
configmap/setters apply skipped: inventory policy prevented actuation (strategy: Apply, status: NoMatch, policy: MustMatch)
logginglogsink.logging.cnrm.cloud.google.com/PROJECT_ID-pubsubsink apply successful
project.resourcemanager.cnrm.cloud.google.com/PROJECT_ID apply successful
apply phase finished
reconcile phase started
configmap/setters reconcile skipped
logginglogsink.logging.cnrm.cloud.google.com/PROJECT_ID-pubsubsink reconcile pending
project.resourcemanager.cnrm.cloud.google.com/PROJECT_ID reconcile successful
project.resourcemanager.cnrm.cloud.google.com/PROJECT_ID reconcile pending
project.resourcemanager.cnrm.cloud.google.com/PROJECT_ID reconcile successful

It seems to be stuck after the Project is reconciled. Running live status in another window:

kpt live status --poll-until=forever
pubsubtopic.pubsub.cnrm.cloud.google.com/PROJECT_ID-ps-dataset is NotFound: Resource not found
project.resourcemanager.cnrm.cloud.google.com/PROJECT_ID is InProgress: Update in progress
logginglogsink.logging.cnrm.cloud.google.com/PROJECT_ID-pubsubsink is InProgress: reference PubSubTopic config-control/PROJECT_ID-ps-dataset is not found
project.resourcemanager.cnrm.cloud.google.com/PROJECT_ID is Current: Resource is Current

If I exit and reapply it still seems to be stuck.

installing inventory ResourceGroup CRD.
inventory update started
inventory update finished
apply phase started
logginglogsink.logging.cnrm.cloud.google.com/PROJECT_ID-pubsubsink apply successful
project.resourcemanager.cnrm.cloud.google.com/PROJECT_ID apply successful
apply phase finished
reconcile phase started
logginglogsink.logging.cnrm.cloud.google.com/PROJECT_ID-pubsubsink reconcile pending
project.resourcemanager.cnrm.cloud.google.com/PROJECT_ID reconcile successful

My suspicion is that it is waiting for logginglogsink to reconcile, since it applied it in the initial apply phase. However, as logginglogsink has a pubSubTopicRef to the pubsub resource, it is unable to make progress. If this is the case, I believe kpt should look only at explicit deps when deciding whether to start the next apply phase.

Scenario 2

If I add an explicit depends-on to the LoggingLogSink resource so that it waits for the PubSubTopic resource, and start a fresh apply, it seems to complete immediately without waiting for the resources to reconcile, although it reports that they had reconciled. Note: this was another brand new project, not a continuation of Scenario 1.

installing inventory ResourceGroup CRD.
inventory update started
inventory update finished
apply phase started
project.resourcemanager.cnrm.cloud.google.com/PROJECT_ID apply successful
apply phase finished
reconcile phase started
project.resourcemanager.cnrm.cloud.google.com/PROJECT_ID reconcile successful
reconcile phase finished
apply phase started
pubsubtopic.pubsub.cnrm.cloud.google.com/PROJECT_ID-ps-dataset apply successful
apply phase finished
reconcile phase started
pubsubtopic.pubsub.cnrm.cloud.google.com/PROJECT_ID-ps-dataset reconcile successful
reconcile phase finished
apply phase started
logginglogsink.logging.cnrm.cloud.google.com/PROJECT_ID-pubsubsink apply successful
apply phase finished
reconcile phase started
logginglogsink.logging.cnrm.cloud.google.com/PROJECT_ID-pubsubsink reconcile successful
reconcile phase finished
inventory update started
inventory update finished
apply result: 3 attempted, 3 successful, 0 skipped, 0 failed
reconcile result: 3 attempted, 3 successful, 0 skipped, 0 failed, 0 timed out

Running live status right after this shows that the project is still reconciling and the others are erroring because the project is not created yet.

pubsubtopic.pubsub.cnrm.cloud.google.com/PROJECT_ID-ps-dataset is Failed: Update call failed: error applying desired state: summary: Error creating Topic: googleapi: Error 404: Requested project not found or user does not have access to it (project=PROJECT_ID). Make sure to specify the unique project identifier and not the Google Cloud Console display name.
project.resourcemanager.cnrm.cloud.google.com/PROJECT_ID is InProgress: Update in progress
logginglogsink.logging.cnrm.cloud.google.com/PROJECT_ID-pubsubsink is InProgress: reference PubSubTopic config-control/PROJECT_ID-ps-dataset is not ready
project.resourcemanager.cnrm.cloud.google.com/PROJECT_ID is Current: Resource is Current
pubsubtopic.pubsub.cnrm.cloud.google.com/PROJECT_ID-ps-dataset is InProgress: Update in progress
pubsubtopic.pubsub.cnrm.cloud.google.com/PROJECT_ID-ps-dataset is Current: Resource is Current
logginglogsink.logging.cnrm.cloud.google.com/PROJECT_ID-pubsubsink is Current: Resource is Current

Eventually everything reconciles, but it behaves as if everything were applied at once, without depends-on.

Kpt Version: 1.0.0-beta.17

Kpt Package that can demonstrate the error: https://github.com/bharathkkb/kpt-live-depends-issue

bharathkkb commented 2 years ago

/cc @karlkfi if you have any ideas or my yaml is misconfigured

karlkfi commented 2 years ago

The behavior in Scenario 1 seems expected. Without the depends-on, kpt doesn't know the LoggingLogSink depends on the PubSubTopic, and the LoggingLogSink won't become reconciled until after the PubSubTopic is applied and reconciled. So kpt applies the Project and LoggingLogSink and waits forever for the LoggingLogSink to reconcile, which it won't, because the PubSubTopic hasn't been applied.

You can make it time out with the --reconcile-timeout flag, but by default it waits forever.

The behavior of Scenario 2 seems to imply that Project, PubSubTopic, and LoggingLogSink only wait for their dependencies to exist in KRM and not GCP before being recognized as reconciled. But I'll need to try it to know for sure. Or you can paste the objects' YAML with their full status at the end of the kpt live apply using something like kubectl get -f ./ -o yaml.

I'm guessing that kstatus is recognizing KCC dependency errors as not-reconciled, but isn't recognizing whatever the in-progress condition is as not-reconciled. There are a number of "standard" ways kstatus recognizes reconciliation, and KCC may not have implemented them consistently. At least, that's my guess for now. It could also be a bug, but I'd like to apply Occam's razor and rule out the simpler explanation first.
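To make that guess concrete, here is a deliberately simplified sketch of a condition-based status check. It is not kstatus's actual code, and the rules shown (an observedGeneration comparison plus a Ready condition) are only two commonly used signals, chosen as assumptions for illustration. It shows one hypothetical way an object could be reported as Current too early: if the controller has not yet written any status, a check like this has nothing that says "in progress".

```python
from typing import Any, Dict


def compute_status(obj: Dict[str, Any]) -> str:
    """Toy reconciliation check; returns "InProgress" or "Current"."""
    meta = obj.get("metadata", {})
    status = obj.get("status", {})

    # Signal 1: the controller has not yet observed the latest spec.
    observed = status.get("observedGeneration")
    if observed is not None and observed != meta.get("generation"):
        return "InProgress"

    # Signal 2: a Ready condition exists and is not True.
    for cond in status.get("conditions", []):
        if cond.get("type") == "Ready" and cond.get("status") != "True":
            return "InProgress"

    # No signal says otherwise, so the object is treated as reconciled.
    return "Current"


# Freshly created object, no status written yet (hypothetical example).
fresh = {"metadata": {"generation": 1}}
# Object whose controller reports that it is not ready.
not_ready = {
    "metadata": {"generation": 1},
    "status": {
        "observedGeneration": 1,
        "conditions": [{"type": "Ready", "status": "False"}],
    },
}
print(compute_status(fresh), compute_status(not_ready))
```

Whether this is what happens with the KCC objects in Scenario 2 would need to be confirmed from the objects' full status.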

karlkfi commented 2 years ago

FWIW, if you remove all the depends-on, it should work fine too. It would be similar to Scenario 2, except all in one apply & reconcile phase.

But if the KCC objects are being detected as reconciled when they're not in GCP yet, we should probably fix that, either in KCC or kstatus.

bharathkkb commented 2 years ago

Thanks for looking into this @karlkfi!

> The behavior in Scenario 1 seems expected. Without the depends-on, kpt doesn't know the LoggingLogSink depends on the PubSubTopic, and the LoggingLogSink won't become reconciled until after the PubSubTopic is applied and reconciled. So kpt applies the Project and LoggingLogSink and waits forever for the LoggingLogSink to reconcile, which it won't, because the PubSubTopic hasn't been applied.

Wouldn't it be better to proceed to applying PubSubTopic as soon as Project is resolved as LoggingLogSink is not a dependency?

> The behavior of Scenario 2 seems to imply that Project, PubSubTopic, and LoggingLogSink only wait for their dependencies to exist in KRM

Right, and what's odd is that live status does seem to pick up the UpdateFailed status. I will try to grab some logs tomorrow.

> FWIW, if you remove all the depends-on, it should work fine too. It would be similar to Scenario 2, except all in one apply & reconcile phase.

Yeah, this is what we currently have, but we are hoping to leverage depends-on to reduce some of the log verbosity for automation when we have a lot of resources.

karlkfi commented 2 years ago

> Wouldn't it be better to proceed to applying PubSubTopic as soon as Project is resolved as LoggingLogSink is not a dependency?

Yes. It’s been on my TODO list to rewrite the task scheduler to use an asynchronous dependency graph, but for now the implementation is a graph flattened into phases. So if a reconcile phase doesn’t succeed, it doesn’t continue.

If you put a reconcile timeout on it, it might actually succeed, but the task scheduler doesn’t know that.
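The difference between the two scheduling approaches can be sketched with a small simulation of Scenario 1. This is a toy model under stated assumptions, not kpt's scheduler: DEPENDS_ON holds the explicit depends-on edges, NEEDS_TO_RECONCILE models what each controller actually requires (including the pubSubTopicRef that the applier cannot see), and all function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Explicit depends-on edges in Scenario 1 (object -> objects it waits for).
DEPENDS_ON: Dict[str, Set[str]] = {
    "Project": set(),
    "PubSubTopic": {"Project"},
    "LoggingLogSink": set(),  # no depends-on annotation
}

# What each object really needs before it can reconcile (hidden from the applier).
NEEDS_TO_RECONCILE: Dict[str, Set[str]] = {
    "Project": set(),
    "PubSubTopic": {"Project"},
    "LoggingLogSink": {"PubSubTopic"},  # via pubSubTopicRef
}


def flatten_into_phases(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group objects into ordered phases; each phase depends only on earlier ones."""
    remaining = {name: set(d) for name, d in deps.items()}
    phases: List[List[str]] = []
    done: Set[str] = set()
    while remaining:
        ready = sorted(n for n, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle")
        phases.append(ready)
        done.update(ready)
        for n in ready:
            del remaining[n]
    return phases


def settle(applied: Set[str], reconciled: Set[str]) -> Set[str]:
    """Reconcile every applied object whose real prerequisites are met, to a fixed point."""
    reconciled = set(reconciled)
    changed = True
    while changed:
        changed = False
        for name in sorted(applied - reconciled):
            if NEEDS_TO_RECONCILE[name] <= reconciled:
                reconciled.add(name)
                changed = True
    return reconciled


def run_phased(deps: Dict[str, Set[str]]) -> Tuple[Set[str], List[str]]:
    """Apply phase by phase; stop at the first phase that cannot fully reconcile."""
    applied: Set[str] = set()
    reconciled: Set[str] = set()
    for phase in flatten_into_phases(deps):
        applied.update(phase)
        reconciled = settle(applied, reconciled)
        if not set(phase) <= reconciled:
            return reconciled, sorted(set(phase) - reconciled)
    return reconciled, []


def run_graph(deps: Dict[str, Set[str]]) -> Tuple[Set[str], List[str]]:
    """Apply each object as soon as its explicit dependencies are reconciled."""
    applied: Set[str] = set()
    reconciled: Set[str] = set()
    while True:
        ready = {n for n, d in deps.items() if n not in applied and d <= reconciled}
        if not ready:
            break
        applied.update(ready)
        reconciled = settle(applied, reconciled)
    return reconciled, sorted(set(deps) - reconciled)


print("phases:", flatten_into_phases(DEPENDS_ON))
print("phased:", run_phased(DEPENDS_ON))
print("graph: ", run_graph(DEPENDS_ON))
```

In this model the phased run stops with LoggingLogSink pending in the first phase, so PubSubTopic is never applied, while the graph-based run applies PubSubTopic once Project reconciles and everything completes. The model has no notion of a reconcile timeout.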