nephio-project / nephio

Nephio is a Kubernetes-based automation platform for deploying and managing highly distributed, interconnected workloads such as 5G Network Functions, and the underlying infrastructure on which those workloads depend.
Apache License 2.0
108 stars 53 forks source link

High Package Approval Time #462

Open arora-sagar opened 11 months ago

arora-sagar commented 11 months ago

While checking for R2, I realized that the package approval time is more than R1. Sometimes even after 15 mins the approval doesn't happen so I have to approve it manually.

I tried it several times and it seems that the issue is reproducible.

@johnbelamaric if needed I can add the logs of the porch controller.

johnbelamaric commented 11 months ago

https://github.com/nephio-project/nephio/pull/452

liamfallon commented 5 months ago

related to https://github.com/nephio-project/nephio/issues/398

liamfallon commented 5 months ago

Triaged Triage Comment: Urgent

liamfallon commented 5 months ago

I have run multiple Nephio end to end tests locally, but I am not seeing this issue on Porch. The packages are approved in a reasonable amount of time seconds/small number of minutes.

However, the resources associated with the packages themselves can take a long time to come up (~10 minutes for cluster creation over capi). This is not a porch issue but an issue in the package itself.

Logging from e2e test to follow once the PR below containing better logging is merged. https://github.com/nephio-project/test-infra/pull/267

arora-sagar commented 5 months ago

I have run multiple Nephio end to end tests locally, but I am not seeing this issue on Porch. The packages are approved in a reasonable amount of time seconds/small number of minutes.

However, the resources associated with the packages themselves can take a long time to come up (~10 minutes for cluster creation over capi). This is not a porch issue but an issue in the package itself.

Logging from e2e test to follow once the PR below containing better logging is merged. nephio-project/test-infra#267

Thank you for looking into this. I totally agree and I learned about it when I was working on Onesummit demo. The problem is mostly related to the package. About cluster creation, I can understand 10 mins. But the same is not the case when NF package has a dependencies CR. So it can be coming from functions, interface-fn, nfdeploy-fn or dnn-fn.

I hope logging is able to show us the reason.

liamfallon commented 5 months ago

Below is the timestamps extracted from the log from this test run on Saturday:

approve.csv

liamfallon commented 5 months ago

This PR generates the CSV from the log. https://github.com/nephio-project/test-infra/pull/268

electrocucaracha commented 5 months ago

The run_test function generates some metrics (test execution time and MB downloaded). There are also some warning messages triggered when the waiting time exceeds 2/3 of expected (https://github.com/nephio-project/test-infra/blob/main/e2e/lib/k8s.sh#L44)

electrocucaracha commented 5 months ago

This is a sample of a recent execution

Test tasks Time Download
001.sh 591 secs 1,105 MB
002.sh 1038 secs 2,212 MB
003.sh 47 secs 177 MB
004.sh 100 secs 222 MB
005.sh 664 secs 118 MB
006.sh 855 secs 31 MB
007.sh 1019 secs 143 MB
008.sh 321 secs 10 MB
009.sh 17 secs 0 MB

BTW, the sandbox creation takes ~10 min

efiacor commented 3 months ago

Some observations from debugging.

Testing using the UPF edge deploy flow - https://github.com/nephio-project/test-infra/blob/main/e2e/tests/free5gc/005.sh

The mutator pipeline execution can take some time (~5 min) and the conditions related to the readiness gates are updated to true in good time.

But the delay appears to be related to the approval controller not "getting" the latest packagerevision until at least 5 mins later. https://github.com/nephio-project/nephio/blob/main/controllers/pkg/reconcilers/approval/reconciler.go#L94

and fails to verify the readiness gates - https://github.com/nephio-project/nephio/blob/main/controllers/pkg/reconcilers/approval/reconciler.go#L131

With some expertise from others, we determined that this may be related to how the Watch notification is impl in porch. https://github.com/nephio-project/porch/blob/main/pkg/registry/porch/watch.go

@johnbelamaric Would you have some time to direct us on the best approach here? Of all the existing controllers in the nephio repo, this one is at least of some use to generic users of the nephio operator so it would be great if we could improve it.

johnbelamaric commented 3 months ago

@johnbelamaric Would you have some time to direct us on the best approach here?

I am not sure I have a great answer here. It's hard to know without more investigation as to exactly what is going on. Can you track the propagation of the change in the cases where you are seeing delays? That is, if approval controller is not seeing the change, is it stuck on an older resource version (commit)? Or is the problem that the Porch API server is somehow not seeing the change?

If it's the first one (porch API sees the change, approval controller doesn't), then the issues is either a watch event that failed to publish (porch issue) or one that somehow was missed by the approval controller.

If it's the second one (porch API does not see the change), then we need to look deeper into how there is a mismatch between Git and the Porch API server. I did have a PR that touched this area, that has never merged (I merged it a long time ago but the Google internal TNA team found some issues so I backed it out): https://github.com/nephio-project/porch/pull/8. That covers how the internal cache is maintained withing Porch and was primarily focused on performance degradation over time. But it may be relevant in this case as it pertains to consistency of the cache.

efiacor commented 3 months ago

Ok. Some progress, I think. If we change the packagerev Get in the approval controller to use a Reader, we get a substantial reduction in timings for the free5gc e2e run.

TEST SUMMARY

001.sh: PASS in 182 seconds 002.sh: PASS in 315 seconds 003.sh: PASS in 50 seconds 004.sh: PASS in 48 seconds 005.sh: PASS in 273 seconds 006.sh: PASS in 464 seconds 007.sh: PASS in 413 seconds 008.sh: PASS in 307 seconds 009.sh: PASS in 31 seconds


This is an just a single run and I did observe that the amf pkg got stuck in a proposed state and needed to be push on manually.

johnbelamaric commented 3 months ago

That looks a lot better!

efiacor commented 3 months ago

Second run on fresh VM.

TEST SUMMARY

001.sh: PASS in 191 seconds 002.sh: PASS in 290 seconds 003.sh: PASS in 45 seconds 004.sh: PASS in 52 seconds 005.sh: PASS in 374 seconds 006.sh: PASS in 400 seconds 007.sh: PASS in 565 seconds 008.sh: PASS in 55 seconds 009.sh: PASS in 25 seconds


Not very consistent so far, but better timings.

We still have an issue with the amf/smf deploy. Can't see why but the amf pkg gets approved quite early and then gets stuck trying to publish a v2 revision: err approval policy "initial" not met

See git commit graph: Screenshot from 2024-08-13 23-36-59

johnbelamaric commented 3 months ago

I think the "initial" policy won't approve a v2! Thus the name - it approves an initial version but not subsequent edits. Looks like previously it only worked because some initial deltas got batched into a single version. Now with faster approvals, the approval controller is winning a race it used to lose!

Hacky workaround: add an approval delay annotation to that package to delay approval a little bit. It's akin to fixing a race condition with a sleep - bad practice but may help temporarily.

We will need a better policy than "initial" to handle this properly.

efiacor commented 3 months ago

Added a 1m delay to the amf pvs. It gets proposed along the lines somewhere and then stalls with:

events owning PackageVariant not Ready

Checked the pv and the status is:

https://github.com/nephio-project/porch/blob/main/pkg/engine/engine.go#L992

efiacor commented 2 months ago

Tweaked the requeue duration to 15 secs. Added a 3 min delay on the AMF & SMF PVs. All ran though without manual intervention.

TEST SUMMARY

001.sh: PASS in 144 seconds 002.sh: PASS in 277 seconds 003.sh: PASS in 63 seconds 004.sh: PASS in 58 seconds 005.sh: PASS in 748 seconds 006.sh: PASS in 412 seconds 007.sh: PASS in 619 seconds 008.sh: PASS in 48 seconds 009.sh: PASS in 21 seconds


kispaljr commented 2 months ago

Much better numbers! :) Congratulations!

kispaljr commented 2 months ago

Just one minor question: have you considered using the existing porchRESTClient for the Get() call in the approval controller to bypass the cache, instead of introducing yet another client?