nephio-project / nephio

Nephio is a Kubernetes-based automation platform for deploying and managing highly distributed, interconnected workloads such as 5G Network Functions, and the underlying infrastructure on which those workloads depend.
Apache License 2.0

porch: kpt alpha rpkg get fails when a couple hundred branches #599

Open liamfallon opened 5 months ago

liamfallon commented 5 months ago

Original issue URL: https://github.com/kptdev/kpt/issues/3882
Original issue user: https://github.com/johnbelamaric
Original issue created at: 2023-03-14T16:58:44Z
Original issue last updated at: 2023-03-16T21:10:45Z
Original issue body:

Expected behavior

Valid list of package revisions is returned.

Actual behavior

jbelamaric@jbelamaric:~/proj/tmp/cachingdns-topology$ kpt alpha rpkg get
Error: Get "https://35.192.14.90/apis/porch.kpt.dev/v1alpha1/namespaces/default/packagerevisions": stream error: stream ID 1; INTERNAL_ERROR; received from peer 
jbelamaric@jbelamaric:~/proj/tmp/cachingdns-topology$ k get packagerevisions
Unable to connect to the server: stream error: stream ID 1; INTERNAL_ERROR; received from peer
jbelamaric@jbelamaric:~/proj/tmp/cachingdns-topology$ k get po -n porch-system
NAME                                 READY   STATUS    RESTARTS       AGE
function-runner-77946d6686-jv8kk     1/1     Running   0              5d18h
function-runner-77946d6686-rn57r     1/1     Running   0              5d18h
porch-controllers-5d67bb9fdf-4fs4l   1/1     Running   0              22h
porch-server-78dd559589-qmrvl        1/1     Running   17 (23h ago)   5d

Information

Due to #3877 there are a couple hundred branches after running overnight (see image below).

Porch v0.0.15, kpt v1.0.0-beta.23

image

Steps to reproduce the behavior

Original issue comments:

Comment user: https://github.com/johnbelamaric
Comment created at: 2023-03-14T16:59:36Z
Comment last updated at: 2023-03-14T16:59:36Z
Comment body: porch-server.log

Comment user: https://github.com/johnbelamaric
Comment created at: 2023-03-14T16:59:55Z
Comment last updated at: 2023-03-14T16:59:55Z
Comment body: I didn't see any obvious crashes in the porch server logs.

Comment user: https://github.com/johnbelamaric
Comment created at: 2023-03-14T18:18:29Z
Comment last updated at: 2023-03-14T18:18:29Z
Comment body: FYI, I manually deleted all those 200+ branches and now it's working again.

Comment user: https://github.com/natasha41575
Comment created at: 2023-03-16T16:56:28Z
Comment last updated at: 2023-03-16T16:57:28Z
Comment body: Hmm, not able to reproduce this one either. I thought maybe your packages might be too large but they all seem reasonably small. I tried to reproduce with https://github.com/natasha41575/blueprints (which has 333 branches atm) and it does take a second or two, but kpt alpha rpkg get still works with porch both running in kind and locally.

Might this be similar to https://github.com/GoogleContainerTools/kpt/issues/3877#issuecomment-1470124957, that porch may have entered a strange error state near the beginning? Would you be able to recreate the 200 branches and see if the issue is still there?

If you need a quick way to create the branches, I created my 200 branches by setting deletionPolicy: orphan in my PV (PackageVariant) and running:

for i in {1..200}; do kubectl delete -f packagevariant.yaml; sleep 0.5; kubectl apply -f packagevariant.yaml; sleep 0.5; done
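For anyone who wants to rerun this reproduction with a different branch count or pause, the delete/apply loop described above could be wrapped in a small shell function. This is only a sketch: the function name churn_packagevariant and the KUBECTL override are hypothetical and not part of kpt or Porch, and it assumes packagevariant.yaml already sets deletionPolicy: orphan so that each delete leaves its branch behind.

```shell
# Sketch of the delete/apply loop from the comment above.
# churn_packagevariant and KUBECTL are hypothetical names, not kpt or Porch tooling.
# Arguments: manifest path, number of iterations (default 200), pause in seconds (default 0.5).
churn_packagevariant() {
  manifest=$1
  count=${2:-200}
  pause=${3:-0.5}
  i=1
  while [ "$i" -le "$count" ]; do
    # Deleting with deletionPolicy: orphan is what leaves the old branch in the repo.
    ${KUBECTL:-kubectl} delete -f "$manifest"
    sleep "$pause"
    # Re-applying creates a fresh downstream package, and with it a new branch.
    ${KUBECTL:-kubectl} apply -f "$manifest"
    sleep "$pause"
    i=$((i + 1))
  done
}
```

Calling churn_packagevariant packagevariant.yaml with no further arguments matches the original one-liner (200 iterations, 0.5 s pauses).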

Comment user: https://github.com/johnbelamaric
Comment created at: 2023-03-16T17:18:21Z
Comment last updated at: 2023-03-16T17:18:21Z
Comment body: I wonder if it has to do with running on an autopilot cluster with guaranteed pods (not burstable):

        name: porch-server
        resources:
          limits:
            cpu: 250m
            ephemeral-storage: 1Gi
            memory: 512Mi
          requests:
            cpu: 250m
            ephemeral-storage: 1Gi
            memory: 512Mi

Comment user: https://github.com/natasha41575
Comment created at: 2023-03-16T17:47:29Z
Comment last updated at: 2023-03-16T17:48:48Z
Comment body: Could you share the memory utilization of your pods to see if it is going over the limits? I spun up an autopilot cluster with the same limits to try it out but again did not hit the same issue.
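One possible way to gather the memory figures requested above, given that porch-server shows 17 restarts in the pod listing, is sketched below. It assumes the cluster exposes resource metrics to kubectl top (for example through metrics-server); the function name porch_memory_report and the KUBECTL override are hypothetical and used only for illustration.

```shell
# Sketch: report memory usage and last termination reason for the Porch pods.
# porch_memory_report and KUBECTL are hypothetical names, not kpt or Porch tooling.
# Assumes resource metrics are available to `kubectl top`.
porch_memory_report() {
  namespace=${1:-porch-system}
  # Current CPU and memory per container, to compare against the 512Mi limit.
  ${KUBECTL:-kubectl} top pod -n "$namespace" --containers
  # Reason recorded for each pod's most recent container termination, if any
  # (an out-of-memory kill is reported as OOMKilled).
  ${KUBECTL:-kubectl} get pods -n "$namespace" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}
```

Running porch_memory_report with no arguments targets the porch-system namespace shown in the issue; an empty reason column means no previous termination was recorded for that pod.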

Comment user: https://github.com/natasha41575
Comment created at: 2023-03-16T21:10:45Z
Comment last updated at: 2023-03-16T21:10:45Z
Comment body: I said this on the other issue too, but I'm going to try to reproduce your setup with the script you sent me so I can investigate more productively.

liamfallon commented 4 months ago

Triaged

Triage Comment: Reproduce this, see how serious it is, part of scaling/stability work

kbijakowski commented 2 weeks ago

I noticed a similar issue while deploying the Free5GC use case with Nephio v2.0.0. I was working with only a few branches (5 at most).

Sometimes the PackageRevision controller didn't respond: kubectl get packagerevision timed out (similar to this issue). At the same time, getting other resources, e.g. PackageVariant or PackageVariantSet, succeeded. This behavior is not deterministic, but it seems to occur when Porch is under load (for example during the auto-approval process for the AMF / SMF / UPF packages).

I run the Nephio sandbox on bare metal with 48 vCPUs, 384 GB of RAM and 200 GB of storage on the available partitions.