vmware-tanzu / kubeapps

A web-based UI for deploying and managing applications in Kubernetes clusters

Investigate and fix spurious test failures causing failing badges on the main README #6052

Closed: absoludity closed this issue 1 year ago

absoludity commented 1 year ago

Summary

Currently there are spurious failures occurring regularly enough that we should fix them, as they can result in our Main Pipeline badge being red.

There is also a failure on the full integration pipeline: it tries to create a PR to sync the latest upstream chart changes, but if one already exists (i.e. because we haven't merged it yet), that step fails and the whole pipeline fails with it.

Background and rationale

We don't want red failing badges on our main README.

Description

The four failures I'm aware of (three spurious, one not) are:

JavaScript heap out of memory while running the linter

Example

$ eslint --config ./.eslintrc.json 'src/**/*.{js,ts,tsx}' --max-warnings=0

<--- Last few GCs --->

[1862:0x5628290]    90233 ms: Scavenge (reduce) 495.7 (509.8) -> 495.2 (510.1) MB, 16.1 / 0.0 ms  (average mu = 0.222, current mu = 0.292) allocation failure 
[1862:0x5628290]    90321 ms: Scavenge (reduce) 496.1 (510.1) -> 495.3 (510.1) MB, 12.6 / 0.0 ms  (average mu = 0.222, current mu = 0.292) allocation failure 
[1862:0x5628290]    90433 ms: Scavenge (reduce) 496.2 (510.1) -> 495.4 (510.1) MB, 5.0 / 0.0 ms  (average mu = 0.222, current mu = 0.292) allocation failure 

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb08e80 node::Abort() [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
 2: 0xa1b70e  [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
 3: 0xce1890 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
 4: 0xce1c37 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
 5: 0xe992a5  [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
 6: 0xe99d86  [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
 7: 0xea82ae  [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
 8: 0xea8cf0 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
 9: 0xeabc6e v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
10: 0xe6cee2 v8::internal::Factory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
11: 0xe654f4 v8::internal::FactoryBase<v8::internal::Factory>::AllocateRawWithImmortalMap(int, v8::internal::AllocationType, v8::internal::Map, v8::internal::AllocationAlignment) [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
12: 0xe66661 v8::internal::FactoryBase<v8::internal::Factory>::NewByteArray(int, v8::internal::AllocationType) [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
13: 0xde4ab3 v8::internal::TranslationArrayBuilder::ToTranslationArray(v8::internal::Factory*) [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
14: 0x1c19af6 v8::internal::compiler::CodeGenerator::GenerateDeoptimizationData() [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
15: 0x1c1a205 v8::internal::compiler::CodeGenerator::FinalizeCode() [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
16: 0x1ca26c1 v8::internal::compiler::PipelineImpl::FinalizeCode(bool) [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
17: 0x1ca36c3 v8::internal::compiler::PipelineCompilationJob::FinalizeJobImpl(v8::internal::Isolate*) [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
18: 0xd79200 v8::internal::OptimizedCompilationJob::FinalizeJob(v8::internal::Isolate*) [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
19: 0xd7ddbb v8::internal::Compiler::FinalizeOptimizedCompilationJob(v8::internal::OptimizedCompilationJob*, v8::internal::Isolate*) [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
20: 0xda00c3 v8::internal::OptimizingCompileDispatcher::InstallOptimizedFunctions() [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
21: 0xe390f7 v8::internal::StackGuard::HandleInterrupts() [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
22: 0x11e56e5 v8::internal::Runtime_StackGuard(int, unsigned long*, v8::internal::Isolate*) [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
23: 0x15d9c19  [/opt/hostedtoolcache/node/16.19.0/x64/bin/node]
Aborted (core dumped)
error Command failed with exit code 134.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
ERROR: "eslint-check" exited with 134.
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
Error: Process completed with exit code 1.
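
A common generic mitigation for this class of failure (not necessarily the right fix here, and not what was eventually done, see the comments below) would be to raise the V8 heap limit for the lint step, along these lines; the "eslint-check" script name is taken from the log above:

# Hypothetical workaround: give the Node process a larger old-space heap (4 GB)
NODE_OPTIONS="--max-old-space-size=4096" yarn run eslint-check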

Error rolling out flux-system during flux e2e tests

Example

 INFO  ==> Checking rollout status in deployment helm-controller in ns flux-system
INFO  ==> Checking rollout status in deployment source-controller in ns flux-system
INFO  ==> Attempt failed, retrying after 10... (remaining attempts: 5)
INFO  ==> Attempt failed, retrying after 10... (remaining attempts: 4)
INFO  ==> Attempt failed, retrying after 10... (remaining attempts: 3)
INFO  ==> Attempt failed, retrying after 10... (remaining attempts: 2)
INFO  ==> Attempt failed, retrying after 10... (remaining attempts: 1)
INFO  ==> Error while rolling out deployment source-controller in ns flux-system

From the output that follows, the deployment has gone ahead, but the liveness and readiness checks are failing:

     Liveness:   http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
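
A minimal debugging sketch for when this happens (the label selector and timeout below are assumptions, not taken from our CI scripts):

# Re-check the rollout using kubectl's own timeout rather than a fixed retry count
kubectl -n flux-system rollout status deployment/source-controller --timeout=300s
# Inspect why the liveness/readiness probes are failing
kubectl -n flux-system describe pods -l app=source-controller
kubectl -n flux-system logs deployment/source-controller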

Error rolling out e2e-runner

Example

Oddly, I can't see anything wrong with the output (everything looks successful):

INFO  ==> Using E2E runner image 'kubeapps/integration-tests-ci:build-21b53ce33127deb300de4da79123e75745ca1f67'
deployment.apps/e2e-runner created
INFO  ==> Checking rollout status in deployment e2e-runner in ns default
INFO  ==> Attempt failed, retrying after 10... (remaining attempts: 5)
INFO  ==> Attempt failed, retrying after 10... (remaining attempts: 4)
INFO  ==> Attempt failed, retrying after 10... (remaining attempts: 3)
INFO  ==> Attempt failed, retrying after 10... (remaining attempts: 2)
INFO  ==> Attempt failed, retrying after 10... (remaining attempts: 1)
INFO  ==> Error while rolling out deployment e2e-runner in ns default
Name:                   e2e-runner
Namespace:              default
CreationTimestamp:      Sat, 04 Mar 2023 04:19:51 +0000
Labels:                 app=e2e-runner
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               app=e2e-runner
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=e2e-runner
  Containers:
   integration-tests-ci:
    Image:        kubeapps/integration-tests-ci:build-21b53ce33127deb300de4da79123e75745ca1f67
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   e2e-runner-55786545cd (1/1 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  113s  deployment-controller  Scaled up replica set e2e-runner-55786545cd to 1
Name:           e2e-runner-55786545cd
Namespace:      default
Selector:       app=e2e-runner,pod-template-hash=55786545cd
Labels:         app=e2e-runner
                pod-template-hash=55786545cd
Annotations:    deployment.kubernetes.io/desired-replicas: 1
                deployment.kubernetes.io/max-replicas: 2
                deployment.kubernetes.io/revision: 1
Controlled By:  Deployment/e2e-runner
Replicas:       1 current / 1 desired
Pods Status:    1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app=e2e-runner
           pod-template-hash=55786545cd
  Containers:
   integration-tests-ci:
    Image:        kubeapps/integration-tests-ci:build-21b53ce33127deb300de4da79123e75745ca1f67
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From                   Message
  ----    ------            ----  ----                   -------
  Normal  SuccessfulCreate  114s  replicaset-controller  Created pod: e2e-runner-55786545cd-p6zk6

NAME                          READY   STATUS    RESTARTS   AGE
e2e-runner-55786545cd-p6zk6   1/1     Running   0          114s
Name:         e2e-runner-55786545cd-p6zk6
Namespace:    default
Priority:     0
Node:         gke-kubeapps-test-main-0-default-pool-110aa05b-k6hz/10.142.15.229
Start Time:   Sat, 04 Mar 2023 04:19:51 +0000
Labels:       app=e2e-runner
              pod-template-hash=55786545cd
Annotations:  <none>
Status:       Running
IP:           10.20.1.14
IPs:
  IP:           10.20.1.14
Controlled By:  ReplicaSet/e2e-runner-55786545cd
Containers:
  integration-tests-ci:
    Container ID:   containerd://87186461e5a3952c5f3a48fca01def9d8b704ac551b7722acf11a078ead05081
    Image:          kubeapps/integration-tests-ci:build-21b53ce33127deb300de4da79123e75745ca1f67
    Image ID:       docker.io/kubeapps/integration-tests-ci@sha256:87d70f986cd832147cb037144d90eafcb19352c3c93fbd7d589cbdc5316dc09d
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Sat, 04 Mar 2023 04:20:54 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rp6gp (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-rp6gp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  115s  default-scheduler  Successfully assigned default/e2e-runner-55786545cd-p6zk6 to gke-kubeapps-test-main-0-default-pool-110aa05b-k6hz
  Normal  Pulling    114s  kubelet            Pulling image "kubeapps/integration-tests-ci:build-21b53ce33127deb300de4da79123e75745ca1f67"
  Normal  Pulled     52s   kubelet            Successfully pulled image "kubeapps/integration-tests-ci:build-21b53ce33127deb300de4da79123e75745ca1f67" in 1m2.224020431s
  Normal  Created    52s   kubelet            Created container integration-tests-ci
  Normal  Started    52s   kubelet            Started container integration-tests-ci
Error: Process completed with exit code 1.
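
One possible explanation (an assumption, not confirmed by the output above): the events show the image pull alone took just over a minute, while the rollout check retries only five times with a 10-second pause, so the check may give up just before the pod becomes ready. A minimal sketch of leaning on kubectl's built-in timeout instead (deployment name from the output above; the timeout value is an assumption):

# Wait up to 5 minutes for the e2e-runner deployment to become available
kubectl -n default rollout status deployment/e2e-runner --timeout=5m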

sync-chart-changes branch already exists

Example

If we're not quick to merge the generated chart-sync PRs, the full integration pipeline fails with:

Switched to a new branch 'sync-chart-changes-12.2.7'
[sync-chart-changes-12.2.7 fc68b701d] bump chart version to 12.2.7
 4 files changed, 12 insertions(+), 10 deletions(-)
The remote branch 'sync-chart-changes-12.2.7' already exists, please check if there is already an open PR at the repository 'vmware-tanzu/kubeapps'
Error: Process completed with exit code 1.
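
A hypothetical guard (not the actual sync script) that would let the pipeline skip, rather than fail, when the branch is already on the remote:

# Skip PR creation if the sync branch already exists upstream
BRANCH="sync-chart-changes-12.2.7"   # version taken from the log above
if git ls-remote --exit-code --heads origin "${BRANCH}" > /dev/null; then
  echo "Branch ${BRANCH} already exists; skipping PR creation"
else
  git push origin "${BRANCH}"
  # ... create the PR here ...
fi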

Dex service fails when setting up multicluster

Saw another one today:

Run ./script/install-multicluster-deps.sh
"dex" has been added to your repositories
namespace/dex created
Error: Service "dex" is invalid: spec.ports[1].nodePort: Invalid value: 32000: provided port is already allocated
Error: Process completed with exit code 1.
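
A minimal sketch (not part of the install script) of how to find which existing service already claims that nodePort:

# List every service's nodePorts across all namespaces and look for 32000
kubectl get svc --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.ports[*].nodePort}{"\n"}{end}' | grep 32000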


absoludity commented 1 year ago

One down: the memory issue for our dashboard lint was simply because the number of JS/TS files kept increasing. The simple solution was to stop linting generated code, so in #6031 I added --ignore-pattern='src/gen/', which has fixed the issue (reducing the memory requirement, for now). I'm not sure why we were linting/prettifying generated code in the first place; let me know if I missed an important reason :)
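
For reference, the resulting lint invocation should look roughly like this (based on the command in the failure log above; the exact package.json wiring isn't shown here):

eslint --config ./.eslintrc.json 'src/**/*.{js,ts,tsx}' --max-warnings=0 --ignore-pattern='src/gen/'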

absoludity commented 1 year ago

For whatever reason, the JavaScript heap error has been happening much more frequently over the last day or two. I'll try to investigate tomorrow.

absoludity commented 1 year ago

After some investigation, the JS heap error is fixed (for now) with #6459.