HPA kills new pods instantly after creation

lorrx commented 2 months ago

Describe your Issue

When I activate HPA in the Helm chart, one pod is initially scheduled (which is correct). As soon as I synchronize files, the CPU load of this pod rises above 60%, so the HPA tries to scale up. The new pods are scheduled, but they are killed again immediately after container creation. As a result, there is never more than one pod running at a time, although there should be 5.

Logs and Errors

There are no errors in the logs. The termination seems to be caused by Kubernetes itself.

Describe your Environment

Kubernetes distribution: k3s
Helm Version (or App that manages helm): ArgoCD version v2.10.7+b060053
Helm Chart Version: 4.6.6
values.yaml:

nextcloud:
  host: 10.3.28.0
  configs:
    custom.config.php: |
      <?php
        $CONFIG = array(
          "check_data_directory_permissions"=> false, # fix data directory permissions error
          "trusted_domains" => array (
            $_ENV["NEXTCLOUD_TRUSTED_DOMAINS"], # fix probes 400 error
          ),
          'trusted_proxies' => array(
            0 => '127.0.0.1',
            1 => '10.0.0.0/8',
          ),
          "forwarded_for_headers" => array("HTTP_X_FORWARDED_FOR"),
        );
  containerPort: 8080
  extraVolumes:
    - name: nginx-cache
      emptyDir: { }
  extraVolumeMounts:
    - name: nginx-cache
      mountPath: "/var/cache/nginx" # fix permission denied error
  securityContext:
    runAsUser: 901000
    runAsGroup: 901000
    runAsNonRoot: true
  podSecurityContext:
    runAsUser: 901000
    runAsGroup: 901000
    runAsNonRoot: true
service:
  type: LoadBalancer
internalDatabase:
  enabled: false
image:
  flavor: fpm
nginx:
  enabled: true
  image:
    repository: nginxinc/nginx-unprivileged
    tag: 1.25 # https://hub.docker.com/r/nginxinc/nginx-unprivileged/tags
  containerPort: 8080
  resources:
    limits:
      cpu: 200m
      memory: 128Mi
    requests:
      cpu: 100m
      memory: 64Mi
  securityContext:
    runAsUser: 901000
    runAsGroup: 901000
    runAsNonRoot: true
externalDatabase:
  enabled: true
  type: postgresql
  host: nextcloud-postgresql-primary
  database: nextcloud
  user: nextcloud
  password: nextcloud
hpa:
  enabled: true
  minPods: 1
  maxPods: 5
resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 250m
    memory: 256Mi
persistence:
  enabled: true
  existingClaim: pvc-k8s-nextcloud-app
  nextcloudData:
    enabled: true
    existingClaim: pvc-k8s-nextcloud-data
livenessProbe:
  enabled: true
readinessProbe:
  enabled: true
startupProbe:
  enabled: true

Additional context, if any

I found a possible solution for this issue. As mentioned in this StackOverflow article, the replicas field should not be set in the Deployment resource when an HPA definition is used.
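A minimal sketch of what that could look like in the chart's Deployment template, assuming the replica count comes from a `replicaCount` value (that name is an assumption here, as is the file path; `hpa.enabled` is the value from the values.yaml above). It is an illustration of the idea, not the chart's actual template:

```yaml
# templates/deployment.yaml (hypothetical fragment)
apiVersion: apps/v1
kind: Deployment
spec:
  {{- if not .Values.hpa.enabled }}
  # Only pin the replica count when no HPA manages the Deployment.
  # .Values.replicaCount is an assumed value name.
  replicas: {{ .Values.replicaCount }}
  {{- end }}
```

With `hpa.enabled: true` the rendered manifest would contain no `replicas` field, so the HPA would be the only thing setting it.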

I am using NFS as persistent storage with the NFS CSI driver. The PVC has RWX access mode.

lorrx commented 2 months ago

Additional information: The problem seems to be in ArgoCD. When I disable the self-heal option, all replicas are scheduled as expected.

I suspect that ArgoCD detects a diff in the deployment.yml (replicas: 1 != replicas: 3) and then resets the replica count to the value from the chart. So removing the replicas option when HPA is enabled could really be the solution.
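Until the chart changes, a possible workaround on the Argo CD side is to tell it to ignore the replica count of Deployments. The fragment below is a sketch only: the Application name and namespace are placeholders, the rest of the Application spec (source, destination) is left out, and I have not verified it against this exact setup.

```yaml
# Argo CD Application (fragment; name and namespace are placeholders)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: nextcloud
  namespace: argocd
spec:
  ignoreDifferences:
    # Do not treat a changed replica count as drift,
    # so self-heal leaves the HPA's scaling alone.
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
  syncPolicy:
    automated:
      selfHeal: true
    syncOptions:
      # Also skip the ignored field when a sync is applied.
      - RespectIgnoreDifferences=true
```

This only hides the diff from Argo CD. Installing the chart directly with helm would still render `replicas` into the Deployment.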

jessebot commented 1 month ago

@lorrx thanks for the issue and the updates 🙏 If you have found a solution that works both in Argo CD and via helm directly on a k8s cluster, please feel free to submit a PR to correct the issue.
