jfrog / charts

JFrog official Helm Charts
https://jfrog.com/integration/helm-repository/
Apache License 2.0

Max retries exceeded with url ... too many 503 error responses #1899

Closed. amir-bialek closed this issue 1 week ago

amir-bialek commented 1 week ago

Is this a request for help?:


Version of Helm and Kubernetes: 1.29

Which chart: artifactory-cpp-ce 107.77.8

Which product license (Enterprise/Pro/oss): Community

JFrog support reference (if already raised with support team):

What happened: Deployed the artifactory-cpp-ce Helm chart on an on-prem k8s cluster with the following values:

artifactory:

  nginx:
    enabled: false

  ingress:
    enabled: true
    className: "my-ingress-class"
    hosts:
      - my-host-address
    annotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "0"

    tls: 
    - secretName: my-cert
      hosts:
        - my-host-address

  nameOverride: artifactory
  fullnameOverride: artifactory
  artifactory:
    persistence:
      size: 50Gi
  postgresql:
    enabled: false

postgresql:
  enabled: false

Services are accessing Conan directly via artifactory.default.svc.cluster.local.

And I am getting this error too often:

conans.errors.ConanException: HTTPConnectionPool(host='artifactory.default.svc.cluster.local', port=8082): Max retries exceeded with url: /artifactory/api/conan/myartifactory/v1/ping (Caused by ResponseError('too many 503 error responses')).

Recently it has been happening too often to ignore. In the logs I see:

2024/07/07 11:47:04 httputil: ReverseProxy read error during body copy: stream error: stream ID 673843; CANCEL; received from peer

I can try to add:

  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 2

(by default it is set to false)

And / or add postgresql:

postgresql:
  enabled: true

And / or add nginx:

  nginx:
    enabled: true

And/or update the chart to version 107.84.16 (it is currently on 107.77.8).

The thing is, the whole software team is using this database, so every change will block them.

Also, can someone advise what the purpose of postgresql is in this chart?

If anyone can help, I would appreciate it.

gitta-jfrog commented 1 week ago

Hi @amir-bialek, you raised several questions in this issue; if I don't cover all of them, please let me know.

  1. Disabling PostgreSQL - you should use PostgreSQL when running Artifactory on k8s. We do not support k8s deployments with the Derby database (the default configured database). See the example values after this list.

  2. autoscaling - you should not use it, as it only works when you have a valid Enterprise/Enterprise Plus license, which supports High Availability deployments.

  3. If you decide to use the ingress method and disable Nginx, you should install an nginx-ingress controller in your cluster. https://jfrog.com/help/r/jfrog-installation-setup-documentation/run-ingress-behind-another-load-balancer
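
As a minimal sketch of point 1 (mirroring the nesting used in the values posted later in this thread; key names may differ slightly between chart versions), enabling the bundled PostgreSQL subchart instead of Derby looks roughly like:

artifactory:
  postgresql:
    enabled: true        # use the bundled PostgreSQL subchart instead of Derby
    persistence:
      enabled: true
      size: 20Gi         # tune to your expected metadata volume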

amir-bialek commented 1 week ago

Hey @gitta-jfrog
Thank you for the reply.

  1. Understood, thank you. I am now deploying a new Artifactory with the new chart + PostgreSQL, and will try to import the data from the 'live' Artifactory into the 'new' one, then make the switch (system backup and restore). Can you advise why the default PostgreSQL PVC is 200GB while the Artifactory PVC is only 20GB? Shouldn't it be the opposite?

  2. Understood, thank you.

  3. The Nginx ingress controller is installed on the cluster and the ingress to the Conan svc is working well. Note that the specific call happens without the ingress: it comes from another svc in k8s, so it calls Conan directly via artifactory.default.svc.cluster.local:8082.
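
For illustration only (the environment variable name below is hypothetical, not part of the chart or of Conan; the URL is the one from the error above), a consumer pod reaching the in-cluster endpoint might carry the remote URL like this:

  # hypothetical Deployment fragment of a consumer service
  env:
    - name: CONAN_REMOTE_URL   # illustrative name, not a real Conan variable
      value: "http://artifactory.default.svc.cluster.local:8082/artifactory/api/conan/myartifactory"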

gitta-jfrog commented 1 week ago
  1. Indeed, the defaults here should be tuned; you can change them according to your needs. Assuming you are storing your binaries on the PVC itself, the Artifactory filestore will definitely be bigger than the DB, so the sizes should be swapped (see the sketch after this list).

  2. I understand, so your client is reaching the Artifactory SVC directly. I think the 503 errors you are seeing might be related to the resources allocated to the Artifactory service. What is the size of the node running the Artifactory pod? Can you see pod restarts? Is there anything in artifactory-service.log (/opt/jfrog/artifactory/var/log) that indicates resource exhaustion or crashing of the JVM? How many incoming requests are you running in parallel?
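
Following up on point 1, a hedged sketch of flipping the two PVC sizes (it mirrors the values posted later in this thread; adjust both numbers to your own retention needs):

artifactory:
  artifactory:
    persistence:
      size: 200Gi     # the filestore holds the binaries, so it gets the larger volume
  postgresql:
    persistence:
      size: 20Gi      # the DB only stores metadata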

amir-bialek commented 1 week ago

Hey, Artifactory is running on worker1, which has plenty of resources. kubectl top node gives me:

NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
master    828m         6%     5952Mi          37%
worker1   2286m        6%     17547Mi         30%

There are no pod restarts or any special errors, other than:

2024/07/07 11:47:04 httputil: ReverseProxy read error during body copy: stream error: stream ID 673843; CANCEL; received from peer

Which I do see a lot.

Looking at the Grafana dashboard for pod resources, CPU and memory are steady; I do not see any jump in the past 5 days. At the moment the pod has no resource requests or limits (default settings). I did try to add:

  artifactory:
    resources: 
      requests:
        memory: "1Gi"
        cpu: "500m"
      limits:
        memory: "6Gi"
        cpu: "1"
    javaOpts: 
      xms: "1g"
      xmx: "5g"

And I verified in the logs that it received the new Xmx, but it still reproduces the 503.

I do see a jump in bandwidth and packets (see the attached Grafana screenshot).

amir-bialek commented 1 week ago

Hey, after bringing up the new Artifactory with the following values:


artifactory:
  nginx:
    enabled: false
  ingress:
    enabled: true
    className: "my-class"
    hosts:
      - my-host1
      - my-host2
    annotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "0"
    tls: 
    - secretName: dev-cert
      hosts:
        - my-host1
        - my-host2

  nameOverride: artifactory
  fullnameOverride: artifactory

  artifactory:
    persistence:
      accessMode: ReadWriteOnce
      size: 200Gi

  postgresql:
    persistence:
      enabled: true
      size: 20Gi

I do not see the error anymore, but please leave this case open for another 2-3 days so that I can verify the problem is solved by using PostgreSQL.

gitta-jfrog commented 1 week ago

Great, I'm glad you managed to move to PostgreSQL. Using Artifactory with PostgreSQL allows multiple connections to the DB (compared to the single connection allowed when using Derby), and that should improve the system's behavior. I'll keep this open for the next few days.
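
Purely as a hedged illustration of the connection-related settings (key names are assumed from the jfrog/artifactory chart's database block and may vary between chart versions; none of this was required for the fix in this thread), the pool size and an external PostgreSQL can be configured through values along these lines:

artifactory:
  database:
    maxOpenConnections: 80                     # assumed default pool size; tune as needed
    # the keys below are only needed when pointing at an external PostgreSQL
    type: postgresql
    driver: org.postgresql.Driver
    url: "jdbc:postgresql://my-external-postgres:5432/artifactory"   # hypothetical host
    user: artifactory
    password: "<db-password>"
  postgresql:
    enabled: false                             # skip the bundled subchart when using an external DB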

Thanks