goharbor / harbor-helm

The helm chart to deploy Harbor

Database connection limits reached when GC is run #1708

Open nmcostello opened 7 months ago

nmcostello commented 7 months ago

Hi,

We have harbor-helm deployed on a K8s cluster with RDS and S3 as the data backends. We have recently begun seeing an issue where, when GC runs, it takes up all of the available connections on the RDS cluster. This leaves us unable to interact with Harbor via the UI, API, or OCI clients. The connections are eventually freed after ~5 hours, but during that time Harbor is inoperable.

Please let me know if this is a better issue for the main harbor repo.

We have paused the GC schedule for the time being.

Harbor helm chart version: 1.11.1
Harbor version: v2.7.1-6015b3ef

DB Connection Values:

              maxIdleConns: 4
              maxOpenConns: 14

At the time the connections were overwhelmed, we had ~80 core + exporter pods running. By my calculations, that only equates to ~1100 connections, which is nowhere near the 5k that we saw at the time. Any thoughts on this?
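
For reference, the arithmetic behind the ~1100 figure (assuming all ~80 pods are core or exporter replicas, each capped at the configured maxOpenConns of 14):

    expected max DB connections ≈ maxOpenConns × (core pods + exporter pods)
                                ≈ 14 × 80
                                = 1120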

Pics

In the pictures below, you can see that the connections to the DB exceed what they should be according to the Harbor docs, where max connections = [maxOpenConns] * (core + exporter). The spike in pod count around 21:30 is the result of my interventions and is well after we hit max connections on the DB.

[Screenshots taken 2024-02-15: RDS connection count and pod count over time]

Vad1mo commented 6 months ago

Why do you run 80 core pods? Are you piping the S3 traffic via core (i.e. is redirect disabled on docker distribution)?

You can do quite a bit of optimization with indexes and caches.

However, this won't solve the GC issue. It's a fundamental problem Harbor inherited from docker distribution.

zyyw commented 6 months ago

Maybe you need to update the DB connection settings to a larger number, reference:
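
For illustration only (not the reference above): these limits are set per component via the chart's database values. The key names below match recent harbor-helm versions, but defaults and placement can vary by chart version, and the numbers here are placeholders to size against your RDS max_connections:

    database:
      # connections kept idle in each component's pool
      maxIdleConns: 10
      # per-component-instance cap; the effective total is roughly
      # maxOpenConns * (core replicas + exporter replicas), so size this
      # against the RDS max_connections and your expected replica count
      maxOpenConns: 50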

nmcostello commented 6 months ago

@Vad1mo wrote:

Why do you run 80 core pods? Are you piping the S3 traffic via core (i.e. is redirect disabled on docker distribution)?

You can do quite a bit of optimization with indexes and caches.

However, this won't solve the GC issue. It's a fundamental problem Harbor inherited from docker distribution.

@Vad1mo We aren't doing anything special. Our pods spike to 80 during the day with the traffic that we see. But if there are ways to optimize this, I would love to read about it. Let me paste our S3 config...

          {{- if .Values.s3 }}
              s3:
                {{- if and (ne .Values.environment "internal") (ne .Values.environment "internal-test") }}
                existingSecret: {{ .Values.targetNamespace }}-secret
                {{- end }}
                region: {{ .Values.s3.region }}
                bucket: {{ .Values.s3.bucket }}
                accesskey: managed-by-sealed-secret
                secretkey: managed-by-sealed-secret
                regionendpoint: {{ .Values.s3.regionendpoint }}

                encrypt: ""
                keyid: ""
                secure: ""
                skipverify: true
                v4auth: ""
                chunksize: "5242880"
                rootdirectory: ""
                storageclass: STANDARD
                multipartcopychunksize: "33554432"
                multipartcopymaxconcurrency: 100
                multipartcopythresholdsize: "33554432"
          {{- end }}
Vad1mo commented 6 months ago

[image] This is the option you should have turned off.
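
Presumably this refers to the storage redirect toggle under persistence.imageChartStorage in the chart values (an assumption on my part, since the option itself is only shown in the image). A sketch of where it lives; when disableredirect is false, the registry redirects blob downloads to S3 instead of streaming them through Harbor's own pods:

    persistence:
      imageChartStorage:
        # false = clients are redirected to the S3 backend for blob data,
        # so that traffic does not flow through Harbor's registry/core pods
        disableredirect: false
        type: s3
        # the s3: settings pasted above would sit here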

Something seems way off in your setup. IMO, a single pod can do 100-300 concurrent operations.

github-actions[bot] commented 3 months ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

github-actions[bot] commented 2 months ago

This issue was closed because it has been stalled for 30 days with no activity. If this issue is still relevant, please re-open a new issue.

github-actions[bot] commented 2 weeks ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.