goharbor / harbor

An open source trusted cloud native registry project that stores, signs, and scans content.
https://goharbor.io
Apache License 2.0

harbor-core sporadically throws HTTP.5xx - Database performance related #15048

Closed: phoinixgrr closed this issue 2 years ago

phoinixgrr commented 3 years ago

Running Harbor v2.2.1, using an AWS RDS instance as the database. Some specs for our Harbor deployment:

We are observing high database usage generated by Harbor. We've enabled PostgreSQL slow-query logging to gather some insight into the database usage. The following queries are reported with a duration of over 10 seconds:

| 1622621735000 | 2021-06-02 08:15:35 UTC:10.1.15.1(1331):harbor@core_db:[21602]:LOG:  duration: 13985.656 ms  execute <unnamed>: SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time" FROM "quota" WHERE "reference" = $1 AND "reference_id" = $2 FOR UPDATE    | mgmt-harbor-serverless-prod.1 |
| 1622621735000 | 2021-06-02 08:15:35 UTC:10.1.17.254(33891):harbor@core_db:[21663]:LOG:  duration: 14040.258 ms  execute <unnamed>: SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time" FROM "quota" WHERE "reference" = $1 AND "reference_id" = $2 FOR UPDATE | mgmt-harbor-serverless-prod.1 |
| 1622621735000 | 2021-06-02 08:15:35 UTC:10.1.11.177(40627):harbor@core_db:[21719]:LOG:  duration: 13736.308 ms  execute <unnamed>: SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time" FROM "quota" WHERE "reference" = $1 AND "reference_id" = $2 FOR UPDATE | mgmt-harbor-serverless-prod.1 |

These are FOR UPDATE queries, which cause the rows retrieved by the SELECT statement to be locked as though for update.
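
For context, here is a minimal two-session sketch of why these statements pile up: the second transaction's SELECT ... FOR UPDATE on the same quota row blocks until the first transaction commits or rolls back (illustrative only; the reference/reference_id values are placeholders):

-- Session 1: takes the row lock and holds it for the duration of the transaction
BEGIN;
SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time"
FROM "quota"
WHERE "reference" = 'project' AND "reference_id" = '37'
FOR UPDATE;
-- ... quota usage is recalculated here ...

-- Session 2: the same statement for the same row now waits
-- until session 1 issues COMMIT or ROLLBACK.
BEGIN;
SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time"
FROM "quota"
WHERE "reference" = 'project' AND "reference_id" = '37'
FOR UPDATE;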

Database Locking graph confirms the DB locking issue, showing 400+ locked tables.
(screenshot: database locking graph)

When this happens, we sporadically get HTTP 5xx errors thrown by Harbor on PUT requests, so image uploads fail for clients:

"PUT /v2/beat/[...] HTTP/1.1" 500 66 "-" "  8516 13.609 [harbor-harbor-core-http] [] 100.97.80.52:8080 66 13.612 500 90bf9ce9edddd7d33ccd7014dae4a47e
"PUT /v2/beat/[...] HTTP/1.1" 500 66 "-" " 2269 6.295 [harbor-harbor-core-http] [] 100.97.81.139:8080 66 6.296 500 9fa9d68459840391c48486e08061fbed
"PUT /v2/beat/[...] HTTP/1.1" 500 66 "-" "[...] 561 30.762 [harbor-harbor-core-http] [] 100.97.81.139:8080 66 30.760 500 2cc59940a45cec7fd5b372fe74c88f5c
"PUT /v2/beat/[...] /manifests/[...] HTTP/1.1" 500 66 "-" [...] " 3309 6.366 [harbor-harbor-core-http] [] 100.97.81.40:8080 66 6.364 500
"PUT /v2/beat/[...] [...]blobs/uploads/cb4b12f8-98f8-43a1-9e27-04bfe1d46f80? HTTP/1.1" 500 66 "-" "[...] 1584 23.598 [harbor-harbor-core-http] [] 100.97.80.52:8080 66 23.600 500

and at the same time Harbor throws errors that suggest issues with the database:

[ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: deal with /api/v2.0/projects/beat/repositories/[...]/artifacts request in transaction failed: driver: bad connection"}]} 
[ERROR] [/lib/orm/orm.go:86]: commit transaction failed: driver: bad connection
[ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: deal with /api/v2.0/projects/beat/repositories/[...]/artifacts request in transaction failed: driver: bad connection"}]} 
[ERROR] [/lib/orm/orm.go:86]: commit transaction failed: driver: bad connection
[ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: deal with /v2/beat/[....]/manifests/[...] request in transaction failed: driver: bad connection"}]} 
[ERROR] [/lib/orm/orm.go:78]: rollback transaction failed: driver: bad connection
[ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: driver: bad connection"}]} 
[ERROR] [/server/middleware/blob/put_manifest.go:63][middleware="blob" requestID="07fa6cc169437ec9272eebdab971e216"]: get project failed, error: driver: bad connection
[ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: deal with /api/v2.0/projects/beat/repositories/[...]/artifacts request in transaction failed: driver: bad connection"}]} 
[ERROR] [/lib/orm/orm.go:86]: commit transaction failed: driver: bad connection
[ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: driver: bad connection"}]} 
[ERROR] [/lib/orm/orm.go:78]: rollback transaction failed: driver: bad connection
[ERROR] [/controller/quota/controller.go:299][requestID="c5be4a76ee93e8f79beaf362a309f1de"]: failed to calculate quota usage for project 37, error: driver: bad connection

The database has been scaled up to cope with this. It currently runs with 16 vCPUs and 122 GB RAM, dedicated to Harbor.

Could this suggest an issue with the way that Harbor is managing the Database?

paulliss commented 3 years ago

AFAIK your version has this trouble when you upgrade from previous versions. Look for it - it has been discussed already somewhere in https://github.com/goharbor/

Also, you can tune maxIdleConns & maxOpenConns for harbor-harbor-core; the defaults are 50 & 1000, while my working set is 15 & 175 respectively.

heww commented 3 years ago

Hi @phoinixgrr, how many concurrencies do you use to pull/push the images? Are they pulling/pushing the images in the same project?

phoinixgrr commented 3 years ago

AFAIK your version has this trouble when you upgrade from previous versions. Look for it - it has been discussed already somewhere in https://github.com/goharbor/

Also, you can tune maxIdleConns & maxOpenConns for harbor-harbor-core; the defaults are 50 & 1000, while my working set is 15 & 175 respectively.

Hi @paulliss, thanks for replying.

So, there is an issue with migrating the database when updating from older versions, and this can cause this performance regression / 5xx on the Harbor side? Could you point me to the issue? I seem to have trouble finding it :(

We've opted not to re-use open connections (i.e. dropped connection pooling) with:

- name: POSTGRESQL_MAX_IDLE_CONNS
  value: "-1"
- name: POSTGRESQL_MAX_OPEN_CONNS
  value: "-1"
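
For what it's worth, a quick way to see how these settings behave on the Postgres side is to count the core service's connections by state (a minimal diagnostic sketch; the core_db database and harbor user names are taken from the log lines above):

-- How many connections Harbor core is holding open, and what they are doing
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = 'core_db'
  AND usename = 'harbor'
GROUP BY state
ORDER BY count(*) DESC;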

Hi @phoinixgrr, how many concurrencies do you use to pull/push the images? Are they pulling/pushing the images in the same project?

Hey @heww, thanks for stepping into this. We are using Docker to perform pushes and pulls. The defaults are in use (3 concurrent connections for pulling and 5 concurrent connections for pushing):

--max-concurrent-downloads int          Set the max concurrent downloads for each pull (default 3)
--max-concurrent-uploads int            Set the max concurrent uploads for each push (default 5)

Yes, most of our images are under the same project (95%+).

paulliss commented 3 years ago

Hey @phoinixgrr. I'll try to dig the link out. One more question: what kind of storage do you use for your images/charts?

phoinixgrr commented 3 years ago

Hey @phoinixgrr. I'll try to dig the link out. One more question: what kind of storage do you use for your images/charts?

Object storage. Amazon S3 @paulliss

paulliss commented 3 years ago

@phoinixgrr here is the link: https://github.com/goharbor/harbor/issues/14358 But I made a mistake - it is actually about the upgrade to 2.2.0 :(

Can you check whether the harbor-harbor-core service has any restarts?

phoinixgrr commented 3 years ago

@phoinixgrr here is the link: #14358 But I made a mistake - it is actually about the upgrade to 2.2.0 :(

Can you check whether the harbor-harbor-core service has any restarts?

No restarts under harbor-core @paulliss

@heww
Since we are not using project quotas, is it possible to disable these queries in order to avoid this extreme locking from taking place?

SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time" FROM "quota" WHERE "reference" = $1 AND "reference_id" = $2 FOR UPDATE 
SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time" FROM "quota" WHERE "reference" = $1 AND "reference_id" = $2 FOR UPDATE 
SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time" FROM "quota" WHERE "reference" = $1 AND "reference_id" = $2 FOR UPDATE
paulliss commented 3 years ago

@phoinixgrr any news?

I just remembered that I ran into this kind of trouble when there were not enough IOPS, because of my SATA disks.

aladjadj commented 3 years ago

Hello, if you do not use the quota feature, you can disable it from the API to mitigate your errors; have a look at the swagger API.

phoinixgrr commented 3 years ago

@phoinixgrr any news?

I just remembered that I ran into this kind of trouble when there were not enough IOPS, because of my SATA disks.

@paulliss There is no observed IOWAIT on any of the subsystems, either on Harbor itself or on the RDS database. Storage is performant.

Other than that, things are not looking good. When this extreme locking on the database occurs, Harbor starts throwing 5xx errors on image pushes. The following query has been identified as producing the locking:

SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time" FROM "quota" WHERE "reference" = $1 AND "reference_id" = $2 FOR UPDATE 

Hello, if you do not use the quota feature, you can disable it from the API to mitigate your errors; have a look at the swagger API.

@aladjadj Quotas are reported as disabled (-1) from the API:

[
  {
    "id": 100,
    "ref": {
      "id": 464,
      "name": "------",
      "owner_name": "admin"
    },
    "creation_time": "2021-03-19T10:14:01.046001Z",
    "update_time": "2021-03-19T10:21:04.23482Z",
    "hard": {
      "storage": -1
    },
    "used": {
      "storage": 5607868043
    }
  },
  {
    "id": 67,
    "ref": {
      "id": 167,
      "name": "------",
      "owner_name": "admin"
    },
    "creation_time": "2021-03-03T11:03:59.948701Z",
    "update_time": "2021-03-24T09:33:17.246179Z",
    "hard": {
      "storage": -1
    },
    "used": {
      "storage": 80959601
    }
  },
  {
    "id": 45,
    "ref": {
      "id": 56,
      "name": "------",
      "owner_name": "admin"
    },
    "creation_time": "2021-01-27T14:10:31.390346Z",
    "update_time": "2021-03-17T09:22:06.392402Z",
    "hard": {
      "storage": -1
    },
    "used": {
      "storage": 998161523
    }
  },
  {
    "id": 44,
    "ref": {
      "id": 55,
      "name": "------",
      "owner_name": "admin"
    },
    "creation_time": "2021-01-27T14:09:10.222079Z",
    "update_time": "2021-03-17T09:22:06.365389Z",
    "hard": {
      "storage": -1
    },
    "used": {
      "storage": 0
    }
  },
  {
    "id": 41,
    "ref": {
      "id": 43,
      "name": "------",
      "owner_name": "admin"
    },
    "creation_time": "2021-01-20T11:05:34.104598Z",
    "update_time": "2021-03-17T09:22:06.349697Z",
    "hard": {
      "storage": -1
    },
    "used": {
      "storage": 53979915
    }
  },
  {
    "id": 37,
    "ref": {
      "id": 37,
      "name": "------",
      "owner_name": "admin"
    },
    "creation_time": "2021-01-19T12:27:09.413684Z",
    "update_time": "2021-06-07T07:57:46.678234Z",
    "hard": {
      "storage": -1
    },
    "used": {
      "storage": 4569325322201
    }
  },
  {
    "id": 36,
    "ref": {
      "id": 36,
      "name": "------",
      "owner_name": "admin"
    },
    "creation_time": "2021-01-19T12:01:19.862108Z",
    "update_time": "2021-04-20T06:46:28.082002Z",
    "hard": {
      "storage": -1
    },
    "used": {
      "storage": 119763371
    }
  }
]
heww commented 3 years ago

Hi @phoinixgrr, -1 is not used to disable the quota.

Please call the /api/internal/switchquota API as admin to disable the quota feature in Harbor:

curl -vkL -X PUT -u admin:Harbor12345 -H "Content-Type: application/json" -d '{"Enabled": false}' https://host/api/internal/switchquota
phoinixgrr commented 3 years ago

@heww
Calling the API endpoint /api/internal/switchquota actually fixed our extreme locking! Awesome. It seems to be working as a workaround in our case. Question: should this indicate an issue with the way Harbor manages the database?


Since then, we've performed a cleanup on our largest project by deleting all untagged artifacts (using a TAG RETENTION rule), and a new kind of HTTP 500 error is now thrown when pushes happen. We've drilled down into the logs to identify this new issue:

  1. The Docker client gets HTTP 500 Internal Server Error from Harbor when pushing: Jun 8 07:28:25 ip-10-30-2-156 dockerd[541]: level=info msg="Attempting next endpoint for push after error: received unexpected HTTP status: 500 Internal Server Error"

  2. Nginx-ingress confirms the 500 thrown by Harbor:

    2021-06-08 07:28:25 192.168.5.196 - - [08/Jun/2021:07:28:25 +0000] "PUT /v2/beat/[....] HTTP/1.1" 500 66 "-" "docker/[...]  git-commit/[...]-aws os/linux arch/amd64 UpstreamClient(Docker-Client[...])" 6844 0.460 [harbor-harbor-core-http] [] 100.97.81.117:8080 66 0.460 500 95ef3bdd3ff37311192eb8eea8bf9927
  3. Harbor logs suggest Database related issues: 2021-06-08T07:28:25Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: pq: current transaction is aborted, commands ignored until end of transaction block"}]}

  4. PostgreSQL logs show duplicate key / unique constraint violations:

    2021-06-08 07:28:25 UTC:10.1.20.13(16391):harbor@core_db:[2307]:ERROR: duplicate key value violates unique constraint "unique_artifact"
    2021-06-08 07:28:25 UTC:10.1.20.13(16391):harbor@core_db:[2307]:DETAIL:  Key (repository_id, digest)=(311, sha256:1c45e91bc8eefd723d37e3ea7def56de111d617349f47abc89181243ba060c76) already exists.
    2021-06-08 07:28:25 UTC:10.1.20.13(16391):harbor@core_db:[2307]:STATEMENT:  INSERT INTO "artifact" ("type", "media_type", "manifest_media_type", "project_id", "repository_id", "repository_name", "digest", "size", "icon", "push_time", "pull_time", "extra_attrs", "annotations") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13) RETURNING "id"
    2021-06-08 07:28:25 UTC:10.1.20.13(16391):harbor@core_db:[2307]:ERROR: current transaction is aborted, commands ignored until end of transaction block

In conclusion, we've managed to work around our main issue concerning the extreme database locking. Unfortunately, it seems the cleanup of untagged artifacts has introduced new database issues related to duplicate keys (unique_artifact).
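
For clarity, the constraint being violated ties each digest to a single artifact row per repository, roughly as sketched below (the shape is inferred from the DETAIL line in the log; the exact DDL in Harbor's schema migrations may differ):

-- One artifact row per (repository, digest) pair
ALTER TABLE artifact
    ADD CONSTRAINT unique_artifact UNIQUE (repository_id, digest);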

heww commented 3 years ago

According to your logs, can you confirm whether there were push actions happening while you were deleting all untagged artifacts?

heww commented 3 years ago

Could you provide the count of the records in the quota table?

Taking about 13985.656 ms to execute a SELECT FOR UPDATE statement is not normal. Can you confirm that there isn't too much pushing to your Harbor instance?

How much CPU and memory is Postgres using when you hit this problem?

phoinixgrr commented 3 years ago

Hey @heww , thanks for your input.

According to your logs, can you confirm whether there were push actions happening while you were deleting all untagged artifacts?

During yesterday's cleanup, which lasted 337 minutes, there were push actions happening. But at the time of today's logs that we shared (2021-06-08 07:28:25 UTC), there was no cleanup action ongoing.

Could you provide the count of the records in the quota table?

core_db=> select COUNT(*) from quota;
 count
-------
     7
(1 row)

Taking about 13985.656 ms to execute a SELECT FOR UPDATE statement is not normal. Can you confirm that there isn't too much pushing to your Harbor instance?

Nginx-Ingress HTTP PUT requests forwarded to Harbor, last 24h: (screenshot)

Nginx-Ingress HTTP PUT requests forwarded to Harbor, last 7d: (screenshot)

How much CPU and memory is Postgres using when you hit this problem?

The problem is observed when we get more than 250-300 table locks, while Harbor runs SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time" FROM "quota" WHERE "reference" = $1 AND "reference_id" = $2 FOR UPDATE on the database:

(screenshot: table locks graph)

DB CPU/memory usage over the same time window as the above graph seems nominal: (screenshot)

This is a 16 vCPU / 122 GB RAM RDS DB instance, dedicated to Harbor.


Just to make sure we are on the same page, the locking issue is now mitigated with https://github.com/goharbor/harbor/issues/15048#issuecomment-855767153 as proposed.

We now have a new issue concerning the database, with duplicate keys (unique_artifact), which seems to have come up after the cleanup of untagged artifacts. Would you advise opening a new issue about this?

ywk253100 commented 3 years ago

@phoinixgrr For the duplicate key issue, is it happening for all image pushes or just for some specific images?

Could you run the following SQL and provide us with the output:

select * from repository where repository_id= 311;
select * from repository where name = (select name from repository where repository_id=311);

select * from artifact where repository_name=(select name from repository where repository_id=311) and digest='sha256:1c45e91bc8eefd723d37e3ea7def56de111d617349f47abc89181243ba060c76';

select * from artifact where repository_id=311 and digest='sha256:1c45e91bc8eefd723d37e3ea7def56de111d617349f47abc89181243ba060c76';

BTW, is your Harbor instance a fresh installation or an upgrade from an old version?

phoinixgrr commented 3 years ago

Hey @ywk253100 , thanks for stepping in.

@phoinixgrr For the duplicate key issue, is it happening for all image pushes or just for some specific images?

This seems to be completely random. Moreover, re-running the same push succeeds.
It seems to have started after cleaning up all untagged artifacts from our biggest project (2021-06-07):

Occurrences in the last 30 days:

2021-06-09T07:24:31Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: pq: current transaction is aborted, commands ignored until end of transaction block"}]} 
2021-06-09T02:27:24Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: pq: current transaction is aborted, commands ignored until end of transaction block"}]} 
2021-06-08T12:24:23Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: pq: current transaction is aborted, commands ignored until end of transaction block"}]} 
2021-06-08T12:21:29Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: pq: current transaction is aborted, commands ignored until end of transaction block"}]} 
2021-06-08T07:28:25Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: pq: current transaction is aborted, commands ignored until end of transaction block"}]} 
2021-06-08T07:17:06Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: pq: current transaction is aborted, commands ignored until end of transaction block"}]} 
2021-06-08T06:18:49Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: pq: current transaction is aborted, commands ignored until end of transaction block"}]} 
2021-06-08T06:18:49Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: pq: current transaction is aborted, commands ignored until end of transaction block"}]} 

Could you run the following SQL and provide us with the output:

select * from repository where repository_id= 311;
select * from repository where name = (select name from repository where repository_id=311);

select * from artifact where repository_name=(select name from repository where repository_id=311) and digest='sha256:1c45e91bc8eefd723d37e3ea7def56de111d617349f47abc89181243ba060c76';

select * from artifact where repository_id=311 and digest='sha256:1c45e91bc8eefd723d37e3ea7def56de111d617349f47abc89181243ba060c76';

core_db=> select * from repository where repository_id=311;
repository_id |   name    | project_id | description | pull_count | star_count |       creation_time        |        update_time
---------------+-----------+------------+-------------+------------+------------+----------------------------+---------------------------
311 | beat/[...] |         37 |             |     142784 |          0 | 2021-01-26 10:38:04.354809 | 2021-06-09 08:37:33.40615
(1 row)

core_db=> select * from repository where name = (select name from repository where repository_id=311);
 repository_id |    name    | project_id | description | pull_count | star_count |       creation_time        |        update_time
---------------+------------+------------+-------------+------------+------------+----------------------------+---------------------------
           311 | beat/[...] |         37 |             |     142784 |          0 | 2021-01-26 10:38:04.354809 | 2021-06-09 08:37:33.40615
(1 row)

core_db=> select * from artifact where repository_name=(select name from repository where repository_id=311) and digest='sha256:1c45e91bc8eefd723d37e3ea7def56de111d617349f47abc89181243ba060c76';
 id                  | 541303
 project_id          | 37
 repository_name     | beat/[...]
 repository_id       | 311
 digest              | sha256:1c45e91bc8eefd723d37e3ea7def56de111d617349f47abc89181243ba060c76
 type                | IMAGE
 pull_time           | 2021-06-08 07:29:17.367729
 push_time           | 2021-06-08 07:28:25.469699
 media_type          | application/vnd.docker.container.image.v1+json
 manifest_media_type | application/vnd.docker.distribution.manifest.v2+json
 size                | 233497922
 extra_attrs         | {[....]"StopSignal":"SIGQUIT"},"created":"2021-06-08T07:28:15.044442783Z","os":"linux"}
(1 row)

core_db=> select * from artifact where repository_id=311 and digest='sha256:1c45e91bc8eefd723d37e3ea7def56de111d617349f47abc89181243ba060c76';
(same single row as above)
(1 row)



> BTW, is your Harbor instance a fresh installation or an upgrade from an old version?

This is an upgrade from an old version. Specifically, this is the upgrade path `2.1.3 > 2.2.0 > 2.2.1 > 2.2.2` 
ywk253100 commented 3 years ago

@phoinixgrr Is there concurrent pushing in your env, such as triggered by CI/CD?

phoinixgrr commented 3 years ago

@phoinixgrr Is there concurrent pushing in your env, such as triggered by CI/CD?

@ywk253100 Yes, there is. 95% of the pushes are done by CI/CD. We are continuously testing/building/deploying, so there is a high likelihood of concurrent image pushes. Moreover, the defaults for the Docker client concurrency settings are in effect:

--max-concurrent-downloads int         Set the max concurrent downloads for each pull (default 3)
--max-concurrent-uploads int            Set the max concurrent uploads for each push (default 5)

Today's HTTP.500 occurrence:

DockerClientLog:

+ docker push registry.priv[....]/beat/[....]
The push refers to repository [registry.priv[....]/beat/[...]]
449af51d3268: Preparing
44ee68603e8e: Preparing
44ee68603e8e: Layer already exists
449af51d3268: Pushed
received unexpected HTTP status: 500 Internal Server Error

NginxIngress Log (client received HTTP 500 from harbor during image push):

2021-06-09 07:24:31 
192.168.5.23 - - [09/Jun/2021:07:24:31 +0000] "PUT /v2/beat/[...] HTTP/1.1" 500 66 "-" "docker[...] go/go[...] git-commit/[...] \x5C(linux\x5C))" 2074 0.488 [harbor-harbor-core-http] [] 100.97.80.156:8080 66 0.488 500 8f0a161a4dfc24e4b28e5e426ad8842e

HarborLog:

2021-06-09T07:24:31Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: pq: current transaction is aborted, commands ignored until end of transaction block"}]} 

PostgresLog:

1623223471000,"2021-06-09 07:24:31 UTC:10.1.15.167(45471):harbor@core_db:[21482]:ERROR:  duplicate key value violates unique constraint ""unique_artifact""",mgmt-harbor-serverless-prod.1
1623223471000,"2021-06-09 07:24:31 UTC:10.1.15.167(45471):harbor@core_db:[21482]:DETAIL:  Key (repository_id, digest)=(1955, sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1) already exists.",mgmt-harbor-serverless-prod.1
1623223471000,"2021-06-09 07:24:31 UTC:10.1.15.167(45471):harbor@core_db:[21482]:STATEMENT:  INSERT INTO ""artifact"" (""type"", ""media_type"", ""manifest_media_type"", ""project_id"", ""repository_id"", ""repository_name"", ""digest"", ""size"", ""icon"", ""push_time"", ""pull_time"", ""extra_attrs"", ""annotations"") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13) RETURNING ""id""",mgmt-harbor-serverless-prod.1
1623223471000,"2021-06-09 07:24:31 UTC:10.1.15.167(45471):harbor@core_db:[21482]:ERROR:  current transaction is aborted, commands ignored until end of transaction block",mgmt-harbor-serverless-prod.1
1623223471000,"2021-06-09 07:24:31 UTC:10.1.15.167(45471):harbor@core_db:[21482]:STATEMENT:  SELECT T0.""id"", T0.""type"", T0.""media_type"", T0.""manifest_media_type"", T0.""project_id"", T0.""repository_id"", T0.""repository_name"", T0.""digest"", T0.""size"", T0.""icon"", T0.""push_time"", T0.""pull_time"", T0.""extra_attrs"", T0.""annotations"" FROM ""artifact"" T0 WHERE T0.""digest"" = $1 AND T0.""repository_name"" = $2 ",mgmt-harbor-serverless-prod.1
1623223471000,2021-06-09 07:24:31 UTC:10.1.22.177(54505):harbor@core_db:[21473]:LOG:  disconnection: session time: 0:00:00.557 user=harbor database=core_db host=10.1.22.177 port=54505,mgmt-harbor-serverless-prod.1
ywk253100 commented 3 years ago

@phoinixgrr Could you check the Nginx logs to find out whether there is another identical PUT /v2/beat/[...] API request that got a 2xx response in the same period as the one that got the 500 error?

phoinixgrr commented 3 years ago

@ywk253100

So at the time of the HTTP 500 failure, there were a total of 5 image pushes to the same repo, with different tags, triggered by CI/CD. These pushes happened within 0 to 30 seconds of each other.

1) 192.168.5.23 - - [09/Jun/2021:07:24:31 +0000] "PUT /v2/beat/repo_name/manifests/phobos HTTP/1.1" 500 66 "-" "docker/[...] go/go[...] git-commit/[...]" 2074 0.488 [harbor-harbor-core-http] [] 100.97.80.156:8080 66 0.488 500 8f0a161a4dfc24e4b28e5e426ad8842e
2) 192.168.5.23 - - [09/Jun/2021:07:24:31 +0000] "PUT /v2/beat/repo_name/manifests/deimos HTTP/1.1" 201 0 "-" "docker/[...] go/go[...] git-commit/[...]" 2074 0.561 [harbor-harbor-core-http] [] 100.97.81.251:8080 0 0.560 201 85070e11d9baf70eef7f563537bcd0b0
3) 192.168.3.168 - - [09/Jun/2021:07:24:19 +0000] "PUT /v2/beat/repo_name/manifests/kratos HTTP/1.1" 201 0 "-" "docker/[...] go/go[...] git-commit/[...]" 2075 0.451 [harbor-harbor-core-http] [] 100.97.80.245:8080 0 0.448 201 5aa3781f5aacfd784eb016a6ea9721f7
4) 192.168.5.41 - - [09/Jun/2021:07:24:16 +0000] "PUT /v2/beat/repo_name/manifests/thanos HTTP/1.1" 201 0 "-" "docker/[...] go/go[...] git-commit/[...]" 2074 0.454 [harbor-harbor-core-http] [] 100.97.80.245:8080 0 0.452 201 cf204480e80baac7faf10c7d74f3a45c
5) 192.168.5.37 - - [09/Jun/2021:07:24:09 +0000] "PUT /v2/beat/repo_name/manifests/zeus HTTP/1.1" 201 0 "-" "docker/[...] go/go[...] git-commit/[...]" 2074 0.529 [harbor-harbor-core-http] [] 100.97.81.251:8080 0 0.528 201 15af31bc1e4c196969ac0d373f4f5880

Pushes /v2/beat/repo_name/manifests/phobos and /v2/beat/repo_name/manifests/deimos seem to have happened in the same second and were triggered by the same host.

Push /v2/beat/repo_name/manifests/deimos succeeded with an HTTP/201 reply from Harbor, while push /v2/beat/repo_name/manifests/phobos failed with an HTTP/500 thrown by Harbor.

And here is the view from Harbor GUI:

(screenshot: Harbor UI showing the repository's artifacts)

It shows all images updated with the same "Push Time", except "phobos", since that push failed with HTTP/500.

bitsf commented 3 years ago

The digests of phobos and deimos are not the same. Does digest e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1 have other tags? Can you grep for e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1 in the logs?

phoinixgrr commented 3 years ago

The digests of phobos and deimos are not the same. Does digest e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1 have other tags? Can you grep for e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1 in the logs?

Hi @bitsf. Thanks for pitching in.

The requested logs:

2021-06-09T07:25:08Z [INFO] [/pkg/notifier/notifier.go:204]: Handle notification with Handler 'ArtifactWebhook' on topic 'PULL_ARTIFACT': ID-543705, Repository-beat/repo_name Tags-[deimos] Digest-sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1 Operator- OccurAt-2021-06-09 07:25:08
2021-06-09T07:25:08Z [DEBUG] [/controller/event/handler/webhook/artifact/artifact.go:76]: cannot find policy for PULL_ARTIFACT event: ID-543705, Repository-beat/repo_name Tags-[deimos] Digest-sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1 Operator- OccurAt-2021-06-09 07:25:08
2021-06-09T07:25:08Z [INFO] [/pkg/notifier/notifier.go:204]: Handle notification with Handler 'AuditLog' on topic 'PULL_ARTIFACT': ID-543705, Repository-beat/repo_name Tags-[deimos] Digest-sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1 Operator- OccurAt-2021-06-09 07:25:08
2021-06-09T07:25:08Z [INFO] [/pkg/notifier/notifier.go:204]: Handle notification with Handler 'InternalArtifact' on topic 'PULL_ARTIFACT': ID-543705, Repository-beat/repo_name Tags-[deimos] Digest-sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1 Operator- OccurAt-2021-06-09 07:25:08
2021-06-09T07:24:32Z [INFO] [/pkg/notifier/notifier.go:204]: Handle notification with Handler 'P2PPreheat' on topic 'PUSH_ARTIFACT': ID-543705, Repository-beat/repo_name Tags-[deimos] Digest-sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1 Operator-robot$registry OccurAt-2021-06-09 07:24:31
2021-06-09T07:24:31Z [INFO] [/pkg/notifier/notifier.go:204]: Handle notification with Handler 'Replication' on topic 'PUSH_ARTIFACT': ID-543705, Repository-beat/repo_name Tags-[deimos] Digest-sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1 Operator-robot$registry OccurAt-2021-06-09 07:24:31
2021-06-09T07:24:31Z [INFO] [/pkg/notifier/notifier.go:204]: Handle notification with Handler 'ArtifactWebhook' on topic 'PUSH_ARTIFACT': ID-543705, Repository-beat/repo_name Tags-[deimos] Digest-sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1 Operator-robot$registry OccurAt-2021-06-09 07:24:31
2021-06-09T07:24:31Z [DEBUG] [/controller/event/handler/webhook/artifact/artifact.go:76]: cannot find policy for PUSH_ARTIFACT event: ID-543705, Repository-beat/repo_name Tags-[deimos] Digest-sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1 Operator-robot$registry OccurAt-2021-06-09 07:24:31
2021-06-09T07:24:31Z [INFO] [/pkg/notifier/notifier.go:204]: Handle notification with Handler 'AuditLog' on topic 'PUSH_ARTIFACT': ID-543705, Repository-beat/repo_name Tags-[deimos] Digest-sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1 Operator-robot$registry OccurAt-2021-06-09 07:24:31
2021-06-09T07:24:31Z [DEBUG] [/controller/event/handler/p2p/preheat.go:72]: preheat: artifact pushed beat/repo_name:[deimos]@sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1
2021-06-09T07:24:31Z [INFO] [/pkg/notifier/notifier.go:204]: Handle notification with Handler 'InternalArtifact' on topic 'PUSH_ARTIFACT': ID-543705, Repository-beat/repo_name Tags-[deimos] Digest-sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1 Operator-robot$registry OccurAt-2021-06-09 07:24:31
2021-06-09 07:24:31 time="2021-06-09T07:24:31.607906571Z" level=debug msg="s3aws.PutContent("/docker/registry/v2/repositories/beat/repo_name/_manifests/tags/phobos/index/sha256/e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1/link")" auth.user.name="harbor_registry_user" go.version=go[...] http.request.contenttype="application/vnd.docker.distribution.manifest.v2+json" http.request.host=registry.private[...] http.request.id=7ae27fd7-8747-4c6a-9e78-7b25b8c14cf2 http.request.method=PUT http.request.remoteaddr=192.168.5.23 http.request.uri="/v2/beat/repo_name/manifests/phobos" http.request.useragent="docker/[...] go/[...] git-commit/[...]-0ubuntu1~[...] kernel/[...] os/linux arch/amd64 UpstreamClient(Docker-Client/[...] \(linux\))" trace.duration=16.71554ms trace.file="/bitnami/blacksmith-sandox/docker-distribution-2.7.1/src/github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).PutContent" trace.id=1f4d0de6-1d05-47c2-bd6f-840482ac92a2 trace.line=110 vars.name="beat/repo_name" vars.reference=phobos 
2021-06-09 07:24:31 time="2021-06-09T07:24:31.59106436Z" level=debug msg="s3aws.PutContent("/docker/registry/v2/repositories/beat/repo_name/_manifests/revisions/sha256/e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1/link")" auth.user.name="harbor_registry_user" go.version=go[...] http.request.contenttype="application/vnd.docker.distribution.manifest.v2+json" http.request.host=registry.private[...] http.request.id=7ae27fd7-8747-4c6a-9e78-7b25b8c14cf2 http.request.method=PUT http.request.remoteaddr=192.168.5.23 http.request.uri="/v2/beat/repo_name/manifests/phobos" http.request.useragent="docker/[...] go/[...] git-commit/[...]-0ubuntu1~[...] kernel/[...] os/linux arch/amd64 UpstreamClient(Docker-Client/[...] \(linux\))" trace.duration=23.835078ms trace.file="/bitnami/blacksmith-sandox/docker-distribution-2.7.1/src/github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).PutContent" trace.id=0698f6d9-7332-437e-89ee-b7976974d765 trace.line=110 vars.name="beat/repo_name" vars.reference=phobos 
2021-06-09 07:24:31 time="2021-06-09T07:24:31.590042543Z" level=debug msg="s3aws.PutContent("/docker/registry/v2/repositories/beat/repo_name/_manifests/tags/deimos/index/sha256/e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1/link")" auth.user.name="harbor_registry_user" go.version=go[...] http.request.contenttype="application/vnd.docker.distribution.manifest.v2+json" http.request.host=registry.private[...] http.request.id=04ec8916-d743-4cd2-9f86-834468256266 http.request.method=PUT http.request.remoteaddr=192.168.5.23 http.request.uri="/v2/beat/repo_name/manifests/deimos" http.request.useragent="docker/[...] go/[...] git-commit/[...]-0ubuntu1~[...] kernel/[...] os/linux arch/amd64 UpstreamClient(Docker-Client/[...] \(linux\))" trace.duration=18.819055ms trace.file="/bitnami/blacksmith-sandox/docker-distribution-2.7.1/src/github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).PutContent" trace.id=f61897d7-0987-49b1-a5ac-433dff345d73 trace.line=110 vars.name="beat/repo_name" vars.reference=deimos 
2021-06-09 07:24:31 time="2021-06-09T07:24:31.570963693Z" level=debug msg="s3aws.PutContent("/docker/registry/v2/repositories/beat/repo_name/_manifests/revisions/sha256/e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1/link")" auth.user.name="harbor_registry_user" go.version=go[...] http.request.contenttype="application/vnd.docker.distribution.manifest.v2+json" http.request.host=registry.private[...] http.request.id=04ec8916-d743-4cd2-9f86-834468256266 http.request.method=PUT http.request.remoteaddr=192.168.5.23 http.request.uri="/v2/beat/repo_name/manifests/deimos" http.request.useragent="docker/[...] go/[...] git-commit/[...]-0ubuntu1~[...] kernel/[...] os/linux arch/amd64 UpstreamClient(Docker-Client/[...] \(linux\))" trace.duration=17.101312ms trace.file="/bitnami/blacksmith-sandox/docker-distribution-2.7.1/src/github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).PutContent" trace.id=31a124b2-51b0-4281-b695-c728335ff764 trace.line=110 vars.name="beat/repo_name" vars.reference=deimos 
2021-06-09 07:24:31 time="2021-06-09T07:24:31.567185219Z" level=debug msg="s3aws.PutContent("/docker/registry/v2/blobs/sha256/e1/e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1/data")" auth.user.name="harbor_registry_user" go.version=go[...] http.request.contenttype="application/vnd.docker.distribution.manifest.v2+json" http.request.host=registry.private[...] http.request.id=7ae27fd7-8747-4c6a-9e78-7b25b8c14cf2 http.request.method=PUT http.request.remoteaddr=192.168.5.23 http.request.uri="/v2/beat/repo_name/manifests/phobos" http.request.useragent="docker/[...] go/[...] git-commit/[...]-0ubuntu1~[...] kernel/[...] os/linux arch/amd64 UpstreamClient(Docker-Client/[...] \(linux\))" trace.duration=27.177579ms trace.file="/bitnami/blacksmith-sandox/docker-distribution-2.7.1/src/github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).PutContent" trace.id=f98e95c3-f9cd-41b8-98b3-2a77131ba099 trace.line=110 vars.name="beat/repo_name" vars.reference=phobos 
2021-06-09 07:24:31 time="2021-06-09T07:24:31.553802989Z" level=debug msg="s3aws.PutContent("/docker/registry/v2/blobs/sha256/e1/e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1/data")" auth.user.name="harbor_registry_user" go.version=go[...] http.request.contenttype="application/vnd.docker.distribution.manifest.v2+json" http.request.host=registry.private[...] http.request.id=04ec8916-d743-4cd2-9f86-834468256266 http.request.method=PUT http.request.remoteaddr=192.168.5.23 http.request.uri="/v2/beat/repo_name/manifests/deimos" http.request.useragent="docker/[...] go/[...] git-commit/[...]-0ubuntu1~[...] kernel/[...] os/linux arch/amd64 UpstreamClient(Docker-Client/[...] \(linux\))" trace.duration=21.441213ms trace.file="/bitnami/blacksmith-sandox/docker-distribution-2.7.1/src/github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).PutContent" trace.id=5a079dfd-314a-41a9-8b2e-e60ea4f905f7 trace.line=110 vars.name="beat/repo_name" vars.reference=deimos 
2021-06-09 07:24:31 time="2021-06-09T07:24:31.539964795Z" level=debug msg="s3aws.Stat("/docker/registry/v2/blobs/sha256/e1/e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1/data")" auth.user.name="harbor_registry_user" go.version=go[...] http.request.contenttype="application/vnd.docker.distribution.manifest.v2+json" http.request.host=registry.private[...] http.request.id=7ae27fd7-8747-4c6a-9e78-7b25b8c14cf2 http.request.method=PUT http.request.remoteaddr=192.168.5.23 http.request.uri="/v2/beat/repo_name/manifests/phobos" http.request.useragent="docker/[...] go/[...] git-commit/[...]-0ubuntu1~[...] kernel/[...] os/linux arch/amd64 UpstreamClient(Docker-Client/[...] \(linux\))" trace.duration=9.76884ms trace.file="/bitnami/blacksmith-sandox/docker-distribution-2.7.1/src/github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).Stat" trace.id=2f00905e-891b-4fbf-911c-f4a1adae1373 trace.line=155 vars.name="beat/repo_name" vars.reference=phobos 
2021-06-09 07:24:31 time="2021-06-09T07:24:31.532295856Z" level=debug msg="s3aws.Stat("/docker/registry/v2/blobs/sha256/e1/e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1/data")" auth.user.name="harbor_registry_user" go.version=go[...] http.request.contenttype="application/vnd.docker.distribution.manifest.v2+json" http.request.host=registry.private[...] http.request.id=04ec8916-d743-4cd2-9f86-834468256266 http.request.method=PUT http.request.remoteaddr=192.168.5.23 http.request.uri="/v2/beat/repo_name/manifests/deimos" http.request.useragent="docker/[...] go/[...] git-commit/[...]-0ubuntu1~[...] kernel/[...] os/linux arch/amd64 UpstreamClient(Docker-Client/[...] \(linux\))" trace.duration=33.502592ms trace.file="/bitnami/blacksmith-sandox/docker-distribution-2.7.1/src/github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).Stat" trace.id=6de73caf-15bb-4c9c-abf8-382bfcd46ebe trace.line=155 vars.name="beat/repo_name" vars.reference=deimos 
ywk253100 commented 3 years ago

According to the logs, the images beat/repo_name:phobos and beat/repo_name:deimos have the same digest sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1.

When pushing images with the same digest into the same repository, it may hit a bug.

@phoinixgrr Thanks for your input. I'll create a PR to fix it
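
For reference, one common PostgreSQL pattern for making this kind of concurrent insert race-safe is an upsert that ignores the duplicate and simply re-reads the existing row. This is only an illustrative sketch of the general technique, not necessarily what the upcoming fix does; the values are taken from the log lines above, and several NOT NULL columns are omitted for brevity:

-- If a concurrent push of the same digest already created the row,
-- the insert becomes a no-op instead of aborting the transaction.
INSERT INTO artifact (repository_id, repository_name, project_id, digest, type, push_time)
VALUES (1955, 'beat/repo_name', 37,
        'sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1',
        'IMAGE', now())
ON CONFLICT ON CONSTRAINT unique_artifact DO NOTHING;

-- Either way, the surviving row can then be read back.
SELECT id
FROM artifact
WHERE repository_id = 1955
  AND digest = 'sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1';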

phoinixgrr commented 3 years ago

According to the logs, the images beat/repo_name:phobos and beat/repo_name:deimos have the same digest sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1.

When pushing images with the same digest into the same repository, it may hit a bug.

@phoinixgrr Thanks for your input. I'll create a PR to fix it

Awesome 🚀

Let me know if you end up building and publishing a Harbor image containing the hotfix. We can deploy it on our side to confirm that it is indeed fixed. We hit this 3-4 times per day.

Best

bitsf commented 3 years ago

pr https://github.com/goharbor/harbor/pull/15104 is fixing this

phoinixgrr commented 3 years ago

pr #15104 is fixing this

Awesome @bitsf . Thanks for the update.

There were two (2) issues pointed out:

  1. The issue with extreme locking on the database when the following quota-calculation query is triggered: harbor@core_db:[21602]:LOG: duration: 13985.656 ms execute <unnamed>: SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time" FROM "quota" WHERE "reference" = $1 AND "reference_id" = $2 FOR UPDATE. This was worked around with https://github.com/goharbor/harbor/issues/15048#issuecomment-855767153
  2. The bug hit when pushing images with the same digest into the same repository. Specifically: ERROR: duplicate key value violates unique constraint "unique_artifact"

Correct me if I am wrong, but the PR addresses issue 2. Are there any plans to improve/work on issue 1?