goharbor / harbor

An open source trusted cloud native registry project that stores, signs, and scans content.
https://goharbor.io
Apache License 2.0

Replication breaks artifact original push order -> renders retention rules useless #17401

Closed tomassitar closed 1 year ago

tomassitar commented 2 years ago

Expected behavior and actual behavior:
Expected: When I replicate a complete repository from an endpoint (or even a whole endpoint), the artifacts should retain their original push order so that retention rules can reliably delete the oldest ones.
Actual: Artifacts are replicated in a random order, so applying a retention rule deletes even fresh artifacts.

Steps to reproduce the problem:

  1. Replicate a whole repository from an endpoint that contains, for example, 10 artifacts, all pushed one by one over some period of time.
  2. Set a retention policy to retain the most recently pushed 7 artifacts
  3. The 3 "oldest" artifacts get deleted, but in reality those were simply the ones that finished replicating first, not the oldest in terms of their content/creation time.
Example:

Digest Tag Kind Labels PushedTime PulledTime CreatedTime Retention
sha256:b18ce4feee54cb7b3d6f6ccbab30b7a4a61d38741c3fd469203e4251a9fd2449 2021.11.09.160838-g6c092d73ead-31 image 2022/08/14 14:45:21 2022/08/14 14:45:20 RETAIN
sha256:51f4b57d07ac47b88acfe5c71561ce4feee54b7764c1bd849334e1ad2b286b9f 2022.03.22.155223-g7fbb37cbdee-14 image 2022/08/14 14:45:17 2022/08/14 14:45:17 RETAIN
sha256:9aeee3d7310d5dce4feee52b97ed14ded73a22ad682993cf9cb7bcc24de0acb9 2022.03.17.141325-g217b3a971ac-5 image 2022/08/14 14:45:14 2022/08/14 14:45:14 RETAIN
sha256:39cc5085659173f87e1203668d1cb2cbcfd40091dace4feee5a5f6366d09f650 latest,2022.06.06.152716-gad95b797c27-6 image 2022/08/14 14:45:07 2022/08/14 14:45:06 RETAIN
sha256:14db1d1f8496c25b8e68abce4feee530c4c3afd30b90f5769a88902c03c602eb 2021.11.02.155904-ge23cfd7935b-27 image 2022/08/14 14:44:56 2022/08/14 14:44:56 RETAIN
sha256:f6ace7f5d9b64d71ce4feee5eb1022ec12b781c6adb5fc07445f249c2d423b36 2022.03.22.152349-gdf7b5a3478a-8 image 2022/08/14 14:44:50 2022/08/14 14:44:50 DEL
sha256:485c6310ace4feee52ca504ec2daf3eb449cf8fb29822df8b609a2e05dc0bc15 2022.03.17.123628-g63cc883c311-4 image 2022/08/14 14:44:43 2022/08/14 14:44:43 DEL
sha256:2541e1bc3680db3113120fe2a0db3738aa9d39b69307575974ce4feee5005869 2022.03.17.173230-g217b3a971ac-1 image 2022/08/14 14:44:33 2022/08/14 14:44:33 DEL
sha256:e454ffd5a0e986449ece4feee53d3c25abbddd407d00eb3dc6207931d280d9e6 2021.11.02.160814-ge23cfd7935b-1 image 2022/08/14 14:44:16 2022/08/14 14:44:15 DEL

The artifact tags represent the real build and push times of these artifacts to the original repository. There are 9 artifacts, of which the 4 oldest should be deleted.

The oldest are: 2022.03.17.141325-g217b3a971ac-5, 2021.11.09.160838-g6c092d73ead-31, 2021.11.02.160814-ge23cfd7935b-1, and 2021.11.02.155904-ge23cfd7935b-27.

As you can see, only one of the oldest artifacts is marked for deletion, because the rest were replicated in the wrong order and their "push time" has changed.

I presumed this issue was caused by multiple workers running simultaneously, which might have resulted in some images getting replicated faster, but even setting just one job worker produced the same outcome.

Versions:

chlins commented 2 years ago

The replication jobs are executed by jobservice workers, and the number of workers is 10 by default. Because the jobs are executed in parallel, Harbor cannot guarantee their order; they often have different execution times.

tomassitar commented 2 years ago

Thank you for such a fast reply. Even when I set max_job_workers: 1 in the harbor.yml file and reinstall Harbor, the result is the same.

The issue seems to be the tags list retrieved from the endpoint. I tried to get the repository content with cURL, and the order of the tags is the same as the order in which they get replicated:

  "tags": [
    "2021.11.09.160838-g6c092d73ead-31",
    "2022.03.22.155223-g7fbb37cbdee-14",
    "2022.03.17.141325-g217b3a971ac-5",
    "latest",
    "2021.11.02.155904-ge23cfd7935b-27",
    "2022.03.22.152349-gdf7b5a3478a-8",
    "2022.03.17.123628-g63cc883c311-4",
    "2022.03.17.173230-g217b3a971ac-1",
    "2022.06.06.152716-gad95b797c27-6",
    "2021.11.02.160814-ge23cfd7935b-1"
  ]

Would it be possible to somehow sort the to-be-replicated tags of the repository during the replication?

chlins commented 2 years ago

Yes, Harbor splits replication into jobs using the default order of the tags, and by design the jobservice does not care about the sequence of these tasks. Even if we sorted the tags before submitting the jobs, with more than one job worker the jobservice cannot guarantee that the first submitted job completes before the second, so you could open a discussion for this issue. Currently, a workable way to meet your requirement is to control the execution yourself, e.g.: list the tags -> sort them yourself -> create a replication for exactly one tag -> trigger the replication.
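
For illustration, a rough sketch of that list -> sort -> replicate-one-tag-at-a-time loop could look like the Python below. The host names, credentials, repository name and policy ID are made-up placeholders, and the Harbor endpoint paths, payload fields and status values are assumptions based on the v2.0 API, so verify them against the API documentation of your Harbor version before relying on this.

# Rough, untested sketch: list tags, sort them yourself, then replicate one
# tag at a time so the destination receives them in order. All names below
# are placeholders; endpoint paths and payload fields are assumptions.
import time
import requests

SOURCE = "https://source-registry.example.com"  # hypothetical source registry
HARBOR = "https://harbor.example.com"           # hypothetical destination Harbor
API = f"{HARBOR}/api/v2.0"
AUTH = ("admin", "secret")                      # replace with real credentials
REPO = "myproject/myrepo"                       # hypothetical repository
POLICY_ID = 42                                  # a pre-created manual replication policy

# 1. List the tags on the source (Docker Registry HTTP API v2).
tags = requests.get(f"{SOURCE}/v2/{REPO}/tags/list", auth=AUTH).json()["tags"]

# 2. Sort oldest-first. This works here only because the build time is embedded
#    in the tag name; adjust the sort key for other naming schemes.
tags.sort()

# 3. Replicate one tag at a time: narrow the policy's tag filter to a single tag,
#    trigger an execution, and wait for it to finish before starting the next.
for tag in tags:
    policy = requests.get(f"{API}/replication/policies/{POLICY_ID}", auth=AUTH).json()
    policy["filters"] = [{"type": "name", "value": REPO},
                         {"type": "tag", "value": tag}]
    requests.put(f"{API}/replication/policies/{POLICY_ID}", json=policy, auth=AUTH)

    resp = requests.post(f"{API}/replication/executions",
                         json={"policy_id": POLICY_ID}, auth=AUTH)
    # The Location header is assumed to point at the new execution resource.
    exec_url = HARBOR + resp.headers["Location"]
    while requests.get(exec_url, auth=AUTH).json().get("status") == "InProgress":
        time.sleep(5)

Waiting for each execution to finish before triggering the next is what preserves the push order on the destination; running the executions in parallel would reintroduce the ordering problem described in this issue.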

tomassitar commented 2 years ago

We currently have 20+ projects with 2000+ repositories and tens of thousands of tags. Although it is technically possible to script the replication of individual tags and slowly add them one by one to each repository, it would take forever. I understand this is not an issue for standard usage, but when you run several separate Harbor instances across multiple datacenters as a failsafe and one gets corrupted, you cannot just create a new instance and replicate the data to it, because that would break the artifact order again. It does not even need to get corrupted: a simple maintenance downtime would be enough if more tags were pushed to a repository while the instance was down.

Maybe I am just missing something, but apart from your proposition, which to me seems more like a workaround, I do not see any other automated way to make a 1:1 copy of the data without literally having to clone the other instance.

A solution for me would be either adding the ability to sort tags during replication, or adding an option to the retention policy's tag filter to work with date and time. This could potentially be done with regex pattern matching, I guess.

rrgadeev commented 2 years ago

Why don't you set up replication by events, including deletions, and set up a retention policy in the source repository?

tomassitar commented 2 years ago

Hi, maybe I misunderstood your proposition, but it seems like you are referring to a situation where you push/pull/delete a single artifact during normal usage. That is not the issue; once the Harbor instance is up and running, it does its job well. The issue is when you need to replicate more than one tag/artifact (for example during a migration from a different technology to Harbor, when you add a new instance to the cluster, or even after a maintenance downtime, should it be necessary). In that case the artifacts get replicated in a random order, which breaks retention by push time: Harbor does not care about their real age and deletes even recent images.

tomassitar commented 2 years ago

This is an alarming example and sadly a real one. Although the images have been replicated in a sorted order, the order is inverted, which would result in the deletion of the newest tags first.

Digest Tag Kind Labels PushedTime PulledTime CreatedTime Retention
sha256:47a6219fb8c3b1f223755f9a93ba80b73b94abbcd64692cd580393fce605c1e2 2022.08.16.133233-gb6d2adf4ec7-2 image 2022/08/23 09:55:05 2022/08/23 09:55:05 RETAIN
sha256:744c30dc96671f316e9ddf64d9ea2afd1beb3a554850a38ae5bff39926f6da2c 2022.08.16.134044-gb6d2adf4ec7-3 image 2022/08/23 09:55:03 2022/08/23 09:55:03 RETAIN
sha256:95c46cc3ac9d1b023c553a42d1a63e440380ff1e067a038768a35c2ff8f9e017 2022.08.23.093513-gcc992b1c5e0-26 image 2022/08/23 09:53:26 2022/08/23 09:53:26 DEL
sha256:e8316ced3ef2bdf68ad439cb40782214d3a6547561a284fd981bbb0812d4ac03 2022.08.23.100307-g67e621539c3-27 image 2022/08/23 09:53:18 2022/08/23 09:53:18 DEL
sha256:768cea76aa61bbe5d2d210d7462264f893610a1cc637959d4b2542691014f906 2022.08.23.103404-gf3dbfddae0e-28,latest image 2022/08/23 09:53:11 2022/08/23 09:53:11 DEL

I would definitely call this a bug. It should "replicate" the images, not just "randomly re-push" them; their properties should remain as close as possible to what they were in the source endpoint.

WORKAROUND: We have found a workaround. All images contain their creation date. Once they are replicated, it is possible to read this creation date (Harbor actually reads and stores these properties automatically, and the date is visible directly in the GUI on the artifact's tab) and replace the incorrect push dates in the database with the creation date. This at least restores the correct artifact order and allows correct retention.

-- Overwrite each artifact's push time with the image creation time stored in its extra_attrs.
update artifact a
  set push_time = cast(a.extra_attrs::json->>'created' as timestamp);

-- Do the same for every tag, taking the creation time from its parent artifact.
update tag t
  set push_time = (select cast(a.extra_attrs::json->>'created' as timestamp)
                     from artifact a
                    where a.id = t.artifact_id);
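
Depending on the deployment, these statements can be run with psql against Harbor's PostgreSQL database. As written they rewrite the push time of every artifact and tag, so back up the database first and add a WHERE clause if only some projects were affected by the replication.
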
rrgadeev commented 2 years ago

> Hi, maybe I misunderstood your proposition, but it seems like you are referring to a situation where you push/pull/delete a single artifact during normal usage. That is not the issue; once the Harbor instance is up and running, it does its job well. The issue is when you need to replicate more than one tag/artifact (for example during a migration from a different technology to Harbor, when you add a new instance to the cluster, or even after a maintenance downtime, should it be necessary). In that case the artifacts get replicated in a random order, which breaks retention by push time: Harbor does not care about their real age and deletes even recent images.

I meant event-based replication, for Harbor-to-Harbor, where the source itself deletes outdated data and replicates the changes. Yes, I understand you; I faced a similar problem myself. In my case, after all artifacts had been replicated forcibly, event-based replication was enabled and the source is responsible for the cleanup rules. The problem occurs if the source is unavailable or deleted.

tomassitar commented 2 years ago

I see your point now and the idea is good, but it would not work for us. The main issue is that some repositories get updated often while others receive fewer than one update per month. We retain the last 30 images for production stuff, so it could take a few years for some repositories to catch up, and the new Harbor instance would therefore have to run in a "slave mode" practically forever. It would also be a pain from the configuration standpoint, as we use puppet/consul-template to configure and maintain VMs and we would have to differentiate between a fully deployed instance with a usable image history and a fresh instance with semi-broken timestamps.

rrgadeev commented 2 years ago

Thoughts out loud: as a solution, you could create your own service that retrieves the list of artifacts via the REST API, sorts them, and replicates them in turn for each repository. The replication itself can also be triggered via the REST API.

tomassitar commented 2 years ago

Yes, this was one of the first solutions I came up with, too. I could also simply re-push every image one by one with a script after sorting the catalog first, but this presumes that you have sortable tags, which is not always the case, as some devs use commit IDs as tags. In the end, rewriting the artifact push time and tag push time post-replication with the artifact's creation timestamp was the simplest, fastest and most reliable solution: it required just two SQL commands and was fairly safe to use.

Anyway, I can't help but see this as a bug. What is the purpose of a replication feature that breaks the usage of another very important feature, content retention and removal?

github-actions[bot] commented 1 year ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 30 days with no activity. If this issue is still relevant, please re-open a new issue.