Closed: tomassitar closed this issue 1 year ago.
The replication jobs are executed by jobservice workers, and the number of workers is 10 by default, so jobs run in parallel. Harbor cannot guarantee the order of these jobs because they often have different execution times.
Thank you for such a fast reply. Even when I set `max_job_workers: 1` in the `harbor.yml` file and reinstall Harbor, the result is the same.
The issue seems to be the tag list retrieved from the endpoint. I tried to get the repository content with cURL, and the order of tags is the same as the order in which the tags get replicated:
"tags": [
"2021.11.09.160838-g6c092d73ead-31",
"2022.03.22.155223-g7fbb37cbdee-14",
"2022.03.17.141325-g217b3a971ac-5",
"latest",
"2021.11.02.155904-ge23cfd7935b-27",
"2022.03.22.152349-gdf7b5a3478a-8",
"2022.03.17.123628-g63cc883c311-4",
"2022.03.17.173230-g217b3a971ac-1",
"2022.06.06.152716-gad95b797c27-6",
"2021.11.02.160814-ge23cfd7935b-1"
]
Would it be possible to somehow sort the to-be-replicated tags of the repository during the replication?
Yes, Harbor splits up replication jobs by the default order of tags. By the design of jobservice, Harbor does not care about the sequence of these tasks: even if we sorted the tags before submitting the jobs, with more than one job worker jobservice cannot guarantee that the first submitted job completes earlier than the second. You can open a discussion for this request. Currently, a practical way to meet your requirement is to control the execution yourself, e.g.: list tags -> sort them yourself -> create a replication rule for the exact tag -> trigger the replication.
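The suggested flow (list the tags, sort them yourself, then replicate one tag at a time) can be sketched roughly as below. The `/api/v2.0/...` path follows Harbor's v2 REST API, but treat the client code as an untested outline; the `sort_tags` helper and its timestamp assumption about the tag format are mine, not Harbor's:

```python
import json
import re
import urllib.request

# Assumption: tags embed a build timestamp such as
# "2022.03.22.155223-g7fbb37cbdee-14". Tags without one ("latest") go last.
TS_RE = re.compile(r"^(\d{4})\.(\d{2})\.(\d{2})\.(\d{6})")

def sort_tags(tags):
    """Sort tags oldest-first by their embedded timestamp prefix."""
    def key(tag):
        m = TS_RE.match(tag)
        # Fixed-width timestamps compare chronologically as strings.
        return (0, m.group(0)) if m else (1, tag)
    return sorted(tags, key=key)

def list_tags(base_url, project, repo):
    """List tag names via Harbor's v2 API (auth and pagination omitted)."""
    url = (f"{base_url}/api/v2.0/projects/{project}/repositories/{repo}"
           f"/artifacts?with_tag=true")
    with urllib.request.urlopen(url) as resp:
        artifacts = json.load(resp)
    return [t["name"] for a in artifacts for t in (a.get("tags") or [])]
```

With the sorted list in hand, one replication rule per tag could then be created and triggered in order, waiting for each execution to finish before starting the next.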
We currently have 20+ projects with 2000+ repositories, containing tens of thousands of tags. Although it is technically possible to script the replication of individual tags and slowly add them one by one to each repository, it would take forever. I understand that this is not an issue for standard usage, but when you run multiple separate Harbor instances across datacenters as a failsafe and one gets corrupted, you can't just create a new instance and replicate the data to it, because that would break the artifact order again. The instance does not even need to get corrupted: a simple maintenance downtime would be enough if more tags were pushed to a repo while the instance was down.
Maybe I am just missing something, but I do not see any other automated way to make a 1:1 copy of the data, short of literally cloning the other instance, than your proposition, which to me seems more like a workaround.
A solution for me would be either adding the ability to sort tags during replication, or adding an option to the retention policy's tag filter to work with date and time. The latter could potentially be done with regex pattern matching, I guess.
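To illustrate the second idea: since these tags embed their build timestamp, a date-aware filter could parse it and compare real dates. This is a sketch of the concept only, not an existing Harbor feature; the regex assumes the tag format shown earlier in this thread:

```python
import re
from datetime import datetime

# Matches tags like "2022.03.22.155223-g7fbb37cbdee-14"
# (timestamp, then a git short-hash, then a build counter).
TAG_TS = re.compile(r"^(\d{4}\.\d{2}\.\d{2}\.\d{6})-g[0-9a-f]+-\d+$")

def tag_build_time(tag):
    """Return the datetime embedded in the tag, or None for tags like 'latest'."""
    m = TAG_TS.match(tag)
    if not m:
        return None
    return datetime.strptime(m.group(1), "%Y.%m.%d.%H%M%S")

def older_than(tag, cutoff):
    """True if the tag has a parseable timestamp older than the cutoff."""
    ts = tag_build_time(tag)
    return ts is not None and ts < cutoff
```

A retention rule built on something like `older_than` would not depend on replication order at all, which is exactly what broke here.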
Why don't you set up replication by events, including deletion, and set up a retention policy in the source repository?
Hi, maybe I misunderstood your proposition, but it seems like you are referring to a situation where you push/pull/delete a single artifact during normal usage. That is not the issue; once the Harbor instance is up and running, it does its job well. The issue is when you need to replicate more than one tag/artifact (for example, a migration from a different technology to Harbor, adding a new instance to the cluster, or even a maintenance downtime should it be necessary). In this case, artifacts get replicated in random order, which breaks retention by push time: Harbor doesn't care about their real age and deletes even recent images.
This is an alarming example, and sadly a real one. The images happened to be replicated in a sorted order, but that order is inverted, which results in the newest tags being deleted first.
| Digest | Tag | Kind | Labels | PushedTime | PulledTime | CreatedTime | Retention |
|---|---|---|---|---|---|---|---|
| sha256:47a6219fb8c3b1f223755f9a93ba80b73b94abbcd64692cd580393fce605c1e2 | 2022.08.16.133233-gb6d2adf4ec7-2 | image | | 2022/08/23 09:55:05 | 2022/08/23 09:55:05 | | RETAIN |
| sha256:744c30dc96671f316e9ddf64d9ea2afd1beb3a554850a38ae5bff39926f6da2c | 2022.08.16.134044-gb6d2adf4ec7-3 | image | | 2022/08/23 09:55:03 | 2022/08/23 09:55:03 | | RETAIN |
| sha256:95c46cc3ac9d1b023c553a42d1a63e440380ff1e067a038768a35c2ff8f9e017 | 2022.08.23.093513-gcc992b1c5e0-26 | image | | 2022/08/23 09:53:26 | 2022/08/23 09:53:26 | | DEL |
| sha256:e8316ced3ef2bdf68ad439cb40782214d3a6547561a284fd981bbb0812d4ac03 | 2022.08.23.100307-g67e621539c3-27 | image | | 2022/08/23 09:53:18 | 2022/08/23 09:53:18 | | DEL |
| sha256:768cea76aa61bbe5d2d210d7462264f893610a1cc637959d4b2542691014f906 | 2022.08.23.103404-gf3dbfddae0e-28,latest | image | | 2022/08/23 09:53:11 | 2022/08/23 09:53:11 | | DEL |
I would definitely call this a bug. It should "replicate" the images, not just "randomly re-push" them; their properties should remain as close as possible to what they were in the source endpoint.
WORKAROUND: We have found a workaround. All images contain their creation date. Once they are replicated, it is possible to read this creation date (Harbor actually reads and stores these properties automatically, and the date is visible directly in the GUI on the artifact's tab) and replace the incorrect push dates in the database with it. This at least restores the correct artifact order and allows correct retention.
```sql
-- Overwrite each artifact's push time with its creation timestamp
UPDATE artifact a
SET push_time = CAST(a.extra_attrs::json->>'created' AS timestamp);

-- Do the same for tags, via the owning artifact
UPDATE tag t
SET push_time = (SELECT CAST(a.extra_attrs::json->>'created' AS timestamp)
                 FROM artifact a WHERE a.id = t.artifact_id);
```
I meant event-based replication, for Harbor-to-Harbor: the source itself deletes outdated data and replicates the changes. Yes, I understand you; I faced a similar problem myself. Therefore, after all artifacts were replicated forcibly, event replication was enabled and the source became responsible for the cleanup rules. The problem occurs if the source is unavailable or deleted.
I see your point now and the idea is good, but it would not work for us. The main issue is that some repositories get updated often while others receive fewer than one update per month. We retain the last 30 images for production, so it could take years for some repositories to catch up; the new Harbor instance would therefore have to run in a "slave mode" practically forever. It would also be a pain from the configuration standpoint, as we use puppet/consul-template to configure and maintain VMs and we would have to differentiate between a fully deployed instance with a usable image history and a fresh instance with semi-broken timestamps.
Thinking out loud: as a solution, you could create your own service that receives the list of artifacts via the REST API, sorts them, and replicates them in turn for each repository. The replication itself can also be triggered via the REST API.
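Sketching that idea: such a service could page through a repository's artifacts and order them oldest-first by push time before replicating. The endpoint path follows Harbor's v2 API, but the whole thing is an untested outline with auth omitted:

```python
import json
import urllib.request

def list_artifacts(base_url, project, repo, page_size=100):
    """Page through the (assumed) Harbor v2 artifact listing."""
    artifacts, page = [], 1
    while True:
        url = (f"{base_url}/api/v2.0/projects/{project}/repositories/{repo}"
               f"/artifacts?page={page}&page_size={page_size}")
        with urllib.request.urlopen(url) as resp:
            batch = json.load(resp)
        if not batch:
            return artifacts
        artifacts.extend(batch)
        page += 1

def oldest_first(artifacts):
    # push_time is an ISO 8601 string, so string order is chronological.
    return sorted(artifacts, key=lambda a: a["push_time"])
```

The service would then replicate `oldest_first(...)` one artifact at a time, waiting for each execution to complete so the destination's push order matches the source's.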
Yes, this was one of the first solutions I came up with, too. I could also simply re-push every image one by one with a script, sorting the catalog first, but that presumes sortable tags, which is not always the case since some devs use commit IDs as tags. In the end, rewriting the artifact and tag push times post-replication with the artifact's creation timestamp was the simplest, fastest, and most reliable solution: it required just two SQL commands and was fairly safe to use.
Anyway, I can't help but see this as a bug. What purpose does a replication feature serve if it breaks another very important feature, content retention and removal?
This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.
This issue was closed because it has been stalled for 30 days with no activity. If this issue is still relevant, please re-open a new issue.
Expected behavior and actual behavior: Expected: when I replicate a complete repository from an endpoint (or even a whole endpoint), artifacts should retain their push order so that retention rules can efficiently delete the oldest ones. Actual: artifacts are replicated in random order, which results in the deletion of even fresh artifacts when a retention rule is applied.
Steps to reproduce the problem:
The artifacts tags represent the real build and push time of these artifacts to the original repository. There are 9 artifacts of which 4 oldest should be deleted.
The oldest are:
2022.03.17.141325-g217b3a971ac-5
2021.11.09.160838-g6c092d73ead-31
2021.11.02.160814-ge23cfd7935b-1
2021.11.02.155904-ge23cfd7935b-27
As you can see, only one of the oldest artifacts is marked for deletion, because the others were replicated in the wrong order and their "push time" changed.
I presumed this issue was caused by multiple workers running simultaneously, which might have resulted in some images being replicated faster, but even setting just one job worker produced the same outcome.
Versions: