alanking opened 10 months ago
Right, we only have to make the replication jobs each 'check' before doing any work to replicate. Then, if it's already in the desired state, return early... no errors.
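Roughly, the early-return check could look like this at the icommand level (a sketch only; the real change would live inside the plugin's replication delay rule, and the collection, object, and resource names here are hypothetical):

```bash
# Hypothetical illustration of "check first, replicate only if needed, otherwise return early".
OBJ_COLL='/tempZone/home/alice'
OBJ_NAME='foo.txt'
DEST_RESC='tier1_resc'

# Does a replica already exist under the destination resource hierarchy?
if iquest "select DATA_RESC_HIER where COLL_NAME = '${OBJ_COLL}' and DATA_NAME = '${OBJ_NAME}' and DATA_RESC_HIER like '${DEST_RESC}%'" | grep -q 'DATA_RESC_HIER'; then
    # Already in the desired state -- nothing to do, no errors.
    exit 0
fi

irepl -R "${DEST_RESC}" "${OBJ_COLL}/${OBJ_NAME}"
```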
Oh, the first three fire at the same time... so it still might be a little noisy. Hmm....
Bug Report
iRODS Version, OS and Version
- iRODS server: 4.3.1
- Storage tiering plugin: 4.3.1
- OS: CentOS 7
What did you try to do?
Set up a tier group with 3 resource hierarchies:
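For reference, a minimal tier group of this shape can be set up by tagging each hierarchy's root resource with the group AVU and a tiering time, per the plugin's documentation (the resource and group names below are illustrative, not the ones actually used in this report):

```bash
# Illustrative tier group setup -- names are placeholders, not the real hierarchies.
imeta add -R tier0_resc irods::storage_tiering::group example_group 0
imeta add -R tier1_resc irods::storage_tiering::group example_group 1
imeta add -R tier2_resc irods::storage_tiering::group example_group 2

# Seconds an object may rest on a tier before it becomes a migration candidate.
imeta add -R tier0_resc irods::storage_tiering::time 30
imeta add -R tier1_resc irods::storage_tiering::time 60
```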
Then I put an object in...
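For example (again with illustrative names), placing an object on the lowest tier:

```bash
# Put a data object onto the tier 0 resource so the tiering policy can act on it.
iput -R tier0_resc foo.txt
```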
Then try to tier it out:
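Something along the lines of the plugin's documented manual invocation (the rule engine plugin instance name and rule text are assumptions and may differ per deployment):

```bash
# Illustrative manual invocation of the tiering policy for the group.
cat > tier_group_invocation.r <<'EOF'
{
   "rule-engine-operation" : "irods_policy_schedule_storage_tiering",
   "storage-tier-groups" : [ "example_group" ]
}
INPUT null
OUTPUT ruleExecOut
EOF

irule -r irods_rule_engine_plugin-unified_storage_tiering-instance -F tier_group_invocation.r
```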
Expected behavior
I expected the tier-out to occur with no errors or issues.
Observed behavior (including steps to reproduce, if applicable)
After a while, 3 migrations are scheduled:
The tier-out succeeds!
...but there are 2 migration tasks remaining:
...and a bunch of errors in the log. The first errors have to do with duplicate entries in the catalog:
And later, others appear having to do with a missing source replica:
Eventually, the migration jobs fail a sufficient number of times and are removed from the queue.
It seems like all the migration jobs start at the same time and one of them wins the race, locking out the others. It is mildly concerning that it gets all the way to the point of registering the physical path before an error occurs (the "database race", I assume: https://github.com/irods/irods/issues/5742#issuecomment-905439542) but after that scare, logical locking should keep things sane.
If we consider the "tracked" replica to be the "representative" replica for the group of replicas, it is the only one that needs to be scheduled for replication. The plugin already seems to take care of trimming the other replicas, so we don't need to worry about that.
Open to other ideas.