irods/irods

Open Source Data Management Software
https://irods.org
BSD 3-Clause "New" or "Revised" License

Add zone-wide logical locking on data object writes/creates #3848

Closed: alanking closed this issue 3 years ago

alanking commented 6 years ago

There is a possibility of a race condition when clients write to the same logical path with a force-flag-enabled iput. Example scenario: Client 1 begins a forced overwrite (iput -f) of a data object on a replication resource and, before its replication completes, Client 2 begins a second forced overwrite of the same data object, triggering a replication of its own.

The goal of this issue is to prevent the replication operation of Client 2 in this scenario. This should be done with the following approach:

When a data object is force-put (iput -f) to a replication resource, we should check whether the data object already has replicas in that resource tree. If so, we should set a flag indicating that the data object should not be replicated.

There should be a context string configuration option on the replication resource, called skip_replication_on_overwrite, to enable this behavior.

Because the extra replication is prevented, when Client 2 completes its transfer the other replicas (presumably from Client 1) will not be overwritten, but they should be marked stale (no &). Any subsequent rebalance will then repave the stale replicas. Please include tests for this.
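A minimal sketch of the proposed check, with hypothetical names (the real logic would live in the replication resource plugin):

// Sketch only -- names are hypothetical, not the actual plugin interface.
// Decide whether the replication resource should skip its replication
// step after a forced overwrite, per the proposal above.
bool should_skip_replication(bool is_forced_overwrite,
                             bool replica_exists_in_resource_tree,
                             bool skip_on_overwrite_enabled)
{
    // Skip only when this is an overwrite (iput -f), the data object
    // already has a replica somewhere in this resource tree, and the
    // administrator opted in via skip_replication_on_overwrite.
    return is_forced_overwrite
        && replica_exists_in_resource_tree
        && skip_on_overwrite_enabled;
}

When the check returns true, the sibling replicas are left in place but marked stale, as described above.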

trel commented 6 years ago

The scenario that created this issue involved a replication resource with a random resource as a child. When Client 2 finished, a new 'extra' replica was created because the random child picked a new destination (different from the replica location that same random resource selected for Client 1).

The key-value pair should probably be skip_replication_on_overwrite=true to trigger this behavior.
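Context strings are set administratively (iadmin modresc <rescName> context <newContext>) and are conventionally semicolon-delimited key=value pairs. A self-contained sketch, assuming that convention, of checking the proposed option:

#include <map>
#include <sstream>
#include <string>

// Sketch only: parse a context string of the form "key1=val1;key2=val2"
// and report whether the proposed skip_replication_on_overwrite option
// is enabled.
std::map<std::string, std::string> parse_context_string(const std::string& ctx)
{
    std::map<std::string, std::string> kvp;
    std::istringstream ss{ctx};
    std::string pair;
    while (std::getline(ss, pair, ';')) {
        const auto eq = pair.find('=');
        if (eq != std::string::npos) {
            kvp[pair.substr(0, eq)] = pair.substr(eq + 1);
        }
    }
    return kvp;
}

bool skip_replication_on_overwrite(const std::string& ctx)
{
    const auto kvp = parse_context_string(ctx);
    const auto it = kvp.find("skip_replication_on_overwrite");
    return it != kvp.end() && it->second == "true";
}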

keithj commented 6 years ago

@jc18 has asked me to comment on our use of multiple clients writing (with force) concurrently to the same data object. While it's useful for us to know of this issue, we don't believe that it affects our use of iRODS because of the control we have over the way our iRODS clients are launched and over which data are processed concurrently.

Our clients each operate on a single sequencing run at a time. Sequencing data in a run have globally unique file paths both on local disk and as iRODS data objects. Clients are launched by LSF using a pre-exec condition which checks that no other jobs are already running on any data from that run. As each client is single-threaded and uses a single connection to iRODS, we avoid the condition described above.

alanking commented 6 years ago

This enhancement was meant to address the problem in #3665 but it is clear that it will not help there. As such, this issue is no longer urgent and will be bumped to a later version.

alanking commented 4 years ago

The race condition described above will be addressed by what we are calling zone-wide logical locking on data objects. 3 new replica statuses will be needed in addition to the 3rd state being introduced in #4343.

In order to know which status a replica should return to when a read lock is released, we need to use the data_status column to store a reference count.
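A sketch of the reference-count idea (the structure and function names here are hypothetical; the proposal keeps the count, and the status to restore, in the data_status column):

// Sketch only: a replica under read lock must remember which status to
// return to when the last reader releases it, hence the reference count.
struct replica_lock_state
{
    int original_status = 1;  // status to restore (e.g. 1 == good, 0 == stale)
    int reader_count = 0;     // reference count kept in data_status
};

void acquire_read_lock(replica_lock_state& s, int current_status)
{
    if (s.reader_count == 0) {
        s.original_status = current_status;  // remember the pre-lock status
    }
    ++s.reader_count;
}

bool release_read_lock(replica_lock_state& s, int& restored_status)
{
    --s.reader_count;
    if (s.reader_count == 0) {
        restored_status = s.original_status;  // last reader out: restore
        return true;
    }
    return false;  // other readers remain; the replica stays read-locked
}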

alanking commented 4 years ago

After some more thought and discussion, we have a slightly tweaked version of the above proposal:

We shall implement 2 new replica statuses: read lock and write lock.
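For illustration, the full set of statuses might be enumerated as below; the good and stale values are the existing ones, while the numeric values chosen for the remaining states are assumptions:

// Sketch only -- the numeric values for the lock states are illustrative.
enum class replica_status : int
{
    stale        = 0,  // existing: no longer current (no '&')
    good         = 1,  // existing: at rest and current ('&')
    intermediate = 2,  // the state being introduced in #4343
    read_locked  = 3,  // proposed: opened for read
    write_locked = 4   // proposed: another agent holds the object open for write
};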

When an object is opened, the caller (ideally on the server side of an API request) should provide 3 stanzas which will be given to the finalize API described in #4331. These will describe the list of replicas and their statuses (and possibly other information), to be applied atomically as part of the finalize process when the operation succeeds, fails, or needs to be "rolled back". In this way, the original status of every replica is preserved, eliminating the need for a 2nd read lock, and there is a path forward for write-locked replicas whose finalized statuses are uncertain (e.g. a failure, or no bytes written).

The JSON format could look something like this...

{
    "data_id": 10116,
    "previous": [
        {
            "repl_num": 0,
            "status": 1
        },
        ...
    ],
    "success": [
        {
            "repl_num": 0,
            "status": 0
        },
        ...
    ],
    "failure": [
        {
            "repl_num": 0,
            "status": 0
        },
        ...
    ]
}
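For illustration, a sketch of how a caller might assemble these stanzas and pick the one to apply at finalize time, assuming nlohmann::json; the function names are hypothetical, and the actual finalize call is the API described in #4331:

#include <nlohmann/json.hpp>

using json = nlohmann::json;

enum class outcome { success, failure, rollback };

// Sketch only: build all three stanzas before the operation begins,
// mirroring the replica values in the example above.
json make_finalize_input(int data_id)
{
    json input;
    input["data_id"] = data_id;
    input["previous"] = json::array({ json{{"repl_num", 0}, {"status", 1}} });
    input["success"]  = json::array({ json{{"repl_num", 0}, {"status", 0}} });
    input["failure"]  = json::array({ json{{"repl_num", 0}, {"status", 0}} });
    return input;
}

// Select the stanza matching how the operation ended.
const json& stanza_for_outcome(const json& input, outcome o)
{
    switch (o) {
        case outcome::success: return input.at("success");
        case outcome::failure: return input.at("failure");
        default:               return input.at("previous");  // roll back to pre-open statuses
    }
}

Because "previous" records the statuses from before the open, rolling back is simply a matter of applying that stanza, which is what removes the need for a second read lock.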