irods/irods

Open Source Data Management Software
https://irods.org
BSD 3-Clause "New" or "Revised" License

Add zone-wide logical locking on data object writes/creates #3848

Closed: alanking closed this issue 3 years ago

alanking commented 6 years ago

There is a possibility of a race condition when clients write to the same logical path with a force-flag-enabled iput. Example scenario: Client 1 begins a forced overwrite (iput -f) of a data object on a replication resource and, before its replication completes, Client 2 begins a second forced overwrite of the same data object, triggering a replication of its own.

The goal of this issue is to prevent the replication operation of Client 2 in this scenario. This should be done with the following approach:

When a data object is force-put (iput -f) to a replication resource, we should check whether the data object already has replicas in that resource tree. If so, we should set a flag indicating that the data object should not be replicated.

There should be a context string configuration option on the replication resource, called skip_replication_on_overwrite, to enable this behavior.

Because the extra replication is prevented, when Client 2 completes its transfer the other replicas (presumably from Client 1) will not be overwritten, but they should be marked stale (no &). Any subsequent rebalance will then repave the stale replicas. Please include tests for this.
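A minimal sketch of the proposed check, with hypothetical names (the real logic would live in the replication resource plugin):

// Sketch only -- names are hypothetical, not the actual plugin interface.
// Decide whether the replication resource should skip its replication
// step after a forced overwrite, per the proposal above.
bool should_skip_replication(bool is_forced_overwrite,
                             bool replica_exists_in_resource_tree,
                             bool skip_on_overwrite_enabled)
{
    // Skip only when this is an overwrite (iput -f), the data object
    // already has a replica somewhere in this resource tree, and the
    // administrator opted in via skip_replication_on_overwrite.
    return is_forced_overwrite
        && replica_exists_in_resource_tree
        && skip_on_overwrite_enabled;
}

When the check returns true, the sibling replicas are left in place but marked stale, as described above.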

trel commented 6 years ago

The scenario that created this issue involved a replication resource with a random resource as a child. When Client 2 finished, a new 'extra' replica was created because the random child picked a new destination (different from the replica location that same random resource selected for Client 1).

The key-value pair should probably be skip_replication_on_overwrite=true to trigger this behavior.
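Context strings are set administratively (iadmin modresc <rescName> context <newContext>) and are conventionally semicolon-delimited key=value pairs. A self-contained sketch, assuming that convention, of checking the proposed option:

#include <map>
#include <sstream>
#include <string>

// Sketch only: parse a context string of the form "key1=val1;key2=val2"
// and report whether the proposed skip_replication_on_overwrite option
// is enabled.
std::map<std::string, std::string> parse_context_string(const std::string& ctx)
{
    std::map<std::string, std::string> kvp;
    std::istringstream ss{ctx};
    std::string pair;
    while (std::getline(ss, pair, ';')) {
        const auto eq = pair.find('=');
        if (eq != std::string::npos) {
            kvp[pair.substr(0, eq)] = pair.substr(eq + 1);
        }
    }
    return kvp;
}

bool skip_replication_on_overwrite(const std::string& ctx)
{
    const auto kvp = parse_context_string(ctx);
    const auto it = kvp.find("skip_replication_on_overwrite");
    return it != kvp.end() && it->second == "true";
}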

keithj commented 6 years ago

@jc18 has asked me to comment on our use of multiple clients writing (with force) concurrently to the same data object. While it's useful for us to know of this issue, we don't believe that it affects our use of iRODS because of the control we have over the way our iRODS clients are launched and over which data are processed concurrently.

Our clients each operate on a single sequencing run at a time. Sequencing data in a run have globally unique file paths both on local disk and as iRODS data objects. Clients are launched by LSF using a pre-exec condition which checks that no other jobs are already running on any data from that run. As each client is single-threaded and uses a single connection to iRODS, we avoid the condition described above.

alanking commented 6 years ago

This enhancement was meant to address the problem in #3665 but it is clear that it will not help there. As such, this issue is no longer urgent and will be bumped to a later version.

alanking commented 4 years ago

The race condition described above will be addressed by what we are calling zone-wide logical locking on data objects. 3 new replica statuses will be needed in addition to the 3rd state being introduced in #4343.

In order to know which status a replica should return to when a read lock is released, we need to use the data_status column to store a reference count.
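A sketch of the reference-count idea (the structure and function names here are hypothetical; the proposal keeps the count, and the status to restore, in the data_status column):

// Sketch only: a replica under read lock must remember which status to
// return to when the last reader releases it, hence the reference count.
struct replica_lock_state
{
    int original_status = 1;  // status to restore (e.g. 1 == good, 0 == stale)
    int reader_count = 0;     // reference count kept in data_status
};

void acquire_read_lock(replica_lock_state& s, int current_status)
{
    if (s.reader_count == 0) {
        s.original_status = current_status;  // remember the pre-lock status
    }
    ++s.reader_count;
}

bool release_read_lock(replica_lock_state& s, int& restored_status)
{
    --s.reader_count;
    if (s.reader_count == 0) {
        restored_status = s.original_status;  // last reader out: restore
        return true;
    }
    return false;  // other readers remain; the replica stays read-locked
}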

alanking commented 4 years ago

After some more thought and discussion, we have a slightly tweaked version of the above proposal:

We shall implement 2 new replica statuses: read lock and write lock.
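For illustration, the full set of statuses might be enumerated as below; the good and stale values are the existing ones, while the numeric values chosen for the remaining states are assumptions:

// Sketch only -- the numeric values for the lock states are illustrative.
enum class replica_status : int
{
    stale        = 0,  // existing: no longer current (no '&')
    good         = 1,  // existing: at rest and current ('&')
    intermediate = 2,  // the state being introduced in #4343
    read_locked  = 3,  // proposed: opened for read
    write_locked = 4   // proposed: another agent holds the object open for write
};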

When an object is opened, the caller (ideally on the server side of an API request) should provide 3 stanzas which will be given to the finalize API described in #4331. These will describe the list of replicas and their statuses (and possibly other information), to be applied atomically as part of the finalize process when the operation succeeds, fails, or needs to be "rolled back". In this way, the original status of every replica is preserved, eliminating the need for a 2nd read lock, and there is a path forward for write-locked replicas whose finalized statuses are uncertain (e.g. a failure, or no bytes written).

The JSON format could look something like this...

{
    "data_id": 10116,
    "previous": [
        {
            "repl_num": 0,
            "status": 1
        },
        ...
    ],
    "success": [
        {
            "repl_num": 0,
            "status": 0
        },
        ...
    ],
    "failure": [
        {
            "repl_num": 0,
            "status": 0
        },
        ...
    ]
}
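For illustration, a sketch of how a caller might assemble these stanzas and pick the one to apply at finalize time, assuming nlohmann::json; the function names are hypothetical, and the actual finalize call is the API described in #4331:

#include <nlohmann/json.hpp>

using json = nlohmann::json;

enum class outcome { success, failure, rollback };

// Sketch only: build all three stanzas before the operation begins,
// mirroring the replica values in the example above.
json make_finalize_input(int data_id)
{
    json input;
    input["data_id"] = data_id;
    input["previous"] = json::array({ json{{"repl_num", 0}, {"status", 1}} });
    input["success"]  = json::array({ json{{"repl_num", 0}, {"status", 0}} });
    input["failure"]  = json::array({ json{{"repl_num", 0}, {"status", 0}} });
    return input;
}

// Select the stanza matching how the operation ended.
const json& stanza_for_outcome(const json& input, outcome o)
{
    switch (o) {
        case outcome::success: return input.at("success");
        case outcome::failure: return input.at("failure");
        default:               return input.at("previous");  // roll back to pre-open statuses
    }
}

Because "previous" records the statuses from before the open, rolling back is simply a matter of applying that stanza, which is what removes the need for a second read lock.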