google / tensorstore

Library for reading and writing large multi-dimensional arrays.
https://google.github.io/tensorstore/

Are write operations with the zarr Driver guaranteed to be thread- and process-safe? #198

Open xantho09 opened 1 week ago

xantho09 commented 1 week ago

Suppose I have an existing on-disk Zarr array. If I were to have two separate processes that:

  1. Open this Zarr array via tensorstore.open
  2. Write to separate regions that potentially share the same chunks within the Zarr array

Are these two write operations guaranteed to write correctly?

For example, suppose my.zarr has a chunk shape of (64,64,64).

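# Process 1
import tensorstore as ts

path = "path/to/my.zarr"
arr = ts.open(
    {
        "driver": "zarr",
        "kvstore": {"driver": "file", "path": path},
    },
    open=True,
    read=True,
    write=True,
    create=False,
).result()

# Writes the first half (along the last axis) of the chunk at the origin.
arr[(0,0,0):(64,64,32)] = 100

# Process 2
import tensorstore as ts

path = "path/to/my.zarr"
arr = ts.open(...) # Same as Process 1

# Writes the second half of the same chunk.
arr[(0,0,32):(64,64,64)] = 200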

The only mention I could find was on the homepage, under the list of highlights:

Supports safe, efficient access from multiple processes and machines via optimistic concurrency.

And some basic testing seems to suggest that this is indeed true.

However, is this guaranteed to be the case? Is there anything within the documentation that provides this guarantee?

P.S. Out of curiosity, how is the OCC actually implemented? Does it check the last-modified time of the Zarr chunk being written, or something along those lines?

P.P.S. Great library, by the way.
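
For readers unfamiliar with the term in the P.S., optimistic concurrency control in general means: read a value together with a version marker, compute the update, and write it back only if the version is unchanged, retrying otherwise. The sketch below is a generic in-memory illustration of that pattern only. `VersionedStore`, `write_if_unchanged`, `update_chunk`, and `fill_range` are hypothetical names, and nothing here is taken from TensorStore's implementation.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    """A value together with the generation (version) it was stored at."""
    value: bytes
    generation: int


class VersionedStore:
    """Hypothetical in-memory key-value store with conditional writes."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # guards only the compare-and-swap step

    def read(self, key):
        with self._lock:
            return self._data.get(key, Versioned(b"", 0))

    def write_if_unchanged(self, key, new_value, expected_generation):
        """Store new_value only if key is still at expected_generation."""
        with self._lock:
            current = self._data.get(key, Versioned(b"", 0))
            if current.generation != expected_generation:
                return False  # another writer got there first
            self._data[key] = Versioned(new_value, current.generation + 1)
            return True


def update_chunk(store, key, modify):
    """Read-modify-write loop; returns the number of attempts needed."""
    attempts = 0
    while True:
        attempts += 1
        snapshot = store.read(key)
        new_value = modify(snapshot.value)
        if store.write_if_unchanged(key, new_value, snapshot.generation):
            return attempts


def fill_range(start, stop, byte, chunk_size=8):
    """Build a modify function that overwrites bytes [start, stop) of a chunk."""
    def modify(old):
        buf = bytearray(old.ljust(chunk_size, b"\x00"))
        buf[start:stop] = bytes([byte]) * (stop - start)
        return bytes(buf)
    return modify


# Two writers update different halves of the same 8-byte "chunk".
store = VersionedStore()
writers = [
    threading.Thread(target=update_chunk, args=(store, "chunk0", fill_range(0, 4, 100))),
    threading.Thread(target=update_chunk, args=(store, "chunk0", fill_range(4, 8, 200))),
]
for t in writers:
    t.start()
for t in writers:
    t.join()
```

Because a writer that loses the race re-reads the chunk before retrying, neither half is overwritten by a stale copy, whichever writer finishes first.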

laramiel commented 1 week ago

To achieve this, you need to use transactions. Note, however, that this will not work with the s3 driver.

xantho09 commented 1 week ago

I see...

So something like this would be sufficient. Is that correct?

# Process 1
import tensorstore as ts

path = "path/to/my.zarr"
arr = ts.open(
    {
        "driver": "zarr",
        "kvstore": {"driver": "file", "path": path},
    },
    open=True,
    read=True,
    write=True,
    create=False,
).result()

# The transaction is committed when the with block exits.
with ts.Transaction() as txn:
    arr.with_transaction(txn)[(0,0,0):(64,64,32)] = 100

# Process 2
import tensorstore as ts

path = "path/to/my.zarr"
arr = ts.open(...) # Same as Process 1

with ts.Transaction() as txn:
    arr.with_transaction(txn)[(0,0,32):(64,64,64)] = 200

I do have some additional related questions:

Please and thank you.