google / tensorstore

Library for reading and writing large multi-dimensional arrays.
https://google.github.io/tensorstore/

distributed appending? (like `zarr.append()`) #72

Open kyoungrok0517 opened 1 year ago

kyoungrok0517 commented 1 year ago

Hello. First I'd like to refer to my previous issue #67, which explains my use case.

I want to append embeddings to a tensorstore from multiple processes or pods. There is `append()` in zarr, but I couldn't find an equivalent function in tensorstore. How can I achieve something similar in tensorstore? Below is my writing code. Thanks!

    def write_on_batch_end(
        self,
        trainer,
        pl_module,
        prediction: Any,
        batch_indices: List[int],
        batch: Any,
        batch_idx: int,
        dataloader_idx: int,
    ):
        embed = prediction

        # append to storage
        # this function is called from multiple processes
        ...
jbms commented 1 year ago

"Append" basically corresponds to resizing the bounds of one dimension, and then writing to the new portion.

You can do these steps separately in TensorStore currently (resize, then write), but that will not work correctly if done concurrently from multiple machines. Note that zarr-python does not have any special atomic append support either; `append()` is just a convenience interface for resizing and then writing.

There are a few issues with making this work correctly:

asparsa commented 3 weeks ago

Is there a resize function for the zarr3 driver in C++? How does it work? Can I change the number of dimensions with it? I have a 1D array of shape {N^6} that needs to be reshaped to {N^3, N^3} and to {N^2, N^2, N^2}.