google / tensorstore

Library for reading and writing large multi-dimensional arrays.
https://google.github.io/tensorstore/

distributed appending? (like `zarr.append()`) #72

Open kyoungrok0517 opened 1 year ago

kyoungrok0517 commented 1 year ago

Hello. First I'd like to refer to my previous issue #67, which explains my use case.

I want to append embeddings to a tensorstore from multiple processes or pods. There is `append()` in zarr, but I couldn't find an equivalent function in tensorstore. How can I achieve something similar in tensorstore? Below is my writing code. Thanks!

    def write_on_batch_end(
        self,
        trainer,
        pl_module,
        prediction: Any,
        batch_indices: List[int],
        batch: Any,
        batch_idx: int,
        dataloader_idx: int,
    ):
        embed = prediction

        # append to storage
        # this function is called from multiple processes
        ...
jbms commented 1 year ago

"Append" basically corresponds to resizing the bounds of one dimension, and then writing to the new portion.

You can do these steps separately in TensorStore currently (resize, then write), but that will not work correctly if done concurrently from multiple machines. Note that zarr-python does not have any special atomic append support either; `append()` is just a convenience interface for resizing and then writing.

There are a few issues with making this work correctly:

asparsa commented 3 weeks ago

Is there a resize function for the zarr3 driver in C++? How does it work? Can I change the number of dimensions with it? I have a 1D array of shape {N^6} that needs to be reshaped to {N^3, N^3} and to {N^2, N^2, N^2}.