Open franz101 opened 2 years ago
Hi ! We're using the Arrow format for the datasets, and PyArrow tensors are not part of the Arrow format AFAIK:
There is no direct support in the arrow columnar format to store Tensors as column values.
source: https://github.com/apache/arrow/issues/4802#issuecomment-508494694
@wesm @rok it's been around three years. Any updates regarding dataset Arrow tensor support? 🙏 I know you must be very busy, but I would appreciate learning what the state of the art is. I saw that PR #8510 is still open.
Hey @franz101 & @lhoestq! There is a plan and a PR to create an ExtensionArray of Tensors of equal sizes as well as a plan to do the same for Tensors of different sizes ARROW-8714.
The work stalled a little because it was not clear where TensorArray would live. However, the Arrow community recently agreed to create a well-known-extension-type document, and I would like https://github.com/apache/arrow/pull/8510 to land there and add an implementation to C++/Python + another language. Is that something you would find beneficial?
That is a great update, thank you. It looks like this feature would benefit the datasets implementation of ArrayExtensionArray. Is that correct @eladsegal @lhoestq?
TensorArray sounds great ! Looking forward to it :)
We've had our own ExtensionArray for fixed shape tensors for a while now, hoping to see something more standardized by the arrow community.
Also super interested in the extension array for tensors of different sizes cc @mariosasko
FixedShapeTensor ExtensionType was merged and will be in Arrow 12.0.0 (release is planned mid April).
@rok Thanks for keeping us updated! I think it's best to introduce a new feature type that would use this extension type under the hood. I'll create an issue to discuss the design with the community in the coming days.
Also, is there a tentative time frame for the variable-shape Tensor extension type?
@mariosasko please tag me in the discussion, perhaps I can contribute.
As for the variable shape tensor array - I'd be interested in working on it but didn't see much interest in the community yet. Are you saying huggingface/datasets could use it?
pyarrow 12 is out 🎉, will have a look if I can work on it for the ExtensionArray
I think these two issues need to be fixed first on the Arrow side before adding the tensor feature type here: https://github.com/apache/arrow/issues/35573 and https://github.com/apache/arrow/issues/35599.
@rok We've had a couple of requests for supporting variable-shape tensors on the forum/GH, but I did not manage to find the concrete issues using the search. TF/TFDS (and PyTorch with the nested_tensor API) support them, so it makes sense for us to do the same eventually (the Ray project has an extension type to support this case)
That does make sense indeed. We should probably also be careful about memory layout to enable zero-copy interface to TF/PyTorch.
So there is no way we can use pyarrow.Tensor ?
Not with the Arrow format, and therefore not in datasets. But they released a new FixedShapeTensorArray to store tensors in Arrow format. We plan to support this in datasets at some point!
There is also an open issue to enable the conversion of pyarrow.Tensor to pyarrow.FixedShapeTensorType: https://github.com/apache/arrow/issues/35068. This way one could indirectly use pyarrow.Tensor in Arrow format.
We started a mailing list discussion about a potential VariableShapeTensor extension array, please check it out and give feedback. For more details, here's also a PR: https://github.com/apache/arrow/pull/37166.
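To make the idea of a variable-shape tensor column concrete, here is a rough pure-Python sketch. It assumes a layout where each row keeps a flat, row-major `data` list plus its own `shape` list; the helper names `flatten` and `unflatten` are hypothetical, the inputs are assumed to be non-empty rectangular nested lists, and the linked PR remains the authority on the actual design.

```python
from math import prod


def flatten(nested):
    """Return (values, shape) for a rectangular nested list, row-major order."""
    shape = []
    level = nested
    while isinstance(level, list):
        shape.append(len(level))
        level = level[0]
    values = nested
    for _ in shape[1:]:
        # Remove one level of nesting per remaining dimension.
        values = [x for row in values for x in row]
    return list(values), shape


def unflatten(values, shape):
    """Rebuild the nested list from flat row-major values and a shape."""
    if len(shape) == 1:
        return list(values)
    step = prod(shape[1:])
    return [
        unflatten(values[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    ]


# Two tensors with the same number of dimensions but different shapes.
tensors = [[[1, 2, 3], [4, 5, 6]], [[7], [8]]]

# One row per tensor: flat data plus that row's shape.
column = [dict(zip(("data", "shape"), flatten(t))) for t in tensors]

restored = [unflatten(row["data"], row["shape"]) for row in column]
```

Keeping the values flat and contiguous per row is what would make a zero-copy hand-off to TF/PyTorch plausible, as raised earlier in the thread.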
May I kindly ask what the recent progress is?
Feature request
I was going through the discussion on converting tensors to lists. Is there a way to leverage pyarrow's Tensors for nested arrays / embeddings?
For example:
Apache docs
Maybe this belongs in the pyarrow features / repo.
Motivation
Working with big data, we need to make sure we use the best data structures and IO available.
Your contribution
Can try to open a PR if code changes are necessary