huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Use pyarrow Tensor dtype #5272

Open franz101 opened 2 years ago

franz101 commented 2 years ago

Feature request

I was going through the discussion about converting tensors to lists. Is there a way to leverage pyarrow's Tensors for nested arrays / embeddings?

For example:

import pyarrow as pa
import numpy as np
x = np.array([[2, 2, 4], [4, 5, 100]], np.int32)
pa.Tensor.from_numpy(x, dim_names=["dim1","dim2"])

Apache docs

Maybe this belongs in the pyarrow features / repo.

Motivation

When working with big data, we need to make sure we use the best data structures and I/O available.

Your contribution

I can try to open a PR if code changes are necessary.

lhoestq commented 2 years ago

Hi ! We're using the Arrow format for the datasets, and PyArrow tensors are not part of the Arrow format AFAIK:

There is no direct support in the arrow columnar format to store Tensors as column values.

source: https://github.com/apache/arrow/issues/4802#issuecomment-508494694

franz101 commented 2 years ago

@wesm @rok It's been around three years. Are there any updates regarding Arrow tensor support for datasets? 🙏 I know you must be very busy, but I would appreciate learning what the state of the art is. I saw that PR #8510 is still open.

rok commented 2 years ago

Hey @franz101 & @lhoestq! There is a plan and a PR to create an ExtensionArray of Tensors of equal sizes, as well as a plan to do the same for Tensors of different sizes (ARROW-8714).

rok commented 2 years ago

The work stalled a little because it was not clear where TensorArray would live. However, the Arrow community recently agreed to create a well-known-extension-type document, and I would like https://github.com/apache/arrow/pull/8510 to land there and add an implementation to C++/Python + another language. Is that something you would find beneficial?

franz101 commented 2 years ago

That is a great update, thank you. It looks like this feature would benefit the datasets implementation of ArrayExtensionArray. Is that correct @eladsegal @lhoestq?

lhoestq commented 2 years ago

TensorArray sounds great ! Looking forward to it :)

We've had our own ExtensionArray for fixed shape tensors for a while now, hoping to see something more standardized by the arrow community.

Also super interested in the extension array for tensors of different sizes cc @mariosasko

rok commented 1 year ago

FixedShapeTensor ExtensionType was merged and will be in Arrow 12.0.0 (release is planned mid April).

mariosasko commented 1 year ago

@rok Thanks for keeping us updated! I think it's best to introduce a new feature type that would use this extension type under the hood. I'll create an issue to discuss the design with the community in the coming days.

Also, is there a tentative time frame for the variable-shape Tensor extension type?

rok commented 1 year ago

@mariosasko please tag me in the discussion, perhaps I can contribute.

As for the variable-shape tensor array: I'd be interested in working on it, but I haven't seen much interest from the community yet. Are you saying huggingface/datasets could use it?

franz101 commented 1 year ago

pyarrow 12 is out 🎉. I'll have a look and see if I can work on it for the ExtensionArray.

mariosasko commented 1 year ago

I think these two issues need to be fixed first on the Arrow side before adding the tensor feature type here: https://github.com/apache/arrow/issues/35573 and https://github.com/apache/arrow/issues/35599.

@rok We've had a couple of requests for supporting variable-shape tensors on the forum/GH, but I did not manage to find the concrete issues using the search. TF/TFDS (and PyTorch with the nested_tensor API) support them, so it makes sense for us to do the same eventually (the Ray project has an extension type to support this case)

rok commented 1 year ago

@rok We've had a couple of requests for supporting variable-shape tensors on the forum/GH, but I did not manage to find the concrete issues using the search. TF/TFDS (and PyTorch with the nested_tensor API) support them, so it makes sense for us to do the same eventually (the Ray project has an extension type to support this case)

That does make sense indeed. We should probably also be careful about memory layout to enable zero-copy interface to TF/PyTorch.

hfawaz commented 1 year ago

So there is no way we can use pyarrow.Tensor ?

lhoestq commented 1 year ago

Not with the Arrow format, and therefore not in datasets. But they released a new FixedShapeTensorArray to store tensors in Arrow format. We plan to support this in datasets at some point !

AlenkaF commented 1 year ago

There is also an open issue to enable the conversion of pyarrow.Tensor to pyarrow.FixedShapeTensorType: https://github.com/apache/arrow/issues/35068. This way one could indirectly use pyarrow.Tensor in Arrow format.

rok commented 1 year ago

We started a mailing list discussion about potential VariableShapeTensor extension array, please check it out and give feedback. For more details here's also a PR https://github.com/apache/arrow/pull/37166.

npuichigo commented 1 week ago

May I ask what the recent progress is?