Open AmitMY opened 3 years ago
Yes right now ArrayXD can only be used as a column feature type, not a subtype.
With the current Arrow limitations I don't think we'll be able to make it work as a subtype. However, it should be possible to allow dimensions of dynamic size (Array3D(shape=(None, 137, 2), dtype="float32") for example), since the underlying Arrow type allows dynamic sizes.
For now I'd suggest using nested Sequence types. Once we have dynamic sizes you can update the dataset.
What do you think?
Yes right now ArrayXD can only be used as a column feature type, not a subtype.
Meaning it can't be nested under Sequence?
If so, for now I'll just make it a Python list and build it with the nested Sequence type you suggested.
Yeah, unfortunately. That's a current limitation of Arrow ExtensionTypes: they can't be used in the default Arrow Array objects. We already have an ExtensionArray that allows us to use them as column types, but not as subtypes. Maybe we can extend it; I haven't experimented with that yet.
Cool. So please consider this issue a feature request for:
Array3D(shape=(None, 137, 2), dtype="float32")
It's a way to represent videos, poses, and other cool sequences.
@lhoestq well, so sequence of sequences doesn't work either...
pyarrow.lib.ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648
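To put the numbers in that error into perspective, here is a rough back-of-the-envelope sketch. It assumes that each child element is a single float value and that every frame holds 137 x 2 values; both are assumptions for illustration and are not confirmed by the traceback.

```python
# Limit and actual count, copied from the error message above.
limit = 2147483646      # maximum number of child elements in one list array
have = 2147483648       # number of child elements that was requested

# The limit is two below 2**31, and the request is exactly 2**31.
assert limit == 2**31 - 2
assert have == 2**31

# Assumption: one frame is 137 keypoints x 2 coordinates, one value each.
values_per_frame = 137 * 2

# Number of whole frames that fit in a single list array under this assumption.
max_frames = limit // values_per_frame
print(max_frames)
```

Under these assumptions a single array overflows after a few million frames in total, which is why writing smaller batches is the first thing to try.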
Working with Arrow can be quite fun sometimes. You can fix this issue by trying to reduce the writer batch size (same trick as the one used to reduce the RAM usage in https://github.com/huggingface/datasets/issues/741).
Let me know if it works. I haven't investigated https://github.com/huggingface/datasets/issues/741 yet since I was preparing this week's sprint to add datasets, but it is on my priority list for early next week.
The batch size fix doesn't work, neither for #741 nor for the dataset I'm trying (DGS corpus). Loading the DGS corpus takes 400GB of RAM, which is fine with me as my machine is large enough.
Sorry it doesn't work. I'll let you know once I've fixed it.
Hi @lhoestq, any update on dynamic sized arrays (Array3D(shape=(None, 137, 2), dtype="float32"))?
Not yet, I've been pretty busy with the dataset sprint lately but this is something that's been asked several times already. So I'll definitely work on this as soon as I'm done with the sprint and with the RAM issue you reported.
Hi @lhoestq,
Any chance you have some updates on supporting ArrayXD as a subtype, or on support for dynamic sized arrays?
e.g.:
datasets.features.Sequence(datasets.features.Array2D(shape=(137, 2), dtype="float32"))
Array3D(shape=(None, 137, 2), dtype="float32")
Hi! We haven't worked on this lately and it's not in our very short-term roadmap, since it requires a bit of work to make it work with Arrow. Though this will definitely be added at some point.
@lhoestq, thanks for the update.
I actually tried to modify some pieces of code to make it work. Can you please tell me if I'm missing anything here?
I think that for the vast majority of cases it's enough to make the first dimension of the array dynamic, i.e. shape=(None, 100, 100). For that, it's enough to modify the class ArrayExtensionArray to output a list of arrays of different sizes instead of a list of arrays of the same size (current version).
Below are my modifications of this class.
import numpy as np
import pyarrow as pa

# _is_zero_copy_only is the helper already used by the existing class.


class ArrayExtensionArray(pa.ExtensionArray):
    def __array__(self):
        zero_copy_only = _is_zero_copy_only(self.storage.type)
        return self.to_numpy(zero_copy_only=zero_copy_only)

    def __getitem__(self, i):
        return self.storage[i]

    def to_numpy(self, zero_copy_only=True):
        storage: pa.ListArray = self.storage
        size = 1
        for i in range(self.type.ndims):
            size *= self.type.shape[i]
            storage = storage.flatten()
        numpy_arr = storage.to_numpy(zero_copy_only=zero_copy_only)
        numpy_arr = numpy_arr.reshape(len(self), *self.type.shape)
        return numpy_arr

    def to_list_of_numpy(self, zero_copy_only=True):
        storage: pa.ListArray = self.storage
        shape = self.type.shape
        arrays = []
        # only the first dimension may be dynamic
        for dim in range(1, self.type.ndims):
            assert shape[dim] is not None, f"Support only dynamic size on first dimension. Got: {shape}"

        first_dim_offsets = np.array([off.as_py() for off in storage.offsets])
        for i in range(len(storage)):
            storage_el = storage[i:i+1]
            first_dim = first_dim_offsets[i+1] - first_dim_offsets[i]
            # flatten storage
            for dim in range(self.type.ndims):
                storage_el = storage_el.flatten()

            numpy_arr = storage_el.to_numpy(zero_copy_only=zero_copy_only)
            arrays.append(numpy_arr.reshape(first_dim, *shape[1:]))

        return arrays

    def to_pylist(self):
        zero_copy_only = _is_zero_copy_only(self.storage.type)
        if self.type.shape[0] is None:
            return self.to_list_of_numpy(zero_copy_only=zero_copy_only)
        else:
            return self.to_numpy(zero_copy_only=zero_copy_only).tolist()
I ran a few tests and it works as expected. Let me know what you think.
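The core idea of to_list_of_numpy can be sketched with NumPy alone: one flat buffer of values plus a list of first-dimension offsets is enough to recover arrays whose first dimension differs. The function name split_by_first_dim and the toy shapes below are hypothetical and are not part of the datasets library; this only illustrates the offset arithmetic, not the Arrow storage itself.

```python
import numpy as np

def split_by_first_dim(flat, offsets, inner_shape):
    """Split a flat buffer into arrays of shape (n_i, *inner_shape).

    offsets[i] and offsets[i + 1] delimit example i along the first dimension.
    """
    inner_size = int(np.prod(inner_shape))
    arrays = []
    for start, stop in zip(offsets[:-1], offsets[1:]):
        first_dim = stop - start
        chunk = flat[start * inner_size : stop * inner_size]
        arrays.append(chunk.reshape(first_dim, *inner_shape))
    return arrays

# Toy data: three examples with 2, 1 and 3 "frames" of shape (3, 2).
inner_shape = (3, 2)
offsets = [0, 2, 3, 6]
flat = np.arange(6 * 3 * 2, dtype="float32")

arrays = split_by_first_dim(flat, offsets, inner_shape)
print([a.shape for a in arrays])
```

Because each chunk is a slice of the same buffer followed by a reshape, NumPy can return views rather than copies here.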
Thanks for diving into this !
Indeed, focusing on making the first dimension dynamic makes total sense (and users could still re-order their dimensions to match this constraint). Your code looks great :) I think it can even be extended to support several dynamic dimensions if we want to.
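As a small illustration of the re-ordering remark, here is a sketch using NumPy. It assumes a hypothetical array stored as (keypoints, frames, coordinates), where the variable-length axis is the second one; the shapes are made up for the example.

```python
import numpy as np

# Hypothetical layout: 137 keypoints, 5 frames (the variable axis), 2 coordinates.
clip = np.zeros((137, 5, 2), dtype="float32")

# Move the variable-length axis to the front so only the first dimension is dynamic.
clip_frames_first = np.moveaxis(clip, 1, 0)
print(clip_frames_first.shape)
```

The same call with the arguments swapped, np.moveaxis(clip_frames_first, 0, 1), restores the original layout after loading.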
Feel free to open a PR to include these changes, then we can update our test suite to make sure it works in all use cases. In particular I think we might need a few tweaks to allow it to be converted to pandas (though I haven't tested yet):
from datasets import Dataset, Features, Array3D

# this works
matrix = [[1, 0], [0, 1]]
features = Features({"a": Array3D(dtype="int32", shape=(1, 2, 2))})
d = Dataset.from_dict({"a": [[matrix], [matrix]]}, features=features)
print(d.to_pandas())

# this should work as well
matrix = [[1, 0], [0, 1]]
features = Features({"a": Array3D(dtype="int32", shape=(None, 2, 2))})
d = Dataset.from_dict({"a": [[matrix], [matrix] * 2]}, features=features)
print(d.to_pandas())
I'll be happy to help you on this :)
I set up a new dataset, with a sequence of arrays (really, I want to have an array of (None, 137, 2), and the first dimension is dynamic)
But this doesn't work -