huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

pyarrow.lib.ArrowNotImplementedError: MakeBuilder: cannot construct builder for type extension<arrow.py_extension_type> #887

Open · AmitMY opened this issue 3 years ago

AmitMY commented 3 years ago

I set up a new dataset with a sequence of arrays (really, I want an array of shape (None, 137, 2), where the first dimension is dynamic):

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            # This defines the different columns of the dataset and their types
            features=datasets.Features(
                {
                    "pose": datasets.features.Sequence(datasets.features.Array2D(shape=(137, 2), dtype="float32"))
                }
            ),
            homepage=_HOMEPAGE,
            citation=_CITATION,
        )

    def _generate_examples(self):
        """ Yields examples. """

        yield 1, {
            "pose": [np.zeros(shape=(137, 2), dtype=np.float32)]
        }

But this doesn't work:

pyarrow.lib.ArrowNotImplementedError: MakeBuilder: cannot construct builder for type extension<arrow.py_extension_type>

lhoestq commented 3 years ago

Yes, right now ArrayXD can only be used as a column feature type, not a subtype. With the current Arrow limitations I don't think we'll be able to make it work as a subtype; however, it should be possible to allow dimensions of dynamic sizes (Array3D(shape=(None, 137, 2), dtype="float32"), for example), since the underlying Arrow type allows dynamic sizes.
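
As a rough illustration (plain pyarrow, not the datasets API): the storage behind ArrayXD is nested Arrow lists, and variable-length lists are native to Arrow, which is what would make a dynamic first dimension possible:

    import pyarrow as pa

    # A variable-length list of fixed-size lists can already hold
    # rows of shape (None, 2) -- the first dimension is dynamic.
    t = pa.list_(pa.list_(pa.float32(), 2))
    arr = pa.array([[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]], type=t)
    print(arr.type)  # list<item: fixed_size_list<item: float>[2]>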

For now I'd suggest using nested Sequence types, as sketched below. Once we have the dynamic sizes you can update the dataset. What do you think?
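
A minimal sketch of that workaround (reusing the "pose" feature name from the snippet above; plain nested Sequence/Value features in place of Array2D):

    # Sketch: three nested Sequences of float32 values stand in for
    # Sequence(Array2D(shape=(137, 2))), at the cost of losing the
    # fixed-shape metadata.
    features = datasets.Features(
        {
            "pose": datasets.features.Sequence(
                datasets.features.Sequence(
                    datasets.features.Sequence(datasets.Value("float32"))
                )
            )
        }
    )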

AmitMY commented 3 years ago

> Yes, right now ArrayXD can only be used as a column feature type, not a subtype.

Meaning it can't be nested under Sequence? If so, for now I'll just use a Python list with the nested Sequence type you suggested.

lhoestq commented 3 years ago

Yea, unfortunately... That's a current limitation of Arrow ExtensionTypes: they can't be used in the default Arrow Array objects. We already have an ExtensionArray that allows us to use them as column types, but not as subtypes. Maybe we can extend it; I haven't experimented with that yet.

AmitMY commented 3 years ago

Cool. So please consider this issue a feature request for:

Array3D(shape=(None, 137, 2), dtype="float32")

It's a way to represent videos, poses, and other cool sequences.
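
For instance, a hypothetical _generate_examples once dynamic first dimensions exist (a sketch; the frame counts are just for illustration):

    import numpy as np

    def _generate_examples(self):
        # Each example could then hold a different number of frames.
        yield 0, {"pose": np.zeros(shape=(25, 137, 2), dtype=np.float32)}
        yield 1, {"pose": np.zeros(shape=(100, 137, 2), dtype=np.float32)}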

AmitMY commented 3 years ago

@lhoestq well, so sequence of sequences doesn't work either...

pyarrow.lib.ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648

lhoestq commented 3 years ago

Working with Arrow can be quite fun sometimes. You can fix this issue by reducing the writer batch size (the same trick as the one used to reduce RAM usage in https://github.com/huggingface/datasets/issues/741).
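
For instance, in a dataset script (a sketch; the class name is hypothetical):

    import datasets

    class MyPoseDataset(datasets.GeneratorBasedBuilder):
        # Flush smaller Arrow batches so a single ListArray never exceeds
        # the 2^31 - 1 child-element cap.
        DEFAULT_WRITER_BATCH_SIZE = 100

        # ... _info, _split_generators and _generate_examples as usual ...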

Let me know if it works. I haven't investigated https://github.com/huggingface/datasets/issues/741 yet since I was preparing this week's sprint to add datasets, but this is on my priority list for early next week.

AmitMY commented 3 years ago

The batch size fix doesn't work... not for #741 and not for this dataset I'm trying (DGS corpus). Loading the DGS corpus takes 400GB of RAM, which is fine with me, as my machine is large enough.

lhoestq commented 3 years ago

Sorry it doesn't work. Will let you know once I've fixed it.

AmitMY commented 3 years ago

Hi @lhoestq, any update on dynamic-sized arrays? (Array3D(shape=(None, 137, 2), dtype="float32"))

lhoestq commented 3 years ago

Not yet, I've been pretty busy with the dataset sprint lately, but this is something that's been asked several times already. So I'll definitely work on this as soon as I'm done with the sprint and with the RAM issue you reported.

rpowalski commented 3 years ago

Hi @lhoestq, any chance you have some updates on supporting ArrayXD as a subtype, or on dynamic-sized arrays?

e.g.:

    datasets.features.Sequence(datasets.features.Array2D(shape=(137, 2), dtype="float32"))
    Array3D(shape=(None, 137, 2), dtype="float32")

lhoestq commented 3 years ago

Hi! We haven't worked on this lately and it's not on our very short-term roadmap, since it requires a bit of work to make it work with Arrow. Though this will definitely be added at some point.

rpowalski commented 3 years ago

@lhoestq, thanks for the update.

I actually tried to modify some pieces of code to make it work. Can you please tell me if I'm missing anything here? I think that for the vast majority of cases it's enough to make the first dimension of the array dynamic, i.e. shape=(None, 100, 100). For that, it's enough to modify the ArrayExtensionArray class to output a list of arrays of different sizes instead of a list of arrays of the same size (the current version). Below are my modifications of this class.

class ArrayExtensionArray(pa.ExtensionArray):
    def __array__(self):
        zero_copy_only = _is_zero_copy_only(self.storage.type)
        return self.to_numpy(zero_copy_only=zero_copy_only)

    def __getitem__(self, i):
        return self.storage[i]

    def to_numpy(self, zero_copy_only=True):
        storage: pa.ListArray = self.storage
        size = 1
        # flatten the nested ListArray one level per dimension
        for i in range(self.type.ndims):
            size *= self.type.shape[i]
            storage = storage.flatten()
        numpy_arr = storage.to_numpy(zero_copy_only=zero_copy_only)
        numpy_arr = numpy_arr.reshape(len(self), *self.type.shape)
        return numpy_arr

    def to_list_of_numpy(self, zero_copy_only=True):
        storage: pa.ListArray = self.storage
        shape = self.type.shape
        arrays = []
        # only the first dimension may be dynamic (None)
        for dim in range(1, self.type.ndims):
            assert shape[dim] is not None, f"Support only dynamic size on first dimension. Got: {shape}"

        # the ListArray offsets encode where each row starts, so consecutive
        # differences give each row's dynamic first-dimension length
        first_dim_offsets = np.array([off.as_py() for off in storage.offsets])
        for i in range(len(storage)):
            storage_el = storage[i : i + 1]
            first_dim = first_dim_offsets[i + 1] - first_dim_offsets[i]
            # flatten this row's storage down to a flat value array
            for dim in range(self.type.ndims):
                storage_el = storage_el.flatten()

            numpy_arr = storage_el.to_numpy(zero_copy_only=zero_copy_only)
            arrays.append(numpy_arr.reshape(first_dim, *shape[1:]))

        return arrays

    def to_pylist(self):
        zero_copy_only = _is_zero_copy_only(self.storage.type)
        # dynamic first dimension -> list of per-row arrays,
        # fixed shape -> one stacked array converted to nested lists
        if self.type.shape[0] is None:
            return self.to_list_of_numpy(zero_copy_only=zero_copy_only)
        else:
            return self.to_numpy(zero_copy_only=zero_copy_only).tolist()

I ran a few tests and it works as expected. Let me know what you think.
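
As a toy illustration of the offsets trick (plain pyarrow, not the patch itself):

    import numpy as np
    import pyarrow as pa

    # A ListArray's offsets encode row boundaries, so consecutive
    # differences recover each row's dynamic length.
    arr = pa.array([[1.0, 2.0, 3.0], [4.0]])  # rows of length 3 and 1
    offsets = np.array([off.as_py() for off in arr.offsets])
    print(offsets[1:] - offsets[:-1])  # [3 1]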

lhoestq commented 3 years ago

Thanks for diving into this !

Indeed, focusing on making the first dimension dynamic makes total sense (and users could still re-order their dimensions to match this constraint). Your code looks great :) I think it can even be extended to support several dynamic dimensions if we want to.

Feel free to open a PR to include these changes, then we can update our test suite to make sure it works in all use cases. In particular I think we might need a few tweaks to allow it to be converted to pandas (though I haven't tested yet):

from datasets import Dataset, Features, Array3D

# this works
matrix = [[1, 0], [0, 1]]
features = Features({"a": Array3D(dtype="int32", shape=(1, 2, 2))})
d = Dataset.from_dict({"a": [[matrix], [matrix]]}, features=features)
print(d.to_pandas())

# this should work as well
matrix = [[1, 0], [0, 1]]
features = Features({"a": Array3D(dtype="int32", shape=(None, 2, 2))})
d = Dataset.from_dict({"a": [[matrix], [matrix] * 2]}, features=features)
print(d.to_pandas())

I'll be happy to help you on this :)