vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.27k stars · 590 forks

Don't use take with arrow #2336

Open · Ben-Epstein opened this issue 1 year ago

Ben-Epstein commented 1 year ago

This directly addresses https://github.com/vaexio/vaex/issues/2335, and is the fix for https://issues.apache.org/jira/browse/ARROW-9773 (which is now https://github.com/apache/arrow/issues/33049).

I believe changing all of the .take calls to .slice would also fix https://github.com/vaexio/vaex/issues/2334, because .take copies the selected values into a new array while .slice is zero-copy. Memory explodes when exporting to hdf5 because we keep calling .take.

You can see huggingface datasets does the same thing: https://github.com/huggingface/datasets/pull/645/files

That being said, there are a number of other places where vaex uses .take that should be fixed. But because the vaex repo lacks type annotations, it's hard for me to tell which call sites operate on pyarrow arrays, which on pyarrow tables, and which on numpy arrays. I'm happy to help move the rest over, but I would need some guidance.

Here are all of the places .take is used:

./packages/vaex-core/vaex/functions.py:    values = choices.take(indices)
./packages/vaex-core/vaex/dataframe.py:                    used_keys = keys.take(codes)
./packages/vaex-core/vaex/dataframe.py:                        keys = keys.take(indices)
./packages/vaex-core/vaex/dataframe.py:    def take(self, indices, filtered=True, dropfilter=True):
./packages/vaex-core/vaex/dataframe.py:        >>> df.take([0,2])
./packages/vaex-core/vaex/dataframe.py:        df.dataset = df.dataset.take(indices)
./packages/vaex-core/vaex/dataframe.py:        return self.take(indices)
./packages/vaex-core/vaex/dataframe.py:        return self.take(indices).split(into)
./packages/vaex-core/vaex/dataframe.py:        return self.take(indices)
./packages/vaex-core/vaex/cpu.py:                keys = keys.take(indices)
./packages/vaex-core/vaex/cpu.py:        keys = keys.take(order)
./packages/vaex-core/vaex/join.py:            left = left.concat(left.take(lookup_left))
./packages/vaex-core/vaex/join.py:            left = left.take(left_indices_matched, filtered=False, dropfilter=False)
./packages/vaex-core/vaex/join.py:            right_dataset = right_dataset.take(lookup, masked=any(lookup_masked))
./packages/vaex-core/vaex/dataset.py:    def take(self, indices, masked=False):
./packages/vaex-core/vaex/dataset.py:        return DatasetTake(self, indices, masked=masked)
./packages/vaex-core/vaex/dataset.py:class DatasetTake(DatasetDecorator):
./packages/vaex-core/vaex/hash.py:        keys = pa.compute.take(keys, indices)
./packages/vaex-core/vaex/array_types.py:def take(ar, indices):
./packages/vaex-core/vaex/array_types.py:    return ar.take(indices)
./packages/vaex-core/vaex/groupby.py:                        self.bin_values = pa.compute.take(self.bin_values, self.sort_indices)
./packages/vaex-core/vaex/groupby.py:                        bin_values[field.name] = ar.take(indices)
./packages/vaex-core/vaex/groupby.py:                    bin_values[parent.label] = parent.bin_values.take(indices)
./packages/vaex-core/vaex/groupby.py:            self.bin_values = pa.compute.take(self.bin_values, sort_indices)
./packages/vaex-core/vaex/groupby.py:            values = pa.compute.take(values, indices)
./packages/vaex-core/vaex/groupby.py:                        ar = np.take(ar, sort_indices, axis=i)
./packages/vaex-core/vaex/groupby.py:                            columns[by.label] = vaex.array_types.take(by.bin_values, index)
./packages/vaex-core/vaex/groupby.py:                        columns[by.label] = by.bin_values.take(indices)
./packages/vaex-core/vaex/column.py:            ar = ar_unfiltered.take(vaex.array_types.to_arrow(take_indices))
./packages/vaex-core/vaex/column.py:        x = x.dictionary.take(x.indices)  # equivalent to PyArrow 5.0.0's dictionary_decode() but backwards compatible
maartenbreddels commented 1 year ago

The memory is exploding going to hdf5 because we keep .takeing

Yeah, when using sliced arrays, that seems to be the case. It will try to concatenate them first, which will explode the memory use!

This is quite a bad arrow situation... :/ digesting this

Ben-Epstein commented 1 year ago

@maartenbreddels I updated the function to use the array_types take function, but I wasn't able to update the others. take is used in a bunch of places, and it isn't always called on a pyarrow array. Without typing, it's not clear to me when it's a dataframe, a dataset, or something else.

I tried to dig into the dataframe take, which led to the dataset take and then this DatasetTake class, but I can't really understand what it's doing. Would you mind helping me out with that part?

maartenbreddels commented 1 year ago

I'll scan over it tomorrow!