Ben-Epstein opened 1 year ago
The memory is exploding going to hdf5 because we keep `.take`ing.
Yeah, when using sliced arrays, that seems to be the case. It will try to concatenate them first, which will explode the memory use!
This is quite a bad arrow situation... :/ digesting this
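The copy-vs-view distinction under discussion can be illustrated with numpy (pyarrow's `take` behaves analogously, materializing new buffers and concatenating chunks, while `slice` is zero-copy); this is a minimal sketch, not vaex code:

```python
import numpy as np

# A large array standing in for a column loaded from disk.
arr = np.arange(10_000_000)

# take() materializes a brand-new buffer, even when the requested
# indices are a contiguous range.
taken = arr.take(np.arange(1_000, 2_000))
print(taken.base is None)   # True: `taken` owns a fresh copy

# A plain slice is a zero-copy view onto the original buffer.
sliced = arr[1_000:2_000]
print(sliced.base is arr)   # True: no data was copied

# Both produce the same values.
print(np.array_equal(taken, sliced))  # True
```

Repeated over many chunks, those fresh copies are exactly where the memory blow-up comes from.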
@maartenbreddels i updated the function to use the array_types `take` function, but i wasn't able to update the others. `take` is used in a bunch of places and they aren't always from a pyarrow array. Without typing, it's not clear to me when it's on a dataframe, or a dataset, or something else.
I tried to dig into the dataframe `take`, which led to the dataset `take` and then this `DatasetTake` class, but I can't really understand what it's doing. Would you mind helping me out with that part?
I'll scan over it tomorrow!
This directly addresses https://github.com/vaexio/vaex/issues/2335 and is the fix for https://issues.apache.org/jira/browse/ARROW-9773 (which is now https://github.com/apache/arrow/issues/33049).
I believe fixing all of the `.take`s to `.slice` would also fix https://github.com/vaexio/vaex/issues/2334, because `.take` uses memory but `.slice` is zero-copy. The memory is exploding going to hdf5 because we keep `.take`ing.

You can see huggingface datasets does the same thing: https://github.com/huggingface/datasets/pull/645/files
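The proposed fix boils down to: when the requested indices are really a contiguous range, replace the copying `take` with a zero-copy slice. A rough sketch of that idea, using numpy for self-containment (`take_or_slice` is a hypothetical helper, not an actual vaex function; pyarrow's `Array.slice` would play the same zero-copy role there):

```python
import numpy as np

def take_or_slice(arr, indices):
    """Use a zero-copy slice when `indices` form a contiguous ascending
    range; fall back to a copying take otherwise.

    Hypothetical helper illustrating the proposed change, not vaex API.
    """
    indices = np.asarray(indices)
    if indices.size > 0 and np.array_equal(
        indices, np.arange(indices[0], indices[-1] + 1)
    ):
        # Contiguous range: a plain slice, which is a view, no allocation.
        return arr[indices[0] : indices[-1] + 1]
    # General case: non-contiguous indices still require a copy.
    return arr.take(indices)

data = np.arange(100)
view = take_or_slice(data, np.arange(10, 20))
print(view.base is data)   # True: contiguous range became a slice
copy = take_or_slice(data, np.array([3, 1, 4]))
print(copy.base is None)   # True: scattered indices still copy
```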
That being said, there are a number of other places that vaex uses `.take` which should be fixed. But because of the lack of typing in the vaex repo, it's hard for me to know which ones are pyarrow arrays, which are pyarrow tables, and which are numpy arrays. I'm happy to help move the rest over, but I would need some guidance.

Here are all of the places `.take` is used
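For the call sites where the array type isn't known statically, one way to route the call at runtime might look like the following sketch (`dispatch_take` is hypothetical, not vaex's actual dispatcher; the pyarrow branch is detected by module name so the example runs without pyarrow installed):

```python
import numpy as np

def dispatch_take(arr, indices):
    """Route a take based on runtime type, since the call sites are
    untyped. Hypothetical sketch, not vaex code."""
    if type(arr).__module__.startswith("pyarrow"):
        # pyarrow Array, ChunkedArray, and Table all expose .take(indices).
        return arr.take(indices)
    if isinstance(arr, np.ndarray):
        return np.take(arr, indices)
    # Fall back to any other object exposing a take method.
    return arr.take(indices)

print(dispatch_take(np.arange(10), [2, 4, 6]))  # [2 4 6]
```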