[Closed] Ben-Epstein closed this issue 2 years ago
Yeah, I don't think there is escaping this. However it might be worth it to understand why this happens, and how to get around it (if it suits your case).
So when you do split_random, you get a bunch of shallow copies of the original dataframe. For this to be possible, a random mask is generated. Depending on the number of splits and the total size of your dataframe, evaluating that mask can be quite computationally expensive. You can think of it as a rather complicated (and expensive) filter. It makes sense, right: since there is no "rule" for how the splits are done, there is no way to optimize, so this is probably the slowest single filter you can have (this is not a challenge :P ).
What you can do alternatively is shuffle your data and export the shuffled version to disk. Then you can just do split, which works on ordered data and is thus much faster to evaluate. You still have to "pay the price", but only once, instead of every time you call a certain split. It might help with reproducibility too (instead of relying on a random state).
@JovanVeljanoski / anyone else interested in another approach: what I'm going to do instead (since my dataframes always have an id) is the following.
Each of those steps by itself is quite fast, so I'd expect the whole thing to be fast (although I haven't tested it yet). Does that seem reasonable?
For people here without IDs in their dataframes, you can actually add an ID as a virtual column pretty easily with vaex https://vaex.io/docs/api.html#vaex.vrange
UPDATE: It's actually slower
This seems valid, but it is basically the same thing that split_random does. It would be very interesting/surprising if this is faster.
Also, note that what you are doing is filtering, so your train/val/test sets do come with "history" (i.e. this filter). So you might want to call .extract() on them, or something like that (@maartenbreddels will know better on this).
I'm not sure it's the same as split_random; conceptually it is, but this approach uses df.take. Happy to hear about performance differences, and we might be able to improve split_random based on that.
@maartenbreddels I didn't notice any major performance differences, but I did notice something extremely interesting. I am having a tough time reproducing it in Jupyter, but on our server it's 100% consistent and reproducible.
Here is the workflow:

```python
df_filtered = df_copy[df_copy["id"].isin([1,5,3,6,4...])]
df_subset, _ = df_filtered.split_random(into=0.3..., random_state=...)
res = df_subset.to_records()
```
Whenever we filter on the list of values (the isin line), then call split_random, and then trigger any action (to_records etc.), we get:

TypeError: You have passed in an object for which we cannot determine a fingerprint
There are 2 ways I've found to solve it; one is filtering with == many times, once for each id, but that is not scalable so I cannot use it. It's some odd combination of an isin filter followed by a split_random call that loses the fingerprint. Any other combination and everything works as expected. I have no idea why, but maybe that helps in some way?
Here's the full trace. I'm really confused why dask is being invoked (we aren't touching dask at all): log.txt
EDIT: Another interesting thing I noticed when testing: after the first isin filter, if I call extract, I also lose the fingerprint. Would you be able to explain a bit better how exactly that fingerprint is calculated and maintained? It's a bit of a black box for me, so it's hard to help diagnose.
Would you mind opening a new issue for this?
https://github.com/vaexio/vaex/issues/1752 Thanks @maartenbreddels
Description
When scaling to 10s of millions of rows, random split gets very slow. When testing with 1B rows, I couldn't get it to finish.
It doesn't seem to be faster when testing with only 3 columns, either
Software information
Vaex version (import vaex; vaex.__version__):
Additional information
Please state any supplementary information or provide additional context for the problem (e.g. screenshots, data, etc.).