vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] split_random is slow on larger datasets #1693

Closed: Ben-Epstein closed this issue 2 years ago

Ben-Epstein commented 2 years ago


Description

When scaling to tens of millions of rows, split_random gets very slow. When testing with 1 billion rows, I couldn't get it to finish.

(screenshot omitted)

It doesn't seem to be any faster when testing with only 3 columns, either. (screenshot omitted)


JovanVeljanoski commented 2 years ago

Yeah, I don't think there is any escaping this. However, it might be worth understanding why this happens, and how to get around it (if it suits your case).

So when you do split_random you get a bunch of shallow copies of the original dataframe. For this to be possible, a random mask is generated, and depending on the number of splits and the total size of your dataframe, evaluating that mask can be quite computationally expensive. You can think of it as a rather complicated (and expensive) filter. It makes sense, right: since there is no "rule" for how the splits are done, there is no way to optimize, so this is probably the slowest single filter you can have (this is not a challenge :P).

What you can do instead is shuffle your data and export the shuffled version to disk. Then you can just do split, which will be ordered and thus much faster to evaluate. You still have to "pay the price", but only once, instead of every time you call a certain split. It might help with reproducibility too (instead of relying on a random state).
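A minimal sketch of this shuffle-once approach, assuming vaex 4.x; `df.sample(frac=1)` is the shuffle idiom per the vaex docs, and the file names, fractions, and seed here are placeholders:

```python
import vaex

df = vaex.open('big.hdf5')

# Pay the shuffle cost once: write a shuffled copy to disk.
df.sample(frac=1, random_state=42).export_hdf5('big_shuffled.hdf5')

# Ordered splits on the shuffled file are contiguous slices,
# so they are much cheaper to evaluate than a random mask.
df_shuffled = vaex.open('big_shuffled.hdf5')
df_train, df_val, df_test = df_shuffled.split(into=[0.7, 0.15, 0.15])
```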

Ben-Epstein commented 2 years ago

@JovanVeljanoski / anyone else interested in another approach: what I'm going to do instead (since my dataframes always have an id) is the following:

  1. Get all unique IDs for my dataframe
  2. Take a random sample of those
  3. Filter the dataframe based on those IDs

Each of those steps by itself is quite fast, so I'd expect the combination to be fast too (although I haven't tested it yet). Does that seem reasonable?

For people here without IDs in their dataframes, you can actually add an ID as a virtual column pretty easily with vaex.vrange (https://vaex.io/docs/api.html#vaex.vrange); a sketch covering both cases is below.
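A minimal sketch of this ID-sampling approach, assuming an integer id column; the file name, sample fraction, and seed are placeholders, not from the thread:

```python
import numpy as np
import vaex

df = vaex.open('big.hdf5')

# For dataframes without an id, add one as a cheap virtual column.
df['id'] = vaex.vrange(0, len(df))

# 1. Get all unique IDs.
ids = np.asarray(df['id'].unique())

# 2. Take a random sample of, say, 30% of them.
rng = np.random.default_rng(42)
sample = rng.choice(ids, size=int(0.3 * len(ids)), replace=False)

# 3. Filter the dataframe on the sampled IDs.
df_subset = df[df['id'].isin(sample)]
```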

UPDATE: It's actually slower

(screenshot omitted)

JovanVeljanoski commented 2 years ago

This seems valid, but it is basically the same thing that split_random does. It would be very interesting/surprising if this were faster.

Also, note that what you are doing is filtering, so your train/val/test sets do come with "history" (i.e. this filter). You might want to call .extract() on them, or something like that (@maartenbreddels will know better).
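For example (per the vaex API docs, extract returns a dataframe with the pending filter applied and the filtered-out rows dropped; the variable names are assumptions):

```python
# Materialize the filter so the splits carry no filter "history".
df_train = df_train.extract()
df_val = df_val.extract()
df_test = df_test.extract()
```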

maartenbreddels commented 2 years ago

I'm not sure it's the same as split_random. Conceptually it is, but split_random is using df.take. Happy to hear about performance differences; we might be able to improve split_random based on that.
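For context, a rough sketch of the take-based idea on synthetic data; this is only an illustration of the concept, not vaex's actual split_random implementation:

```python
import numpy as np
import vaex

df = vaex.from_arrays(x=np.arange(10))

# Shuffle the row indices once, then take contiguous chunks of them;
# df.take materializes rows by index instead of evaluating a filter.
indices = np.random.RandomState(42).permutation(len(df))
cut = int(0.3 * len(df))
df_a = df.take(indices[:cut])
df_b = df.take(indices[cut:])
```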

Ben-Epstein commented 2 years ago

@maartenbreddels I didn't notice any major performance differences, but I did notice something extremely interesting. I'm having a tough time reproducing it in Jupyter, but on our server it's 100% consistent and reproducible.

Here is the workflow:

  1. Download the file from MinIO to local storage
  2. Read it with vaex and store the dataframe in a class attribute
  3. Filter the dataframe on a list of ids: `df_filtered = df_copy[df_copy["id"].isin([1,5,3,6,4...])]`
  4. Take a random split: `df_subset, _ = df_filtered.split_random(into=0.3..., random_state=...)`
  5. Get the results: `res = df_subset.to_records()`

Whenever we filter on the list of values (the `isin` line), then call `split_random`, then take any action (`to_records` etc.), we get `TypeError: You have passed in an object for which we cannot determine a fingerprint`.
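For reference, a hypothetical minimal reproduction of that combination on synthetic data; the reporter could not reproduce it in Jupyter, so this sketch may not trigger the error in every environment:

```python
import numpy as np
import vaex

# Synthetic stand-in for the real data.
df = vaex.from_arrays(id=np.arange(1_000_000), x=np.random.rand(1_000_000))

# isin filter -> split_random -> materialize: the failing combination.
df_filtered = df[df['id'].isin([1, 5, 3, 6, 4])]
df_subset, _ = df_filtered.split_random(into=[0.3, 0.7], random_state=42)
res = df_subset.to_records()  # the fingerprint TypeError surfaced here
```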

There are two ways I've found to solve it:

  1. Filter using `==` once for each id; this is not scalable, so I cannot use it
  2. Use my method above (take a random sample of IDs and filter on those); this works perfectly

It's some odd combination of an `isin` filter followed by a `split_random` call that loses the fingerprint. Any other combination works as expected. I have no idea why, but maybe that helps in some way?

Here's the full trace, attached as log.txt. I'm really confused about why dask is being invoked; we aren't touching dask at all.

EDIT: Another interesting thing I noticed while testing: after the first `isin` filter, if I call `extract` I also lose the fingerprint. Would you be able to explain a bit more how exactly that fingerprint is calculated and maintained? It's a bit of a black box for me, so it's hard to help diagnose.

maartenbreddels commented 2 years ago

Would you mind opening a new issue for this?

Ben-Epstein commented 2 years ago

Opened https://github.com/vaexio/vaex/issues/1752. Thanks @maartenbreddels