vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.28k stars 590 forks source link

FEAT: Add how="any","all" to df.dropna, dropinf, etc #2104

Closed NickCrews closed 2 years ago

NickCrews commented 2 years ago

Fixes https://github.com/vaexio/vaex/issues/2084

I just added tests for dropna(), because I thought the verbosity wouldn't be worth it to add tests for dropinf, etc. But let me know what you think.

I also thought about putting the if/else check in _filter_all outside of the loop, to save a bit of overhead, but I figured that looping through even ~1000 columns would still be overshadowed by the actual going through the data.

maartenbreddels commented 2 years ago

Love it ❤️

NickCrews commented 2 years ago

Need me to pull in that fixup commit that addresses the pyarrow 6 failures in CI?

JovanVeljanoski commented 2 years ago

@NickCrews yes please - please rebase on master.

Looks nice tho. I will review and test over the weekend. Thanks!

JovanVeljanoski commented 2 years ago

Can you rebase on master again please (i think the first time I suggested this, the fix for the where was not merged yet, but i thought it was...)

NickCrews commented 2 years ago

OK, that push to 349a939 has all the basic suggestions. I can try the factory test fixture too if you point me at an example that uses it.

re: performance: @JovanVeljanoski do you get the same numbers when doing how="any"? Or are you saying that this performance should be looked at regardless of this change?

JovanVeljanoski commented 2 years ago

For the in-memory dataset, they are roughly the same,

For the gdelt dataset, how=any is 20-30s faster (consistently amongst 10)

For the taxi dataset how=any the performance is erratic.. sometimes it is even a bit slower than how=all. I don't meant to block this, just raising some concerns. Maybe it is a deeper issue, maybe I don't understand everything about whats happening.

Just wanted to make sure if this is "expected behavior". Perhaps it is a symptom of a deeper problem.

NickCrews commented 2 years ago

Thanks @JovanVeljanoski for those detailed tests! That's very perplexing to me, hopefully @maartenbreddels has all the answers :)

If there is a fixable cause of the slowdown, then I would say we should merge this and fix it later. The public API shouldn't change, so I don't see a danger of being backwards incompatible.

If the slowdown isn't fixable, it is just inherent in what we're doing, then it's less obvious what to do. Don't offer the option because it leads to slowness? But someone might just plain need to do this, so they're screwed. Just note that it can be slower in the docstring?

NickCrews commented 2 years ago

OK, see if that tweak is what we're looking for with the test factory

maartenbreddels commented 2 years ago

Awesome, thanks!