vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Question: What does `dropna()` do? #2084

Closed abeburnett closed 2 years ago

abeburnett commented 2 years ago

When run against a whole dataframe (e.g., `df.dropna()`), what is the expected behavior? I would expect it to drop only those rows which are entirely populated by na/nan, but it seems like it may be dropping every row which has an na/nan in any column.

I'd prefer to only drop rows which are completely na/nan in every column. How can I do this with vaex?

As a sidenote, the reason I need this is that after importing 222 parquet files I ended up with a bunch of rows filled with na/nan in all columns, and also the same number of columns duplicated but with generic names and no data (filled with na/nan). E.g., columns like COL_1 and COL_2 filled with blanks.

Anyway, thanks in advance!

JovanVeljanoski commented 2 years ago

The behavior you describe, dropping every row that has an na/nan in any column, is consistent with pandas and is the intended behavior.

I don't know of any good way of doing this right now. Vaex currently only does column operations. You will probably have to construct a full expression that decides whether a given row should be dropped, and then filter on it.
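A minimal sketch of that kind of expression, assuming `df` is a vaex DataFrame and that `isna()` is the right per-column test for the data at hand. The helper name `drop_all_na_rows` is made up for illustration and is not part of vaex:

```python
from functools import reduce
import operator


def drop_all_na_rows(df, column_names=None):
    """Return df without the rows that are na/nan in every column."""
    # Assumption: df is a vaex DataFrame, so get_column_names() lists its
    # columns and df[name].isna() gives a boolean expression per column.
    column_names = column_names or df.get_column_names()
    # One flag per column: True where that column's value is na/nan.
    flags = [df[name].isna() for name in column_names]
    # AND the flags together: True only where every column is na/nan.
    all_na = reduce(operator.and_, flags)
    # Keep the complement, i.e. rows with at least one real value.
    return df[~all_na]
```

It would be called as `df = drop_all_na_rows(df)`. If the empty cells are stored as masked/missing values rather than NaN, `ismissing()` or `isnan()` may be the more precise test to use in place of `isna()`.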

NickCrews commented 2 years ago

Looking at the code, it actually looks pretty easy to add this, so you could do either `df.dropna(how="any")` or `df.dropna(how="all")`, which is what pandas does. The same change would also give us this option for `dropnan`, `dropinf`, etc.

It would just need a different case at https://github.com/vaexio/vaex/blob/master/packages/vaex-core/vaex/dataframe.py#L5070-L5075 so that instead of doing `expression = expression | f(self[column])` we do `expression = expression & f(self[column])`.

I could write a PR if this is desired @JovanVeljanoski ?
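The difference between the two cases can be sketched as a fold over the per-column flags. This is only an illustration of the idea, not the code in `dataframe.py`; the helper name `combine_flags` is hypothetical, and plain booleans stand in for the per-column expressions so the example runs without vaex:

```python
from functools import reduce
import operator


def combine_flags(flags, how="any"):
    """Fold per-column "is na" flags into a single "drop this row" flag."""
    if how == "any":
        # OR: drop the row if at least one column is na.
        return reduce(operator.or_, flags)
    if how == "all":
        # AND: drop the row only if every column is na.
        return reduce(operator.and_, flags)
    raise ValueError(f"how must be 'any' or 'all', got {how!r}")


# Flags for one row with three columns, two of which are na.
row = [True, False, True]
print(combine_flags(row, how="any"))  # True
print(combine_flags(row, how="all"))  # False
```

Because the helper relies only on `|` and `&`, the same fold applies to any objects that overload those operators, which is what the `expression | f(self[column])` snippet above depends on.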

JovanVeljanoski commented 2 years ago

@NickCrews Sure, you can give it a shot! If you attempt to do this, you might also wanna do it for dropmissing and dropnan, for consistency between those. Thanks!