javascriptdata / danfojs

Danfo.js is an open source, JavaScript library providing high performance, intuitive, and easy to use data structures for manipulating and processing structured data.
https://danfo.jsdata.org/
MIT License
4.8k stars 209 forks source link

NaN values don't get skipped on groupby aggregations. #421

Open kitsunde opened 2 years ago

kitsunde commented 2 years ago

Hello, maybe I'm missing something but groupby aggregations behave inconsistently with DataFrame aggregations, which is a challenge since setting NaN values to a default like 0 behaves differently depending on the aggregations.

I will explore using .apply to work around this as mentioned in https://github.com/javascriptdata/danfojs/issues/187#issuecomment-827531989

To Reproduce

const testDf = new dfd.DataFrame({ group: ['A', 'A'], val1: [1, NaN], val2: [2, 3] });
testDf.groupby(['group']).mean().print();

results in val1_mean being NaN. I.e.

╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ group             │ val1_mean         │ val2_mean         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ A                 │ NaN               │ 2.5               ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

Expected behavior

To aggregate like:

testDf.loc({ columns: ['val1', 'val2'] }).mean({ axis: 0 }).print();

i.e.

╔══════╤═════╗
║ val1 │ 1   ║
╟──────┼─────╢
║ val2 │ 2.5 ║
╚══════╧═════╝
igonro commented 2 years ago

The current behaviour I think is fine. Pandas does give an dropna option. It could be implemented, so the behaviour could be extended to this:

testDf.groupby(['group'], { dropna: true }).mean().print();
// ╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
// ║            │ group             │ val1_mean         │ val2_mean         ║
// ╟────────────┼───────────────────┼───────────────────┼───────────────────╢
// ║ 0          │ A                 │ 1.0               │ 2.5               ║
// ╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

I don't know is something like that would be what you need.

Anyway, why do you say that groupby aggregations behave inconsistently? Can you give another example with a inconsistent result?

kitsunde commented 2 years ago

Ah sorry, I copy-pasted my code poorly into the expected example, I've updated it. What I mean is that.

testDf.loc({ columns: ['val1', 'val2'] }).mean({ axis: 0 }).print();

Will ignore NaN but

testDf.groupby(['group']).mean().print();

will not. The challenge is that if I need to do something like:

agg({
  installs: ['sum'],
  value: ['mean', 'sum']
});

Different columns can have NaN on different rows, and it seems like dropna() will drop rows or columns if there's a NaN in any position, while to be able to use agg it would need to cover the values only on the column being aggregated.

The pandas option seem different from what I'm suggesting. That will drop columns and rows, but I'm just specifically talking about the aggregation behaviour of NaN values. If installs is NaN, but value is not. I would still want to aggregate on the value of that row.

Does this make sense?

kitsunde commented 2 years ago

In pandas:

import pandas as pd

d = {'group': ['A', 'A'], 'col1': [1, None], 'col2': [2, 3]}
df = pd.DataFrame(data=d)
df.groupby(['group']).mean()
df.groupby(['group']).agg({
  'col1': ['mean'],
  'col2': ['mean']
})

Both mean and agg here skips None values by default.

Repl: https://replit.com/@kitsunde/pandas-none-handling-in-groupby

GauravB159 commented 3 months ago

@kitsunde Did you figure out a way around this? I'm facing the same issue.