javascriptdata / danfojs

Danfo.js is an open source, JavaScript library providing high performance, intuitive, and easy to use data structures for manipulating and processing structured data.
https://danfo.jsdata.org/
MIT License

Handling NaNs when calculating mean or sum #200

Closed schajee closed 3 years ago

schajee commented 3 years ago

Describe the bug

When calculating the mean() or sum() of a DataFrame, NaNs are not ignored, so the output contains NaN for any column with a missing value.

Issue #144 says this behavior was addressed from 0.2.0 onwards.

To Reproduce

  1. data = [[11, 20, 3], [null, 15, 6], [2, 30, 40], [2, 89, 78]]
  2. let df = new dfd.DataFrame(data)
  3. df.mean().print() or df.sum().print()

Current behavior

Mean     Sum
NaN      NaN
38.5     154
31.75    127

Expected behavior

Mean     Sum
5        15
38.5     154
31.75    127


schajee commented 3 years ago

Additionally, when I apply a custom function to the dataframe to filter out NaNs...

function mean_vals(x) {
    return x.dropna().mean()
}

df.apply({ axis: 1, callable: mean_vals })

I get...

Callable Error: You can only apply JavaScript functions on DataFrames when axis is not specified. This operation is applied on all element, and returns a DataFrame of the same shape.

The same call works without .dropna().

risenW commented 3 years ago

@schajee Thanks for raising this issue.

I just realized that I fixed the issue in the Series class only.

In the case of a DataFrame, there are some concerns. First, we compute the mean of a DataFrame using the TensorFlow.js (tfjs) .mean function. This function, and tfjs arithmetic operations in general, return NaN if any element is NaN or undefined, so a single missing value makes the whole column's result NaN. For example:

import * as tf from "@tensorflow/tfjs"

const a = tf.tensor([ [ 11, 20, 3 ],
                      [ NaN, 15, 6 ],
                      [ 2, 30, 40 ],
                      [ 2, 89, 78 ]])
a.print()
// mean along axis 0, i.e. one value per column
const b = a.mean(0)
b.print()
// outputs
// Tensor
//     [[11 , 20, 3 ],
//      [NaN, 15, 6 ],
//      [2  , 30, 40],
//      [2  , 89, 78]]
// Tensor
//     [NaN, 38.5, 31.75]

Now, if we decide to change all NaNs to null before calculating the mean, tfjs internally sets all null values to 0. This distorts averages like the mean, because the zero-filled entries still count toward the total number of observations that we divide by.

So for example if we do the following in tfjs:

import * as tf from "@tensorflow/tfjs"

const a = tf.tensor([ [ 11, 20, 3 ],
                      [ null, 15, 6 ],
                      [ 2, 30, 40 ],
                      [ 2, 89, 78 ]])
a.print()
// mean along axis 0, i.e. one value per column
const b = a.mean(0)
b.print()
// outputs
// Tensor
//     [[11, 20, 3 ],
//      [0 , 15, 6 ],
//      [2 , 30, 40],
//      [2 , 89, 78]]
// Tensor
//     [3.75, 38.5, 31.75]
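
To make the difference concrete, here is a minimal plain-JavaScript sketch of the arithmetic for the first column only. It does not use tfjs or danfojs; the isMissing helper and the variable names are made up for illustration.

```javascript
// First column of the example data, with the missing entry written as null.
const column = [11, null, 2, 2];

// Illustrative helper (not a tfjs/danfojs API): null, undefined and NaN count as missing.
const isMissing = (v) => v === null || v === undefined || Number.isNaN(v);

// Zero-fill the missing entry and divide by the full row count (what tfjs effectively does with null).
const zeroFilled = column.map((v) => (isMissing(v) ? 0 : v));
const meanCountingMissing = zeroFilled.reduce((s, v) => s + v, 0) / column.length;

// Drop the missing entry and divide by the number of observed values only.
const observed = column.filter((v) => !isMissing(v));
const meanSkippingMissing = observed.reduce((s, v) => s + v, 0) / observed.length;

console.log(meanCountingMissing); // 3.75 (15 / 4)
console.log(meanSkippingMissing); // 5    (15 / 3)
```

The sum is 15 either way; only the denominator changes, which is why sum is unaffected by zero-filling but mean is not.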

So there are two options: compute the mean while counting missing observations (treating them as 0), or compute it without counting them.

To be consistent with the Series implementation and the pandas API in general, we'll remove all NaNs before computation. If that isn't the result you want, replace all missing values in the DataFrame before calling mean or sum.
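
As a rough sketch of what "remove all NaNs before computation" means column by column, the following plain JavaScript skips missing entries when accumulating. This is not the code from the fix, and isMissing and columnSumAndMean are hypothetical names, not danfojs APIs. It assumes rectangular row-major input, and that an all-missing column should give sum 0 and mean NaN.

```javascript
// Hypothetical helper (not a danfojs API): true for null, undefined, or NaN.
function isMissing(value) {
  return value === null || value === undefined || Number.isNaN(value);
}

// Per-column sums and means over an array of rows, ignoring missing entries.
// Assumption: a column with no observed values gets sum 0 and mean NaN.
function columnSumAndMean(rows) {
  const nCols = rows.length === 0 ? 0 : rows[0].length;
  const sums = new Array(nCols).fill(0);
  const counts = new Array(nCols).fill(0);
  for (const row of rows) {
    for (let j = 0; j < nCols; j++) {
      if (!isMissing(row[j])) {
        sums[j] += row[j];
        counts[j] += 1; // only observed values enter the denominator
      }
    }
  }
  const means = sums.map((s, j) => (counts[j] === 0 ? NaN : s / counts[j]));
  return { sums, means };
}

const data = [[11, 20, 3], [null, 15, 6], [2, 30, 40], [2, 89, 78]];
const { sums, means } = columnSumAndMean(data);
console.log(sums);  // [ 15, 154, 127 ]
console.log(means); // [ 5, 38.5, 31.75 ]
```

With the data from the original report this reproduces the expected Mean and Sum columns.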

PS: I'll start a fix for this.

risenW commented 3 years ago

FIXED IN https://github.com/opensource9ja/danfojs/pull/210