Closed schajee closed 3 years ago
Additionally, when I apply a custom function to the dataframe to filter out NaNs...
function mean_vals(x) {
return x.dropna().mean()
}
df.apply({ axis: 1, callable: mean_vals })
I get...
Callable Error: You can only apply JavaScript functions on DataFrames when axis is not specified. This operation is applied on all element, and returns a DataFrame of the same shape.
Even though the same works without .dropna()
@schajee Thanks for raising this issue.
I just realized that I fixed the issue in the Series class only.
In the case of a DataFrame, there are some concerns.
First, we are computing the mean on a DataFrame using the TensorFlow.js (tfjs) .mean function. This .mean function, and tfjs arithmetic operations in general, return NaN if any field is NaN or undefined. This in turn affects the mathematical operation.
For example:
const a = tf.tensor([ [ 11, 20, 3 ],
[ NaN, 15, 6 ],
[ 2, 30, 40 ],
[ 2, 89, 78 ]])
a.print()
const b = a.mean(0) // mean along axis 0, i.e. one value per column
b.print()
// outputs
Tensor
[[11 , 20, 3 ],
[NaN, 15, 6 ],
[2 , 30, 40],
[2 , 89, 78]]
Tensor
[NaN, 38.5, 31.75]
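The NaN in the first column is ordinary IEEE 754 floating-point behaviour rather than something specific to tfjs: any sum that includes NaN is itself NaN. A minimal plain JavaScript sketch of the same calculation, without tfjs (`columnMean` is a hypothetical helper written for illustration, not part of any library):

```javascript
// Hypothetical helper: naive mean of one column, with no missing-value handling.
function columnMean(rows, col) {
  let total = 0;
  for (const row of rows) {
    total += row[col]; // NaN + anything is NaN, so a single NaN poisons the sum
  }
  return total / rows.length;
}

const rows = [
  [11, 20, 3],
  [NaN, 15, 6],
  [2, 30, 40],
  [2, 89, 78],
];

const means = [0, 1, 2].map((col) => columnMean(rows, col));
console.log(means); // [ NaN, 38.5, 31.75 ]
```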
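Now if we decide to change all NaNs to null before calculating the mean, then tfjs internally sets all null values to 0. This will affect the calculation of averages like mean, where we divide by the total number of observations: the substituted 0 still counts as an observation, so the first column below averages (11 + 0 + 2 + 2) / 4 = 3.75 instead of (11 + 2 + 2) / 3 = 5.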
So for example if we do the following in tfjs:
const a = tf.tensor([ [ 11, 20, 3 ],
[ null, 15, 6 ],
[ 2, 30, 40 ],
[ 2, 89, 78 ]])
a.print()
const b = a.mean(0) // mean along axis 0, i.e. one value per column
b.print()
// outputs
Tensor
[[11, 20, 3 ],
[0 , 15, 6 ],
[2 , 30, 40],
[2 , 89, 78]]
Tensor
[3.75, 38.5, 31.75]
So there are two options: we either compute the mean while counting missing observations, or compute it without counting them.
In order to be consistent with the Series implementation and the Pandas API in general, we'll remove all NaNs before computation. If this isn't your desired result, then it is better to replace all missing values in a DataFrame before calling the mean or sum operation.
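To make the two options concrete, here is a rough plain JavaScript sketch. It is not the danfo.js or tfjs implementation; `isMissing` and `columnMeans` are hypothetical helpers, and treating null, undefined and NaN alike as "missing" is an assumption made for the example.

```javascript
// Hypothetical helpers for illustration only (not danfo.js or tfjs API).
// Assumption: null, undefined and NaN all count as missing values.
function isMissing(v) {
  return v === null || v === undefined || Number.isNaN(v);
}

// Column-wise mean of a 2D array.
// skipMissing = false: missing values count as observations of 0.
// skipMissing = true:  missing values are dropped before averaging.
function columnMeans(rows, skipMissing) {
  const nCols = rows[0].length;
  const means = [];
  for (let col = 0; col < nCols; col++) {
    let total = 0;
    let count = 0;
    for (const row of rows) {
      const v = row[col];
      if (isMissing(v)) {
        if (!skipMissing) count += 1; // still counted, contributes 0
        continue;
      }
      total += v;
      count += 1;
    }
    // A column with no usable values has no defined mean.
    means.push(count === 0 ? NaN : total / count);
  }
  return means;
}

const data = [[11, 20, 3], [null, 15, 6], [2, 30, 40], [2, 89, 78]];
console.log(columnMeans(data, false)); // [ 3.75, 38.5, 31.75 ]
console.log(columnMeans(data, true));  // [ 5, 38.5, 31.75 ]
```

Only the first column differs between the two calls, because it is the only one with a missing value: 15 / 4 when the gap is counted, 15 / 3 when it is skipped.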
PS: I'll start a fix for this.
Describe the bug
When calculating the mean() or sum() of a DataFrame, NaNs are not ignored and the output contains NaNs.
Issue #144 says that this behavior is addressed from 0.2.0 onwards.
To Reproduce
const data = [[11, 20, 3], [null, 15, 6], [2, 30, 40], [2, 89, 78]]
let df = new dfd.DataFrame(data)
df.mean().print()
// or
df.sum().print()
Desktop (please complete the following information):