scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License

Row-wise weighted mean gives incorrect results #3285

Open nj-vs-vh opened 2 hours ago

nj-vs-vh commented 2 hours ago

Version of Awkward Array

2.6.9

Description and code to reproduce

I am computing a row-wise weighted mean of an Awkward Array and getting wrong results.

Here's an MRE with outputs in the comments:

import awkward as ak

data = ak.Array(
    [
        [1, 2, 3],
        [4, 5],
    ]
)

weight = ak.Array(
    [
        [1, 1, 2],
        [1, 10],
    ]
)

# manual row-by-row - expected results
print(ak.mean(data[0], weight=weight[0]))  # -> 2.25
print(ak.mean(data[1], weight=weight[1]))  # -> 4.909090909090909

# manual vectorized - expected results
weights_norm = weight / ak.sum(weight, axis=1)
print(ak.sum(weights_norm * data, axis=1))  # -> [2.25, 4.91]

# the most natural call I expected to work - incorrect result in the 2nd row
print(ak.mean(data, weight=weight, axis=1))  # -> [2.25, 13.5]
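
For reference, the expected values above can be checked by hand without Awkward Array. The following is a minimal pure-Python sketch of the definition sum(w * x) / sum(w) applied per row; the helper name `weighted_mean` is made up for illustration and is not part of the library. The last lines record an arithmetic observation only, not a confirmed diagnosis: the 13.5 reported for the second row equals that row's weighted sum (54) divided by the first row's weight sum (4).

```python
# Pure-Python reference for the row-wise weighted mean (no Awkward Array needed).
data = [[1, 2, 3], [4, 5]]
weight = [[1, 1, 2], [1, 10]]


def weighted_mean(values, weights):
    """Hypothetical helper: sum(w * x) / sum(w) for a single row."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# One weighted mean per row, matching the "expected results" in the MRE.
expected = [weighted_mean(x, w) for x, w in zip(data, weight)]
print(expected)  # [2.25, 4.909090909090909]

# Observation, not a diagnosis: dividing the second row's weighted sum
# by the FIRST row's weight sum reproduces the incorrect value.
second_row_numerator = sum(w * x for x, w in zip(data[1], weight[1]))  # 54
print(second_row_numerator / sum(weight[0]))  # 13.5
```

If this coincidence holds for other inputs, it would suggest that the denominator is not being taken from the matching row, but that would need to be confirmed against the ak.mean implementation.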

nj-vs-vh commented 2 hours ago

It may also be worth noting that, on a large dataset, the manual vectorized operation (normalize the weights by their row-wise sum, multiply the data by them, and sum row-wise) is much faster than ak.mean(data, weight, axis=1). For my dataset the latter takes ~10 s and the former <0.1 s.