vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Inconsistent and misleading binby results for empty bins in minmax #1673

Open schwingkopf opened 2 years ago

schwingkopf commented 2 years ago

Description

The output values for empty bins in calls to minmax are inconsistent across data types. They are also misleading, because they cannot be distinguished from non-empty bins that genuinely hold the same value. This is most critical for uint8/16/32 and int8/16, where 0 is returned for empty bins.

Example code:

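import numpy as np
import vaex

# One column per dtype, each holding the values 0..4.
dtypes = ["uint8", "int8", "uint16", "int16", "uint32", "int32", "uint64", "int64", "float16", "float32", "float64"]
test_data = {str(dtype): np.arange(0, 5, dtype=dtype) for dtype in dtypes}
df = vaex.from_dict(test_data)

print(df)
for t in dtypes:
    # Bin by the "uint8" column: 7 bins over [0, 6]. The values 0..4 leave the last two bins empty.
    minmax = df.minmax(t, "uint8", [0, 6], 7)
    # minmax[-1] is the [min, max] pair of the last (empty) bin.
    print(f"Empty bin value of MinMax for {t}: {minmax[-1]}")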

Output:

#    uint8    int8    uint16    int16    uint32    int32    uint64    int64    float16    float32    float64
  0        0       0         0        0         0        0         0        0          0          0          0
  1        1       1         1        1         1        1         1        1          1          1          1
  2        2       2         2        2         2        2         2        2          2          2          2
  3        3       3         3        3         3        3         3        3          3          3          3
  4        4       4         4        4         4        4         4        4          4          4          4

Empty bin value of MinMax for uint8: [0 0]
Empty bin value of MinMax for int8: [0 0]
Empty bin value of MinMax for uint16: [0 0]
Empty bin value of MinMax for int16: [0 0]
Empty bin value of MinMax for uint32: [0 0]
Empty bin value of MinMax for int32: [-2147483648 -2147483648]
Empty bin value of MinMax for uint64: [9223372036854775808 9223372036854775808]
Empty bin value of MinMax for int64: [-9223372036854775808 -9223372036854775808]
Empty bin value of MinMax for float16: [ inf -inf]
Empty bin value of MinMax for float32: [ inf -inf]
Empty bin value of MinMax for float64: [ inf -inf]

Float types: The minimum is +inf and the maximum is -inf. Maybe this should be the other way around? Either way, I'd find np.nan more appropriate than +/-inf.

Integer types: This is quite messy. I understand that np.nan is not available for integer types, but the returned values are inconsistent. Unsigned types: uint8/16/32 return 0, while uint64 returns 9223372036854775808 (2^63, which is not the maximum of the type). Signed types: int8/16 return 0, while int32/64 return the minimum of the type.
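To make the integer cases concrete, the empty-bin values from the output above can be compared against each type's range as reported by np.iinfo. This is only a sketch using NumPy: the `observed` dictionary is copied by hand from the output, and `describe` is a hypothetical helper, not part of vaex.

```python
import numpy as np

# Empty-bin values copied by hand from the output above
# (the reported min and max were identical for every integer type).
observed = {
    "uint8": 0,
    "int8": 0,
    "uint16": 0,
    "int16": 0,
    "uint32": 0,
    "int32": -2147483648,
    "uint64": 9223372036854775808,
    "int64": -9223372036854775808,
}


def describe(dtype_name, value):
    """Hypothetical helper: relate `value` to the range of an integer dtype."""
    info = np.iinfo(np.dtype(dtype_name))
    if value == info.min:
        return "type minimum"
    if value == info.max:
        return "type maximum"
    return "neither extreme"


for name, value in observed.items():
    print(f"{name:>6}: {value} -> {describe(name, value)}")
```

Under this check, 0 is the type minimum for the unsigned types but neither extreme for int8/int16, and the uint64 value equals 2**63 rather than the type maximum 2**64 - 1.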

General: From the result of minmax the user actually has no way to tell the difference between an empty bin and a bin truly having the "default empty bin return" value. Proposed ways to improve this are:

  1. Return a Python list instead of a NumPy array, with empty bins returned as empty lists or filled with None (the output format could be selected by a function argument)
  2. Provide a way to get information about empty bins, e.g. returning an additional list with indices of empty bins
  3. Return a masked numpy array
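As a rough illustration of proposals 2 and 3, the sketch below computes a binned min/max with plain NumPy, counts the rows per bin, and masks the bins that received no data. It does not use or describe vaex internals: `binned_minmax_masked` is a hypothetical function, and it assumes the bin index of every row is already known and lies in [0, n_bins).

```python
import numpy as np


def binned_minmax_masked(values, bin_index, n_bins):
    """Hypothetical sketch: per-bin [min, max] with empty bins masked.

    Assumes every entry of `bin_index` lies in [0, n_bins).
    """
    values = np.asarray(values)
    bin_index = np.asarray(bin_index, dtype=np.intp)

    # Start each bin at the identity of min/max for the value dtype.
    if np.issubdtype(values.dtype, np.integer):
        info = np.iinfo(values.dtype)
        lo_init, hi_init = info.max, info.min
    else:
        lo_init, hi_init = np.inf, -np.inf
    lo = np.full(n_bins, lo_init, dtype=values.dtype)
    hi = np.full(n_bins, hi_init, dtype=values.dtype)

    # Unbuffered in-place reductions, so repeated bin indices accumulate.
    np.minimum.at(lo, bin_index, values)
    np.maximum.at(hi, bin_index, values)

    # Row counts per bin tell us which bins never saw any data.
    counts = np.bincount(bin_index, minlength=n_bins)
    empty = counts == 0

    result = np.stack([lo, hi], axis=1)
    mask = np.repeat(empty[:, None], 2, axis=1)
    return np.ma.masked_array(result, mask=mask), np.flatnonzero(empty)


# Values 0..4 spread over 7 bins, mirroring the example in this report.
data = np.arange(0, 5, dtype=np.uint8)
masked, empty_bins = binned_minmax_masked(data, np.arange(5), 7)
print(masked)
print("empty bins:", empty_bins)
```

With this approach, bin 0 (which really contains the value 0) stays unmasked, while bins 5 and 6 are masked. Calling `masked.tolist()` turns masked entries into None, which is close to proposal 1.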

Software information

maartenbreddels commented 2 years ago

Hi,

Thanks for the detailed report. The initial values should be [max value for type, min value for type], so that we can successively apply min and max to the current value and the incoming data. So, assuming this should be the behaviour, this seems like a bug. I think what we could do in the future is keep track of the counts and return a masked array when data is missing. For compatibility we could make this opt-in, and turn it into an opt-out in the next major release. Does that sound reasonable to you?
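The [max, min] initialisation described above can be sketched with NumPy as follows. This is not vaex's actual implementation: `minmax_identity` and `running_minmax` are hypothetical names, and the chunked loop only stands in for whatever streaming the library does. It also shows why floats report [inf, -inf] for a bin that never receives data.

```python
import numpy as np


def minmax_identity(dtype):
    """Hypothetical helper: starting (min, max) for a running reduction."""
    dtype = np.dtype(dtype)
    if np.issubdtype(dtype, np.integer):
        info = np.iinfo(dtype)
        # Any real value is <= info.max and >= info.min.
        return info.max, info.min
    if np.issubdtype(dtype, np.floating):
        return float("inf"), float("-inf")
    raise TypeError(f"unsupported dtype: {dtype}")


def running_minmax(chunks, dtype):
    """Fold min and max over chunks of data, starting from the identity."""
    lo, hi = minmax_identity(dtype)
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=dtype)
        if chunk.size:
            lo = min(lo, chunk.min().item())
            hi = max(hi, chunk.max().item())
    return lo, hi


print(running_minmax([[3, 1], [4]], "uint8"))  # data arrives in two chunks
print(running_minmax([], "uint8"))             # no data: the identity is returned
print(running_minmax([], "float32"))           # no data: the identity is returned
```

If nothing is ever folded in, the result is simply the starting pair, so under this scheme an empty uint8 bin would report [255, 0] rather than [0 0].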

cheers,

Maarten

duplicate of https://github.com/vaexio/vaex/issues/1422

schwingkopf commented 2 years ago

Thanks for looking into this! Your solution sounds reasonable to me.

Other improvements/feature requests in this respect may be:

I think both features have only few use cases, but if they're easy to implement could be worth doing.