Open schwingkopf opened 2 years ago
Hi,
Thanks for the detailed report. The initial values should be [max value for type, min value for type], so that we can successively apply min and max to the current value and the incoming new data. Assuming that is the intended behaviour, this looks like a bug. What we could do in the future is keep track of the counts and return a masked array when data is missing. For compatibility we could make this opt-in at first, and turn it into an opt-out in the next major release. Does that sound reasonable to you?
cheers,
Maarten
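To make the proposal above concrete, here is a minimal NumPy sketch of a binned min/max that starts from [max value for type, min value for type], tracks per-bin counts, and masks empty bins. It is not vaex's implementation: the function name `binned_minmax`, its signature, and the use of +inf/-inf as starting values for floats are assumptions made for illustration.

```python
import numpy as np

def binned_minmax(x, bin_index, n_bins):
    """Per-bin (min, max) of x with empty bins masked.

    Hypothetical helper for illustration; not part of vaex.
    """
    x = np.asarray(x)
    if np.issubdtype(x.dtype, np.integer):
        lowest, highest = np.iinfo(x.dtype).min, np.iinfo(x.dtype).max
    else:
        # Assumption: floats start from +inf / -inf
        lowest, highest = -np.inf, np.inf

    # Start at [max, min] so that applying min/max moves towards the data.
    mins = np.full(n_bins, highest, dtype=x.dtype)
    maxs = np.full(n_bins, lowest, dtype=x.dtype)
    counts = np.zeros(n_bins, dtype=np.int64)

    # Unbuffered in-place reductions, so repeated bin indices are handled.
    np.minimum.at(mins, bin_index, x)
    np.maximum.at(maxs, bin_index, x)
    np.add.at(counts, bin_index, 1)

    # A bin is empty exactly when nothing was counted into it.
    empty = counts == 0
    values = np.stack([mins, maxs], axis=-1)
    mask = np.stack([empty, empty], axis=-1)
    return np.ma.masked_array(values, mask=mask), counts
```

With this approach a uint8 bin that really contains a 0 stays unmasked, while an empty bin is masked, so the two cases can be told apart regardless of dtype.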
duplicate of https://github.com/vaexio/vaex/issues/1422
Thanks for looking into this! Your solution sounds reasonable to me.
Other improvements/feature requests in this respect may be:
- `limits='minmax'`, where start/stop is determined internally by running `minmax` on the full binby-expression. As this information does not propagate to the user, one would have to manually call `minmax` again for the binby-expression (or can both be done in a single iteration using `delay=True`?). It could be helpful to return an additional array with the bin edges, if desired.
- Extra bins left of `start` and right of `stop` to dump the min/max for data points outside of [start, stop].

I think both features have only a few use cases, but if they're easy to implement they could be worth doing.
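The two requests above can be sketched in plain NumPy as follows. This only illustrates the idea and does not use any vaex API: `bin_with_overflow` is a hypothetical helper, and treating values equal to `stop` as overflow is an assumption.

```python
import numpy as np

def bin_with_overflow(x, start, stop, n_bins):
    """Map x to n_bins regular bins on [start, stop) plus two extra bins.

    Index 0 collects x < start, indices 1..n_bins are the regular bins,
    and index n_bins + 1 collects x >= stop (assumption: stop is exclusive).
    Hypothetical helper for illustration; not part of vaex.
    """
    x = np.asarray(x, dtype=float)
    # Bin edges are returned to the caller so they need not be recomputed.
    edges = np.linspace(start, stop, n_bins + 1)
    # side="right": values below start give 0, values at/after stop give n_bins + 1.
    index = np.searchsorted(edges, x, side="right")
    return index, edges

index, edges = bin_with_overflow([-1, 0, 0.5, 1, 1.9, 2, 5], start=0, stop=2, n_bins=2)
print(index)  # bin index per data point, including the two overflow bins
print(edges)  # the bin edges that were used
```

The returned indices could then feed a per-bin min/max with `n_bins + 2` bins, so that out-of-range data ends up in the first and last bin instead of being dropped.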
Description
The output values for empty bins in calls to minmax are inconsistent across data types. They are also misleading, because they cannot be distinguished from non-empty bins that actually hold the same value. This is most critical for uint8/16/32 and int8/16, where 0 is returned for empty bins.
Example code:
Output:
Float types: For float types the minimum is +inf and the maximum is -inf. Maybe this should be the other way around? Either way, I'd find np.NaN more appropriate than -/+inf.
Integer types: Quite messy here. I understand that np.NaN is not available for integer types.
- For unsigned types there is an inconsistency: uint8/16/32 return 0, while uint64 returns RANGE_MAX.
- For signed types there is an inconsistency: int8/16 return 0, while int32/64 return RANGE_MIN.
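For comparison, a consistent choice of empty-bin values across dtypes could follow the "[max value for type, min value for type]" convention from the maintainer's reply. The sketch below uses only `np.iinfo` and `np.inf`; the helper name `minmax_identity` is hypothetical, and using +inf/-inf rather than the largest finite value for floats is an assumption based on the float behaviour reported above.

```python
import numpy as np

def minmax_identity(dtype):
    """Return (initial_min, initial_max) for a running min/max over dtype.

    Hypothetical helper for illustration; not part of vaex.
    """
    dtype = np.dtype(dtype)
    if np.issubdtype(dtype, np.integer):
        info = np.iinfo(dtype)
        # The running min starts at the largest value, the running max at the smallest.
        return info.max, info.min
    if np.issubdtype(dtype, np.floating):
        # Assumption: floats use +inf / -inf
        return np.inf, -np.inf
    raise TypeError(f"unsupported dtype: {dtype}")

for name in ("uint8", "uint64", "int16", "int64", "float32"):
    print(name, minmax_identity(name))
```

Under this convention every integer type would report [RANGE_MAX, RANGE_MIN] for an empty bin, although such a bin is still indistinguishable from one that genuinely holds those extremes unless counts or a mask are returned as well.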
General: From the result of minmax the user actually has no way to tell the difference between an empty bin and a bin truly having the "default empty bin return" value. Proposed ways to improve this are:
`None` (maybe the output format can be selected by function argument)

Software information
Vaex version (`import vaex; vaex.__version__`): {'vaex': '4.4.0', 'vaex-core': '4.4.0', 'vaex-viz': '0.5.0', 'vaex-hdf5': '0.9.0', 'vaex-server': '0.6.0', 'vaex-astro': '0.8.3', 'vaex-jupyter': '0.6.0', 'vaex-ml': '0.13.0'}