helmholtz-analytics / heat

Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
https://heat.readthedocs.io/
MIT License

Add clarification to documentation of unique() #564

Open Inzlinger opened 4 years ago

Inzlinger commented 4 years ago

Feature functionality: clarify in the documentation that the order of res is unstable, or document how it relates to split/axis. Not sure whether this is a bug or works as intended; the order of res differs depending on the split.

https://github.com/helmholtz-analytics/heat/blob/a1c3792e6e32b3bf7ca4953b6afbc1d53759a031/heat/core/manipulations.py#L1319

Additional context Example:

>>> import heat as ht
>>> a = ht.array([1, 2, 3])
>>> a
tensor([1, 2, 3])
>>> ht.unique(a)
tensor([3, 2, 1])
>>> b = ht.array([1, 2, 3], split=0)
>>> b
tensor([1, 2, 3])
>>> ht.unique(b)
tensor([1, 2, 3])

ClaudiaComito commented 4 years ago

Hi Jakob,

Nice catch! I don't think this was intended. What happens is that in the non-distributed case ht.unique(a) is equivalent to torch.unique(a, sorted=False), while in the distributed case we use a different algorithm. It gets even more interesting:

>>> a = torch.tensor([1, 0, 4, 8, 2, 2, 3, 0, 7, 9, 9])
>>> b = ht.array(a, split=0)
>>> c = ht.array(a)
>>> ht.unique(b)
tensor([1, 0, 4, 8, 3, 2, 7, 9])
>>> ht.unique(c) # same as torch.unique(a, sorted=False)
tensor([9, 7, 3, 2, 8, 4, 0, 1])

At the moment, the way to get consistent results is to always specify sorted=True.
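
Continuing the snippet above, a minimal illustration of the workaround (expected output shown under the assumption that sorted=True returns values in ascending order, as numpy and torch do):

>>> ht.unique(b, sorted=True)  # distributed tensor
tensor([0, 1, 2, 3, 4, 7, 8, 9])
>>> ht.unique(c, sorted=True)  # non-distributed tensor
tensor([0, 1, 2, 3, 4, 7, 8, 9])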

In fact, both numpy and torch return a sorted unique array/tensor by default, so the quick fix for consistency would be to set the default value of sorted to True for heat as well. I suppose it was set to False because, in the distributed case, sorting the unique tensor comes with significant overhead. Maybe @TheSlimvReal wants to chime in.
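
For reference, a quick check of the two libraries' defaults on the same data:

>>> import numpy as np, torch
>>> t = torch.tensor([1, 0, 4, 8, 2, 2, 3, 0, 7, 9, 9])
>>> torch.unique(t)       # torch defaults to sorted=True
tensor([0, 1, 2, 3, 4, 7, 8, 9])
>>> np.unique(t.numpy())  # numpy always returns sorted values
array([0, 1, 2, 3, 4, 7, 8, 9])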

ClaudiaComito commented 1 year ago

unique has been reworked and this issue will be fixed with #749. Sorry it's taking so long.


Reviewed within #1109