@Midnighter opened this issue 10 years ago
What is the numpy/scipy convention for `np.nan` meaning 'missing data'? I'd thought masked arrays were intended to be used for this instead.
`pdist` is definitely not `ma.array`-aware, and apart from passing a pre-masked array into a function, I'm not sure how many other parts of scipy are able to handle masked arrays.
The other scipy functions that treat `np.nan` as a missing value include `nan` in the function name, like `nanmean`, `nanmedian`, or `nanstd`. Maybe the new nan-aware pdist function could be `nanpdist` or something? Here's another discussion involving confusion caused by the various semantics of nan: https://github.com/scipy/scipy/issues/2868. Similarly, MATLAB seems to not always treat nan as a missing value but has a separate toolbox dedicated to this interpretation: http://pub.ist.ac.at/~schloegl/matlab/NaN/
@argriffing: There is no convention that `nan` means "missing", except for those functions explicitly written to ignore `nan` (e.g. `np.nansum`, `np.nanmax`, etc.).
I guess an option in `pdist`/`cdist` to ignore points containing `nan` would be OK (I don't have a strong opinion either way), but I don't think it should ignore them by default.
I would definitely go with a corresponding function `nanpdist` (wouldn't it be more PEP 8 compliant to say `nan_mean`, `nan_pdist`, etc.?). So I take it there is an interest in having `nanpdist` and `nancdist` live within scipy.spatial?
> So I take it there is an interest in having `nanpdist` and `nancdist` live within scipy.spatial?
Yes, that notation would seem to most closely match the current scipy conventions. By the way, I suspect that whatever succeeds numpy will have built-in NA for missing values, distinct from NaN, and also covering most use cases for masked arrays.
> So I take it there is an interest in having `nanpdist` and `nancdist` live within scipy.spatial?
Bring it up on the mailing list for discussion. I don't like the idea of a proliferation of `nan*` functions. Where does it lead? `scipy.signal.nan_decimate`, `scipy.stats.nan_probplot`, etc.?
> Yes, that notation would seem to most closely match the current scipy conventions. By the way, I suspect that whatever succeeds numpy will have built-in NA for missing values, distinct from NaN, and also covering most use cases for masked arrays.
And that is to come soon™? :stuck_out_tongue:
> Bring it up on the mailing list for discussion. I don't like the idea of a proliferation of `nan*` functions. Where does it lead? `scipy.signal.nan_decimate`, `scipy.stats.nan_probplot`, etc.?
Sure thing. Any particular title under which you would like to have this discussion? I mean, do you want a general discussion of missing values in scipy or one specifically about `pdist`?
> And that is to come soon™?
It probably already exists, in the same way that numarray existed alongside Numeric for a while.
Re: mailing list subject: focus on the immediate issue, an API for `pdist` functionality that ignores points containing `nan`. Past history shows that a general discussion of missing values will go on for days (weeks?), and probably won't resolve anything.
I see three obvious options, among them:

- a new function `nanpdist`;
- an option to `pdist` that determines how `nan` should be treated.

OK, will do. With regard to your first point: I don't see how that is possible. `pdist` expects a matrix where rows are the vectors compared in pairs. So an element of any of the vectors might be missing and should be ignored when compared to an element of another vector at the same position, but by no means should the whole column be removed from the matrix.
I don't use scipy distance functions, so I don't really know what I'm talking about, but does it really make sense to just leave missing values out of the distance calculation? Would this not bias towards smaller distances for observations with more missing values? I know you said that only a small number of entries are missing, but, for example, if all entries were missing for an observation, the hypothetical `nanpdist` would compute its distance to all other vectors as zero? Are there not other imputation schemes, like using the average value, that would make more sense?
Imputing the missing values would, of course, be possible. The way to deal with the distance is to apply a simple factor to the computed distance: if `n` is the number of elements and `i` the number of element pairs ignored due to missing values, the distance is multiplied by `n / (n - i)`, which should scale it up adequately.
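In code, the idea might look roughly like this for the Euclidean case (a sketch only; `nan_euclidean` is a hypothetical helper, not scipy API):

```python
import numpy as np

def nan_euclidean(u, v):
    # Sketch of the scheme above: ignore element pairs where either
    # value is nan, then scale the result by n / (n - i).
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    ok = ~np.isnan(u) & ~np.isnan(v)   # element pairs present in both vectors
    n = u.size
    i = n - np.count_nonzero(ok)       # number of ignored pairs
    d = np.sqrt(np.sum((u[ok] - v[ok]) ** 2))
    return d * n / (n - i)             # divides by zero if every pair is ignored
```

(One could also apply the factor to the squared sum before taking the root; the version above follows the comment literally.)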
> Bring it up on the mailing list for discussion.
Here's @Midnighter's discussion link http://mail.scipy.org/pipermail/scipy-dev/2014-August/020001.html
> The way to deal with the distance is to apply a simple factor to the computed distance.
I could see how that makes sense (except that it divides by zero when a vector pair cannot agree on any subset of dimensions), but the same concern was raised separately on the mailing list, so this would need to go in the docs. Are you making a PR?
> except that it divides by zero when a vector pair cannot agree on any subset of dimensions
Jaime on the list suggested passing in a mask to the `where` argument, so

```python
if where.all():  # or the negation thereof, not sure about the definition right now
    return np.nan
```

easily circumvents a `ZeroDivisionError` and saves computation time.
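Folded into the earlier sketch, the guard could look like this (again hypothetical, with `where` as the mask of element pairs valid in both vectors):

```python
import numpy as np

def nan_euclidean_guarded(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    where = ~np.isnan(u) & ~np.isnan(v)  # element pairs valid in both vectors
    if not where.any():                  # no dimension is valid in both
        return np.nan                    # avoids the division by zero below
    d = np.sqrt(np.sum((u[where] - v[where]) ** 2))
    return d * u.size / np.count_nonzero(where)
```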
As for a PR, it's not clear to me yet whether this is desired. I'll see if more answers pop up on the mailing list over the weekend.
Is there any update on this issue? I have a dataset that contains NaNs in each row. Calculating the distances via `pdist` ends up all NaNs. Also, what is the advantage of using `pdist` over the other explicit distance calculation functions under `scipy.spatial.distance`? Does `pdist` provide any auto-parallelization?
I have some code lying around where I implemented the Euclidean norm with missing values in C/Cython/Python. The thread on the mailing list died, so I never bothered to make a PR, but if you're interested I can put what I have on GitHub.
That would be great. Dealing with NaNs always seems to be a problem in `scipy.spatial.distance`. I encountered a similar issue with KDTree: http://stackoverflow.com/questions/36585998/kdtree-with-masked-arrays
To your question about `pdist`: it is simply a convenient entry point for all of the distance measures.
So I have a notebook where I played with a few different versions here. I also found an old package structure which I can make available tomorrow or later tonight. That uses actual C code and takes the bottleneck approach to identify `nan`s. I also remember thinking about using numpy's ufunc system; I didn't make any progress on that end, though.
@gsever: Sorry for the delay. I have put everything I had into a semi-sensible library package. I still need to set up some configuration for it (Travis, etc.) but you can start looking at the code. The functions can still be further optimized but I think it makes more sense to think about how the ufunc system can be leveraged such that all the different metrics can be implemented with ease.
Using `_chk_weights` with `nan_screen=True` from https://github.com/scipy/scipy/pull/6931, and then passing the resultant weight vectors (which would be 0 for each `nan`) to the underlying functions, would fix this.
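A rough illustration of that idea using the public `w` argument rather than the private `_chk_weights` helper (assumes a scipy version whose distance functions accept weights): give weight 0 to every position where either vector is `nan`, and zero out the `nan`s so they cannot propagate:

```python
import numpy as np
from scipy.spatial.distance import euclidean

def nan_screened_euclidean(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    w = (~np.isnan(u) & ~np.isnan(v)).astype(float)  # weight 0 for each nan
    # nan_to_num replaces nan with 0; otherwise 0 * nan would still be nan
    return euclidean(np.nan_to_num(u), np.nan_to_num(v), w=w)
```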
@quantology, could you please be more specific about using `_chk_weights` with `nan_screen=True`? For instance, I have searched scipy.spatial.distance.pdist and guess they might be in `**kwargs`? But I cannot go a step further, as I only found the `w` kwarg, like in scipy.spatial.distance.correlation.

Anyway, I suppose defining a lambda function can also solve the missing data problem for some distance metrics, say the following.
```python
from scipy.spatial import distance
import numpy as np

coords = [(35.0456, np.nan),
          (35.1174, -89.9711),
          (35.9728, -83.9422),
          (36.1667, -86.7833)]

# Treat a nan difference as zero via nansum; note this biases toward
# smaller distances for rows with more missing entries.
distance.cdist(coords, coords, lambda u, v: np.sqrt(np.nansum((u - v) ** 2)))

# Or, equivalently, restrict both vectors to the dimensions where neither is nan.
distance.cdist(coords, coords,
               lambda u, v: np.sqrt(np.sum((u[~np.isnan(u) & ~np.isnan(v)]
                                            - v[~np.isnan(u) & ~np.isnan(v)]) ** 2)))
```
At the moment `pdist` returns a distance matrix with a `nan` entry whenever a vector with any `nan` element is part of the respective pair. This is consistent with, for example, the R `dist` function, as well as MATLAB, I believe.

In my case, and I should think a few others' as well, there are very few `nan`s in a high-dimensional space. I would thus only ignore the corresponding pairs of elements of the vectors. I have solutions for this problem both in Cython and directly in C, which mirror the current scipy code. Is there a general interest in having this in scipy?

If the answer is yes, I will submit a pull request, and then we just have to discuss the use of Cython or C or a mix of both.
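For reference, a minimal example of the current behavior described above (plain `pdist`, nothing modified):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0, np.nan],
              [1.0, 2.5, 3.0],
              [0.0, 2.0, 3.5]])

# Every pair involving the first row comes out as nan.
print(squareform(pdist(X)))
```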