scipy / scipy

SciPy library main repository
https://scipy.org
BSD 3-Clause "New" or "Revised" License

pdist/cdist with missing values #3870

Open Midnighter opened 10 years ago

Midnighter commented 10 years ago

At the moment pdist returns a distance matrix with a nan-entry whenever a vector with any nan-element is part of the respective pair. This is consistent with, for example, the R dist function, as well as MATLAB, I believe.
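For concreteness, the current behavior looks like this (plain scipy, nothing hypothetical):

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, 5.0]])

print(squareform(pdist(X)))
# Every distance involving row 1 is NaN because that vector has a NaN element.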

In my case, and I should think in a few others' as well, there are very few nans in a high-dimensional space. I would thus prefer to ignore only the corresponding pairs of elements of the vectors. I have solutions for this problem both in Cython and directly in C, mirroring the current scipy code. Is there general interest in having this in scipy?

If the answer is yes, I will submit a pull request and then we just have to discuss the use of Cython or C or a mix of both.

argriffing commented 10 years ago

What is the numpy/scipy convention for np.nan meaning 'missing data'? I'd thought masked arrays were intended to be used for this instead.

Midnighter commented 10 years ago

pdist is definitely not ma.array aware, and apart from passing a pre-masked array into a function, I'm not sure how many other parts of scipy can handle masked arrays.

argriffing commented 10 years ago

The other scipy functions that treat np.nan as a missing value include nan in the function name, like nanmean or nanmedian or nanstd. Maybe the new nan-aware pdist function could be nanpdist or something? Here's another discussion https://github.com/scipy/scipy/issues/2868 involving confusion caused by the various semantics of nan. Similarly, MATLAB seems to not always treat nan as a missing value but has a separate toolbox http://pub.ist.ac.at/~schloegl/matlab/NaN/ dedicated to this interpretation.

WarrenWeckesser commented 10 years ago

@argriffing: There is no convention that nan means "missing", except for those functions explicitly written to ignore nan (e.g. np.nansum, np.nanmax, etc.).

I guess an option in pdist/cdist to ignore points containing nan would be OK (I don't have a strong opinion either way), but I don't think it should ignore them by default.

Midnighter commented 10 years ago

I would definitely go with a corresponding function nanpdist (though wouldn't it be more PEP 8 compliant to say nan_mean, nan_pdist, etc.?). So I take it there is interest in having nanpdist and nancdist live within scipy.spatial?

argriffing commented 10 years ago

So I take it there is interest in having nanpdist and nancdist live within scipy.spatial?

Yes, that notation would seem to most closely match the current scipy conventions. By the way, I suspect that whatever succeeds numpy will have a built-in NA for missing values, distinct from NaN, that also covers most use cases for masked arrays.

WarrenWeckesser commented 10 years ago

So I take it there is interest in having nanpdist and nancdist live within scipy.spatial?

Bring it up on the mailing list for discussion. I don't like the idea of a proliferation of nan* functions. Where does it lead? scipy.signal.nan_decimate, scipy.stats.nan_probplot, etc.?

Midnighter commented 10 years ago

Yes, that notation would seem to most closely match the current scipy conventions. By the way, I suspect that whatever succeeds numpy will have a built-in NA for missing values, distinct from NaN, that also covers most use cases for masked arrays.

And that is to come soon(:tm:)? :stuck_out_tongue:

Bring it up on the mailing list for discussion. I don't like the idea of a proliferation of nan* functions. Where does it lead? scipy.signal.nan_decimate, scipy.stats.nan_probplot, etc.?

Sure thing. Any particular title under which you would like to have this discussion? I mean, do you want a general discussion of missing values in scipy, or one specifically about pdist?

argriffing commented 10 years ago

And that is to come soon(:tm:)?

It probably already exists, in the same way that numarray existed alongside numeric for a while.

WarrenWeckesser commented 10 years ago

Re: the mailing list subject: focus on the immediate issue, an API for pdist functionality that ignores points containing nan. Past history shows that a general discussion of missing values will go on for days (weeks?) and probably won't resolve anything.

I see three obvious options:

  1. Don't change anything--the users should clean up their data!
  2. nanpdist
  3. Add a keyword argument to pdist that determines how nan should be treated (a possible shape is sketched below).
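A rough sketch of what option 3 might look like as a thin wrapper, borrowing the nan_policy naming from scipy.stats; pdist_nan and its keyword are hypothetical, not actual scipy API:

import numpy as np
from scipy.spatial.distance import pdist

def pdist_nan(X, nan_policy='propagate'):
    # Hypothetical option-3 wrapper (Euclidean only); not actual scipy API.
    if nan_policy == 'propagate':
        return pdist(X)  # current behavior: NaN poisons every pair it touches
    if nan_policy != 'omit':
        raise ValueError("nan_policy must be 'propagate' or 'omit'")
    def nan_omit_euclidean(u, v):
        valid = ~np.isnan(u) & ~np.isnan(v)
        return np.sqrt(np.sum((u[valid] - v[valid]) ** 2))
    return pdist(X, nan_omit_euclidean)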

Midnighter commented 10 years ago

OK, will do. With regard to your first option: I don't see how that is possible. pdist expects a matrix whose rows are the vectors to be compared in pairs. An element of any of the vectors might be missing and should be ignored when compared with the element at the same position in another vector, but by no means should the whole column be removed from the matrix.

argriffing commented 10 years ago

I don't use scipy distance functions so I don't really know what I'm talking about, but does it really make sense to just leave out missing values from the distance calculation? Would this not bias towards smaller distances for observations with more missing values? I know that you said that only a small number of entries are missing, but for example if all entries were missing for an observation, the hypothetical nanpdist would compute its distance to all other vectors as zero? Are there not other imputation schemes, like using the average value, that would make more sense?

Midnighter commented 10 years ago

Imputing the missing values would, of course, be possible. The way to deal with the distance is to apply a simple correction factor to the computed distance: if n is the number of elements and i the number of pairs ignored due to missing values, the distance is multiplied by n / (n - i), which should scale it up adequately.
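As a sketch, this could look like the following metric callable for pdist (Euclidean assumed; the name is made up, and note that it divides by zero when two vectors share no valid dimensions, an edge case picked up just below):

import numpy as np

def nan_euclidean_rescaled(u, v):
    # Ignore coordinate pairs where either value is NaN, then rescale the
    # result by n / (n - i) to compensate for the i dropped pairs.
    valid = ~np.isnan(u) & ~np.isnan(v)
    n = len(u)
    i = n - valid.sum()
    return np.sqrt(np.sum((u[valid] - v[valid]) ** 2)) * n / (n - i)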

argriffing commented 10 years ago

Bring it up on the mailing list for discussion.

Here's @Midnighter's discussion link http://mail.scipy.org/pipermail/scipy-dev/2014-August/020001.html

argriffing commented 10 years ago

The way to deal with the distance is to apply a simple correction factor to the computed distance.

I could see how that makes sense (except that it divides by zero when a vector pair cannot agree on any subset of dimensions), but the same concern was raised separately on the mailing list, so this would need to go in the docs. Are you making a PR?

Midnighter commented 10 years ago

except that it divides by zero when a vector pair cannot agree on any subset of dimensions

Jaime on the list suggested passing a mask to the where argument, so

if not where.any():  # no element pairs are comparable (taking where to mark entries to include)
    return np.nan

easily circumvents a ZeroDivisionError and saves computation time.
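Fleshed out, the masked metric could look something like this (a sketch only; euclidean_where is a made-up name, and the where keyword of np.sum requires numpy >= 1.17):

import numpy as np

def euclidean_where(u, v, where):
    # where: boolean mask marking the element pairs to include
    if not where.any():
        return np.nan  # no comparable dimensions; also dodges the zero division
    d2 = np.sum((u - v) ** 2, where=where)  # NaN positions are simply skipped
    return np.sqrt(d2) * len(u) / where.sum()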

As for a PR, it's not clear to me yet whether this is desired. I'll see if more answers pop up on the mailing list over the weekend.

gsever commented 8 years ago

Is there any update on this issue? I have a dataset that contains NaNs in each row, and calculating the distances via pdist ends up all NaN. Also, what is the advantage of using pdist over the other explicit distance calculation functions under scipy.distance? Is pdist providing any auto-parallelization?

Midnighter commented 8 years ago

I have some code lying around where I implemented the Euclidean norm with missing values in C/Cython/Python. The thread on the mailing list died, so I never bothered to make a PR, but if you're interested I can put what I have on GitHub.

gsever commented 8 years ago

That would be great. Dealing with NaNs always seems to be a problem in scipy.distance. I encountered a similar issue with KDTree: http://stackoverflow.com/questions/36585998/kdtree-with-masked-arrays

Midnighter commented 8 years ago

To answer your question about pdist: it is simply a convenient entry point for all of the distance measures.

I have a notebook where I played with a few different versions here. I also found an old package structure, which I can make available tomorrow or later tonight. That one uses actual C code and takes the bottleneck approach to identifying nans. I also remember thinking about using numpy's ufunc system, but I didn't make any progress on that end.

Midnighter commented 8 years ago

@gsever: Sorry for the delay. I have put everything I had into a semi-sensible library package. I still need to set up some configuration for it (Travis, etc.) but you can start looking at the code. The functions can still be further optimized but I think it makes more sense to think about how the ufunc system can be leveraged such that all the different metrics can be implemented with ease.

quantology commented 7 years ago

Using _chk_weights with nan_screen=True from https://github.com/scipy/scipy/pull/6931, and then passing the resultant weight vectors (which would be 0 for each nan) to the underlying functions would fix this.
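A rough illustration of that idea using the public w keyword (hand-rolled here, since _chk_weights is private; this only mimics what the screening would produce):

import numpy as np
from scipy.spatial import distance

u = np.array([1.0, np.nan, 3.0])
v = np.array([2.0, 5.0, np.nan])

# Zero weight wherever either vector has a NaN, emulating the nan screening;
# nan_to_num clears the NaNs so that 0 * NaN cannot poison the weighted sum.
w = (~np.isnan(u) & ~np.isnan(v)).astype(float)
d = distance.euclidean(np.nan_to_num(u), np.nan_to_num(v), w=w)  # -> 1.0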

FeiYao-Edinburgh commented 4 years ago

@quantology, could you please be more specific about using _chk_weights with nan_screen=True? For instance, I have searched scipy.spatial.distance.pdist and guess it might be hidden in **kwargs, but I cannot get any further, as I only found the w kwarg, as in scipy.spatial.distance.correlation.

Anyway, I suppose defining a lambda function can also solve the missing-data problem for some distance metrics, say the following.

from scipy.spatial import distance
import numpy as np
coords = [(35.0456, np.nan),
          (35.1174, -89.9711),
          (35.9728, -83.9422),
          (36.1667, -86.7833)]
distance.cdist(coords, coords, lambda u, v: np.sqrt(np.nansum((u - v) ** 2)))
# or, restricting the comparison to dimensions where both vectors are valid:
distance.cdist(coords, coords,
               lambda u, v: np.sqrt(np.sum((u - v)[~np.isnan(u) & ~np.isnan(v)] ** 2)))