pmelchior / pygmmis

Gaussian mixture model for incomplete (missing or truncated) and noisy data
MIT License

Handling Missing Data #3

Closed: Omarito2412 closed this issue 7 years ago

Omarito2412 commented 7 years ago

Hello Peter,

I was wondering if I could use Pygmmis to handle missing data in a dataset. For example, I've constructed a dummy dataset that has a few missing features, and I'm trying to use Pygmmis to estimate them, but I can't figure out how.

```
array([[  1.,   5.],
       [  2.,   5.],
       [  3.,   5.],
       [  4.,   5.],
       [  5.,   5.],
       [  6.,  nan],
       [ nan,   5.]], dtype=float32)
```

I'm trying to follow the example in the README or the test script.

Thank you.

pmelchior commented 7 years ago

Missing data are less ill-defined than truncated data, but both can be solved by replacing what isn't observed with a prediction from the current model. So, in principle, this is doable. I haven't coded it up yet, though. If you need this functionality, I suggest you try Jo Bovy's Extreme Deconvolution, where that's an option: https://github.com/jobovy/extreme-deconvolution

Omarito2412 commented 7 years ago

Thanks Peter, I appreciate your help.

pmelchior commented 7 years ago

Missing data is an important case, so I looked into it again; it is not hard to extend the current code to cover it.

There is one problem, though: if there are missing and truncated data, the user has to specify a missingness mechanism for the imputation data! In other words, the data created internally to treat truncation have to exhibit the same missingness as the observed data.

This is exactly equivalent to the effect of noise: the imputation data need to be as noisy as the observed data (this is what the covar_callback argument provides).

In the short run, we could raise a NotImplementedError if data are missing and truncated, but the better option is to use the same callback mechanism to create missingness for the imputation sample.
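As a concrete sketch of that noise analogy (the dispersion value, the helper names, and the callback's exact signature are assumptions for illustration, not prescribed here): a callback that hands the imputation samples the same covariance as the observed data could look roughly like this.

```python
import numpy as np

D = 2            # dimensionality of the data
sigma = 0.1      # assumed noise dispersion of the observed samples

def covar_cb(coords):
    # one D x D covariance per imputation sample, mimicking the
    # noise of the observed data
    return np.tile(sigma**2 * np.eye(D), (len(coords), 1, 1))

# would then be passed to the fitter alongside a (hypothetical) selection
# callback sel_cb for the truncation, e.g.
# pygmmis.fit(gmm, data, covar=covar, sel_callback=sel_cb, covar_callback=covar_cb)
```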

pmelchior commented 7 years ago

@Omarito2412 I realized that there is a simple solution for your particular request: create a covariance matrix for the data and set the elements for the missing features to a very large number. That effectively sets those features' weights to zero. I've included this in the pygmmis code so you should be able to run your example data without any change to data or API call.

While I was at it, I also implemented the functionality for arbitrary rotation matrices R, so all of the functionality of Extreme Deconvolution is now in pygmmis.
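A minimal sketch of that large-variance trick, applied by hand to the toy data from the first post (the component count, the assumed 0.01 noise variance, the 1e10 placeholder, and the regularization value are illustrative choices, not prescribed by this thread):

```python
import numpy as np
import pygmmis

# toy data from the first post; NaN marks a missing feature
data = np.array([[1., 5.], [2., 5.], [3., 5.], [4., 5.],
                 [5., 5.], [6., np.nan], [np.nan, 5.]])
N, D = data.shape
missing = np.isnan(data)

# one D x D error covariance per sample: a small assumed variance (0.01)
# for observed features, a huge one for missing features
covar = np.tile(0.01 * np.eye(D), (N, 1, 1))
rows, cols = np.where(missing)
covar[rows, cols, cols] = 1e10

# give the NaN entries a finite placeholder; with their variance blown up,
# they carry essentially no weight in the fit
data = np.where(missing, 0.0, data)

gmm = pygmmis.GMM(K=1, D=D)     # one component suffices for this toy set
logL, U = pygmmis.fit(gmm, data, covar=covar, w=1e-3)  # small covariance regularization
print(gmm.mean, gmm.covar)
```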

Omarito2412 commented 7 years ago

@pmelchior Thanks for your help Peter!

Ravisik commented 3 years ago

Hi,

I just found this topic, which is very interesting. I am currently trying to impute missing values using a GMM, and I'm not sure I understand your reply:

> @Omarito2412 I realized that there is a simple solution for your particular request: create a covariance matrix for the data and set the elements for the missing features to a very large number. That effectively sets those features' weights to zero. I've included this in the pygmmis code so you should be able to run your example data without any change to data or API call.
>
> While I was at it, I also implemented the functionality for arbitrary rotation matrices R, so all of the functionality of Extreme Deconvolution is now in pygmmis.

Could you provide more information about imputing the feature values?

For now, one workaround that seems to work (but is not really optimized): let's say you have fitted your GMM and want to impute the instance A = [0.5, nan]. You can create a test matrix [[0.5, 0], [0.5, 0.1], ..., [0.5, 1.]] and find where the probability is highest, for instance A = [0.5, 0.4].

I imagine there is a better way to do it?

Thanks a lot in advance!

pmelchior commented 3 years ago

The treatment I proposed above is the following. Specify the covariance matrices for the error in all of your samples. If you don't have errors, simply set them to a constant diagonal, e.g. covar=np.eye(2). For your instance A=(0.5, nan), use this covariance matrix instead: covar = ((1,0), (0, 1e10)) or even covar = ((1,0), (0, np.inf)). This way you tell the fitter that you have no information for the missing value.
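For the imputation step Ravisik asked about, one option once the GMM has been fitted this way is to replace each NaN with the conditional expectation of the missing features given the observed ones. This is only a sketch using the standard Gaussian conditioning formula, not a pygmmis API call; it assumes the fitted parameters are available as gmm.amp, gmm.mean, and gmm.covar.

```python
import numpy as np

def impute(gmm, x):
    """Replace NaNs in x with the conditional mean of the fitted mixture.

    Sketch only: applies the standard Gaussian conditioning formula to the
    fitted parameters gmm.amp, gmm.mean, gmm.covar.
    """
    x = np.array(x, dtype=float)
    m = np.isnan(x)     # missing features
    o = ~m              # observed features
    if not m.any():
        return x

    cond_means, logw = [], []
    for k in range(len(gmm.amp)):
        mu, S = gmm.mean[k], gmm.covar[k]
        S_oo = S[np.ix_(o, o)]
        S_mo = S[np.ix_(m, o)]
        diff = x[o] - mu[o]
        # conditional mean of the missing block given the observed block
        cond_means.append(mu[m] + S_mo @ np.linalg.solve(S_oo, diff))
        # log responsibility of component k given the observed features
        _, logdet = np.linalg.slogdet(2 * np.pi * S_oo)
        logw.append(np.log(gmm.amp[k]) - 0.5 * (diff @ np.linalg.solve(S_oo, diff) + logdet))

    w = np.exp(np.array(logw) - np.max(logw))
    w /= w.sum()
    x[m] = np.sum(w[:, None] * np.array(cond_means), axis=0)
    return x

# e.g. impute(gmm, [0.5, np.nan]) returns [0.5, <conditional mean of the second feature>]
```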