Missing value imputation

peterwittek / somoclu

Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters

https://peterwittek.github.io/somoclu/

MIT License


Missing value imputation #77

Closed Sean26 closed 7 years ago

Sean26 commented 7 years ago

Hi, I have a dataset with complete and missing data and want to use a SOM to estimate the missing values. In theory, I know how to do this, i.e., train the SOM on available data, then find BMUs for the missing datapoints and use the corresponding features to estimate the missing values (please correct me if I got that wrong). However, I'm pretty much a noob at this and have no idea how to best implement this with Somoclu in Python. Any help would be really appreciated: e.g., how does Somoclu deal with NaNs (will it use what is there or ignore that input vector entirely), and how do I use the bmus to find the corresponding data? Thanks a million!

(PS: Because I won't have a chance to say this otherwise, many thanks Peter Wittek for Somoclu.)

peterwittek commented 7 years ago

NaNs are not dealt with at all. I would replace the missing values with zeros before feeding the data to the network, although that is certainly not ideal. Recall that the default random initialization fills the codebook with random values between 0 and 1, so unless you use a different initialization, scale your data accordingly.
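A minimal sketch of that preprocessing, assuming the data sits in a 2D NumPy array with NaN marking the missing entries. The helper name scale_and_fill and the toy array are made up for illustration; nothing here is part of Somoclu.

```python
import numpy as np

def scale_and_fill(data, fill_value=0.0):
    """Min-max scale each column to [0, 1] ignoring NaNs, then replace NaNs."""
    data = np.asarray(data, dtype=float)
    col_min = np.nanmin(data, axis=0)
    col_max = np.nanmax(data, axis=0)
    # Constant columns would give a zero range; divide by 1 instead.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    scaled = (data - col_min) / span
    # NaNs survive the scaling, so they can be replaced afterwards.
    return np.where(np.isnan(scaled), fill_value, scaled)

# Toy data: the second row is missing its second feature.
raw = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
prepared = scale_and_fill(raw)
```

Scaling first and filling second keeps the placeholder zeros from distorting the column minimum.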

I am not sure which interface you are using, but for instance in Python, if your trained map is in an object som, you could get the BMUs for your new vector x with this:

som.get_bmus(som.get_surface_state([x]))

Sean26 commented 7 years ago

Many thanks for your reply. With regard to the NaNs, I meant, if there are NaNs in the input vector, will the SOM still use the remaining values and do whatever it can with those or will it discard the entire input vector? And good point about the scaling.

I'm using Python, so your line of code was really useful. If I now have the BMU for my new vector, what would be the best way to get all other data vectors that also correspond to this BMU? Sorry if that is something really obvious.

peterwittek commented 7 years ago

Somoclu does not discard anything, but the result might be garbage (the design is garbage in, garbage out). So the onus is on you to replace NaNs with a number.

The array som.bmus contains the BMU for each instance. So once you have identified the BMU you are interested in, scan this array to find the matching training instances (if any).
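The scan could look roughly like this. The bmus array below is a hand-written stand-in for som.bmus, assumed to hold one pair of map coordinates per training instance; check the layout against your Somoclu version.

```python
import numpy as np

# Stand-in for som.bmus: one pair of map coordinates per training instance.
bmus = np.array([[0, 1], [2, 3], [0, 1], [4, 0]])

# The BMU you are interested in, e.g. the one found for the new vector.
target_bmu = np.array([0, 1])

# Indices of the training instances that were mapped to the same node.
matches = np.flatnonzero((bmus == target_bmu).all(axis=1))
```

Indexing the training data with matches then gives the instances that share the node. The result is empty if no training instance landed there.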

Sean26 commented 7 years ago

Thank you, I'll give that a try.

Yeah, I'm aware that this could go wrong, and we could be left with a lot of garbage, but I guess it's worth a try. There are quite a few papers out there where people successfully used SOMs to impute missing values.

peterwittek commented 7 years ago

I am not sure how they work, but this is what I would do:

1. Train an emergent SOM (ESOM), that is, a map with more nodes than data instances (#nodes > |dataset|).
2. Say your new vector with the missing values is x. Replace the missing values with the average value for that dimension. Call this vector x'. The reason for averaging is that the average value is likely to be attracted approximately uniformly by all nodes. This way you only get relevant proximity information based on the non-missing values.
3. Get the BMU for x'. Use the entries of the weight vector of this BMU to fill in the missing values in x.
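The fill-in steps above can be sketched in plain NumPy. This assumes the trained codebook is an array of shape (rows, columns, features), which is how som.codebook is understood here, and it finds the BMU by Euclidean distance. The function name impute_from_bmu and the tiny hand-made codebook are illustrative only, and training the map itself is left out.

```python
import numpy as np

def impute_from_bmu(x, codebook, column_means):
    """Fill NaNs in x with the matching entries of its BMU's weight vector."""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    # Build x': the column mean wherever x has a missing value.
    x_prime = np.where(missing, column_means, x)
    # Flatten the map so that each row is one node's weight vector.
    weights = codebook.reshape(-1, codebook.shape[-1])
    # The BMU is the node whose weight vector is closest to x'.
    bmu = np.argmin(np.linalg.norm(weights - x_prime, axis=1))
    # Keep the observed values and take the rest from the BMU.
    return np.where(missing, weights[bmu], x)

# Hypothetical 2x2 map over 3 features; a trained map would supply this.
codebook = np.array([[[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]],
                     [[0.5, 0.5, 0.5], [0.2, 0.9, 0.1]]])
column_means = np.array([0.4, 0.6, 0.4])  # per-feature means of the training data
x = np.array([0.85, np.nan, 0.75])
filled = impute_from_bmu(x, codebook, column_means)
```

Only the missing entries are overwritten, so the observed values of x pass through unchanged.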

In other words, I would not use SOM as an associative memory, but rather as a kind of generative model.

Sean26 commented 7 years ago

Your approach gave me quite a bit to ponder, as it is quite different from the way I was thinking of using the SOMs. I would have used a SOM with #nodes<|dataset|, so I could average the results per node to estimate the missing values. Actually, it might be interesting to try both approaches. (If I can figure out how to, as I haven't been able to get back to this over the last few days.)
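For comparison, the per-node averaging idea might be sketched as follows. It assumes the complete training rows and their BMUs are available as NumPy arrays; impute_from_node_members is a made-up helper, and train_bmus is a stand-in for what som.bmus would hold.

```python
import numpy as np

def impute_from_node_members(x, target_bmu, train, train_bmus):
    """Fill NaNs in x with the mean of the training rows sharing its BMU."""
    x = np.asarray(x, dtype=float)
    # Training rows that were mapped to the same node as x.
    members = train[(train_bmus == target_bmu).all(axis=1)]
    if len(members) == 0:
        return x  # empty node: nothing to average, leave the gaps
    return np.where(np.isnan(x), members.mean(axis=0), x)

# Toy data: the first two rows share node (0, 0).
train = np.array([[0.1, 0.2], [0.3, 0.4], [0.9, 0.8]])
train_bmus = np.array([[0, 0], [0, 0], [1, 1]])
x = np.array([0.2, np.nan])
filled = impute_from_node_members(x, np.array([0, 0]), train, train_bmus)
```

With fewer nodes than instances, most nodes collect several training rows, which is what makes the per-node mean usable.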

peterwittek commented 7 years ago

At the end of the day, both are just forms of averaging. With an ESOM, you get a kind of nonlinear interpolation between data instances, so your results might improve.