Closed Sean26 closed 7 years ago
NaNs are not dealt with at all. I would replace the missing values with zeros before feeding them to a network, although it is certainly not ideal. Recall that the default random initialization fills the codebook with random values between 0 and 1, so unless you use a different initialization, scale your data accordingly.
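A minimal sketch of that preprocessing with NumPy, assuming the data is a 2D array with one instance per row. The helper name prepare_for_som is made up for illustration and is not part of Somoclu. It min-max scales each column to [0, 1] while ignoring NaNs, then replaces the NaNs with zeros as suggested above.

```python
import numpy as np

def prepare_for_som(data):
    """Min-max scale each column to [0, 1] ignoring NaNs, then set NaNs to 0."""
    data = np.asarray(data, dtype=np.float32)
    col_min = np.nanmin(data, axis=0)
    col_max = np.nanmax(data, axis=0)
    # Avoid division by zero for constant columns.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    scaled = (data - col_min) / span
    # Replace missing values with zeros (not ideal, as noted above).
    scaled[np.isnan(scaled)] = 0.0
    return scaled.astype(np.float32)

raw = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
clean = prepare_for_som(raw)
```

Note that after scaling, zero is the column minimum, so a zero-filled entry looks like a genuinely small value to the map. Filling with the column mean instead would be an equally simple alternative.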
I am not sure which interface you are using, but in Python, for instance, if your trained map is in an object som, you could get the BMUs for your new vector x with this:
som.get_bmus(som.get_surface_state([x]))
Many thanks for your reply. With regard to the NaNs, I meant, if there are NaNs in the input vector, will the SOM still use the remaining values and do whatever it can with those or will it discard the entire input vector? And good point about the scaling.
I'm using Python, so your line of code was really useful. Now that I have the BMUs for my new vector, what would be the best way to get all other data vectors that also correspond to this BMU? Sorry if that is something really obvious.
Somoclu does not discard anything, but the result might be garbage (the design is garbage in, garbage out). So the onus is on you to replace NaNs with a number.
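If you would rather have the BMU search ignore the missing components instead of filling them in, that has to be done by hand outside Somoclu. A minimal NumPy sketch, assuming a codebook array of shape (n_rows, n_columns, n_dim), which is how I understand som.codebook to be laid out. The helper masked_bmu is hypothetical, not a Somoclu function.

```python
import numpy as np

def masked_bmu(codebook, x):
    """Return (row, col) of the node closest to x, using only non-NaN components."""
    codebook = np.asarray(codebook, dtype=float)
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)
    if not observed.any():
        raise ValueError("x has no observed components")
    n_rows, n_cols, _ = codebook.shape
    # Euclidean distance restricted to the observed dimensions.
    diff = codebook[:, :, observed] - x[observed]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    row, col = np.unravel_index(dist.argmin(), (n_rows, n_cols))
    return int(row), int(col)

# Toy 2x2 map with 3-dimensional codebook vectors (stand-in for som.codebook).
codebook = np.array([[[0.0, 0.0, 0.0], [0.0, 1.0, 0.5]],
                     [[1.0, 0.0, 0.2], [1.0, 1.0, 0.9]]])
x = np.array([0.9, 0.8, np.nan])
row, col = masked_bmu(codebook, x)
estimate = codebook[row, col, 2]  # codebook value for the missing third component
```

This only covers lookup on an already trained map. Training itself would still need NaN-free input.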
The array som.bmus contains the BMU for each instance. So once you have identified the BMU you are interested in, scan this array to find the matching training instances (if any).
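A short sketch of that scan, using a small stand-in array in place of som.bmus (one coordinate pair per training instance). The helper name instances_at_bmu is made up for illustration.

```python
import numpy as np

def instances_at_bmu(bmus, target):
    """Return indices of the rows of `bmus` equal to the `target` coordinate pair."""
    bmus = np.asarray(bmus)
    matches = (bmus == np.asarray(target)).all(axis=1)
    return np.flatnonzero(matches)

# Stand-in for som.bmus.
bmus = np.array([[0, 1], [2, 3], [0, 1], [4, 4]])
target = np.array([0, 1])  # e.g. the first row returned by som.get_bmus(...)
idx = instances_at_bmu(bmus, target)
# data[idx] would then be the training vectors mapped to that node.
```

An empty result simply means no training instance had that node as its BMU, which is common on a large map.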
Thank you, I'll give that a try.
Yeah, I'm aware that this could go wrong, and we could be left with a lot of garbage, but I guess it's worth a try. There are quite a few papers out there where people successfully used SOMs to impute missing values.
I am not sure how they work, but this is what I would do:
In other words, I would not use SOM as an associative memory, but rather as a kind of generative model.
Your approach gave me quite a bit to ponder, as it is quite different from the way I was thinking of using the SOMs. I would have used a SOM with #nodes<|dataset|, so I could average the results per node to estimate the missing values. Actually, it might be interesting to try both approaches. (If I can figure out how to, as I haven't been able to get back to this over the last few days.)
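A rough sketch of the per-node averaging idea, assuming data is a 2D array with NaNs marking the missing values and bmus holds one coordinate pair per row of data (for example som.bmus after training on a NaN-free copy). The helper impute_by_node_mean is hypothetical.

```python
import numpy as np

def impute_by_node_mean(data, bmus):
    """Fill each NaN with the mean of that feature over instances sharing the BMU."""
    data = np.asarray(data, dtype=float)
    bmus = np.asarray(bmus)
    filled = data.copy()
    for node in np.unique(bmus, axis=0):
        members = (bmus == node).all(axis=1)
        block = data[members]
        observed = ~np.isnan(block)
        counts = observed.sum(axis=0)
        sums = np.where(observed, block, 0.0).sum(axis=0)
        # Mean over observed values; stays NaN if no instance at this node has the feature.
        means = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
        rows, cols = np.nonzero(members[:, None] & np.isnan(data))
        filled[rows, cols] = means[cols]
    return filled

data = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [10.0, 20.0]])
bmus = np.array([[0, 0], [0, 0], [0, 0], [1, 1]])
filled = impute_by_node_mean(data, bmus)
```

A value stays NaN when no instance at the same node has that feature observed, so a fallback such as the column mean may still be needed.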
At the end of the day, both are just forms of averaging. With an ESOM, you get a kind of nonlinear interpolation between data instances, so your results might improve.
Hi, I have a dataset with complete and missing data and want to use a SOM to estimate the missing values. In theory, I know how to do this, i.e., train the SOM on available data, then find BMUs for the missing datapoints and use the corresponding features to estimate the missing values (please correct me if I got that wrong). However, I'm pretty much a noob at this and have no idea how to best implement this with Somoclu in Python. Any help would be really appreciated: e.g., how does Somoclu deal with NaNs (will it use what is there or ignore that input vector entirely), and how do I use the bmus to find the corresponding data? Thanks a million!
(PS: Because I won't have a chance to say this otherwise, many thanks Peter Wittek for Somoclu.)