POPSOM is a Python library for working with population-based Self-Organizing Maps. It is derived from the R-based POPSOM package, which was developed and is maintained by Dr. Lutz Hamel and his former students.
This version of popsom is based on the R implementation of popsom version 4.x. However, it lacks one key feature of the R implementation: the native implementation of the VSOM training algorithm based on ideas from tensor algebra.
$ pip install -r requirements.txt
We have 13 different kinds of animals, each described by 13 binary features. The table lists 12 of them; the fox is included only in the code below.
| | dove | hen | duck | owl | eagle | dog | wolf | cat | tiger | lion | horse | cow |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Small | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
Medium | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
Big | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
2 legs | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 legs | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Hair | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Hooves | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
Mane | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
Feathers | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Hunt | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |
Run | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
Fly | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Swim | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Load popsom, pandas and sklearn libraries.
import popsom as som
import pandas as pd
from sklearn import datasets
If you get the following error message: ImportError: cannot import name 'datetools', reinstall statsmodels and its dependencies like this:
pip3 uninstall statsmodels
pip3 install numpy scipy pandas
pip3 install statsmodels
Prepare the data for training.
animal = ['dove','hen','duck','owl','eagle','fox','dog','wolf','cat','tiger','lion','horse','cow']
attribute = [[1,0,0,1,0,0,0,0,1,0,0,1,0],
[1,0,0,1,0,0,0,0,1,0,0,0,0],
[1,0,0,1,0,0,0,0,1,0,0,1,1],
[1,0,0,1,0,0,0,0,1,1,0,1,0],
[0,1,0,1,0,0,0,0,1,1,0,0,0],
[0,1,0,0,1,1,0,0,0,1,0,0,0],  # fox: medium, 4 legs, hair, hunt
[0,1,0,0,1,1,0,0,0,0,1,0,0],
[0,1,0,0,1,1,0,1,0,1,1,0,0],
[1,0,0,0,1,1,0,0,0,1,0,0,0],
[0,0,1,0,1,1,0,0,0,1,1,0,0],
[0,0,1,0,1,1,0,1,0,1,1,0,0],
[0,0,1,0,1,1,1,1,0,0,1,0,0],
[0,0,1,0,1,1,1,0,0,0,0,0,0]]
attr = pd.DataFrame(attribute)
attr.columns = ['small','medium','big','2 legs','4 legs','hair','hooves','mane','feathers','hunt','run','fly','swim']
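Typing thirteen rows of zeros and ones by hand is error-prone. As a minimal sketch, the rows can instead be derived from a set of traits per animal. The names `TRAITS` and `encode` are illustrative and not part of popsom, and only three animals from the table are shown.

```python
# Feature order matches the column names assigned to attr.columns.
FEATURES = ['small', 'medium', 'big', '2 legs', '4 legs', 'hair', 'hooves',
            'mane', 'feathers', 'hunt', 'run', 'fly', 'swim']

# Hypothetical helper data: traits per animal, copied from the table.
TRAITS = {
    'dove':  {'small', '2 legs', 'feathers', 'fly'},
    'duck':  {'small', '2 legs', 'feathers', 'fly', 'swim'},
    'horse': {'big', '4 legs', 'hair', 'hooves', 'mane', 'run'},
}

def encode(traits, features=FEATURES):
    """Turn a set of trait names into a 0/1 row in feature order."""
    unknown = set(traits) - set(features)
    if unknown:
        # Catch misspelled trait names instead of silently encoding zeros.
        raise ValueError(f"unknown traits: {sorted(unknown)}")
    return [1 if f in traits else 0 for f in features]

rows = {name: encode(traits) for name, traits in TRAITS.items()}
print(rows['duck'])
```

The resulting lists can be passed to `pd.DataFrame(list(rows.values()), columns=FEATURES)` and the keys used as labels.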
Initialize the model.
m = som.map(xdim=10,ydim=5)
Train the map on the data.
m.fit(attr,animal)
Compute and display the starburst representation of clusters
m.starburst()
1. Prepare the iris data for training.
iris = datasets.load_iris()
labels = iris.target
data = pd.DataFrame(iris.data[:, :4])
data.columns = iris.feature_names
2. Initialize the model.
m = som.map(xdim=10,ydim=5,train=1000,norm=False)
Parameters:
- xdim,ydim - the dimensions of the map
- alpha - the learning rate, should be a positive non-zero real number
- train - number of training iterations
- norm - normalize the input data space
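To make the role of these parameters concrete, here is a minimal sketch of a generic stochastic SOM training loop in NumPy. It is not popsom's implementation: the constant learning rate, the Gaussian neighbourhood, the linearly shrinking radius, and the row layout `index = x + y * xdim` are all assumptions chosen for brevity.

```python
import numpy as np

def train_som(data, xdim=10, ydim=5, alpha=0.3, train=1000, seed=0):
    """Train a small SOM; returns (neurons, grid). Illustrative only."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n_obs = data.shape[0]
    # Grid position (x, y) of every neuron; row index = x + y * xdim.
    grid = np.array([(x, y) for y in range(ydim) for x in range(xdim)],
                    dtype=float)
    # Start each neuron at a randomly chosen training row.
    neurons = data[rng.integers(0, n_obs, size=xdim * ydim)].copy()
    radius0 = max(xdim, ydim) / 2.0
    for t in range(train):
        x = data[rng.integers(0, n_obs)]            # pick one observation
        bmu = np.argmin(((neurons - x) ** 2).sum(axis=1))  # best match
        radius = max(radius0 * (1.0 - t / train), 0.5)     # shrink over time
        d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2.0 * radius ** 2))       # neighbourhood weights
        neurons += alpha * h[:, None] * (x - neurons)      # pull toward x
    return neurons, grid

rng = np.random.default_rng(1)
demo = rng.normal(size=(150, 4))   # synthetic stand-in for a data frame
neurons, grid = train_som(demo, xdim=10, ydim=5, alpha=0.3, train=1000)
print(neurons.shape)
```

Because every update moves a neuron part of the way toward an observation, the neurons stay inside the range spanned by the training data as long as `alpha` is between 0 and 1.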
3. Train the map on the data.
m.fit(data,labels)
Parameters:
- data - a dataframe where each row contains an unlabeled training instance
- labels - a vector or dataframe with one label for each observation in data
4. Compute the relative significance of each feature and plot it
m.significance()
Parameters:
- graphics - a switch that controls whether a plot is generated or not
- feature_labels - a switch to allow the plotting of feature names vs feature indices
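One simple way to think about feature significance is as each feature's share of the total variance in the training data. The sketch below assumes that definition; it is an illustration and may differ in detail from what `m.significance()` computes.

```python
import numpy as np

def significance(data):
    """Per-feature share of the total sample variance (sums to 1)."""
    data = np.asarray(data, dtype=float)
    var = data.var(axis=0, ddof=1)      # sample variance of each column
    total = var.sum()
    if total == 0:
        raise ValueError("all features are constant")
    return var / total

# Small hand-made example: column 0 varies a lot, column 1 not at all.
demo = np.array([[0.0, 1.0, 5.0],
                 [2.0, 1.0, 5.0],
                 [4.0, 1.0, 6.0],
                 [6.0, 1.0, 6.0]])
sig = significance(demo)
print(sig)
```

A constant feature receives a significance of zero, so it contributes nothing to distinguishing observations on the map.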
5. Compute the convergence index of a map
m.convergence()
1.0
Parameters:
- conf_int - the confidence interval of the quality assessment (default 95%)
- k - the number of samples used for the estimated topographic accuracy computation
- verb - if true, reports the two convergence components separately; otherwise reports their linear combination
- ks - a switch: true for the KS test, false for the standard variance and means tests
Return:
- the convergence index.
6. Evaluate the embedding of a map using the F-test and a Bayesian estimate of the variance in the training data
m.embed()
1.0
Parameters:
- conf_int - the confidence interval of the convergence test (default 95%)
- verb - a switch that governs the return value: false returns a single convergence value, true returns a vector of individual feature convergences.
Return:
- the embedding of the map (variance captured by the map so far)
Hint:
- the embedding index is the variance of the training data captured by the map;
- maps with convergence of less than 90% are typically not trustworthy.
- Of course, the precise cut-off depends on the noise level in your training data.
7. Measure the topographic accuracy of the map using sampling
m.topo()
{'val': 0.97999999999999998, 'lo': 0.93999999999999995, 'hi': 1.0}
Parameters:
- k - the number of samples used for the accuracy computation
- conf_int - the confidence interval of the accuracy test (default 95%)
- verb - a switch that governs the return value: false returns a single accuracy value, true returns a vector of individual feature accuracies
- interval - a switch that controls whether the confidence interval is computed
Return:
- the estimated topographic accuracy.
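Topographic accuracy from step 7 can be illustrated with a short sketch: sample k observations, find the two closest neurons for each, and count how often those two neurons are neighbours on the map. The neighbourhood test (at most one grid step in x and in y) and the function name are assumptions, and the confidence interval is omitted.

```python
import numpy as np

def topo_accuracy(data, neurons, grid, k=50, seed=0):
    """Fraction of sampled points whose two best neurons are grid neighbours."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    neurons = np.asarray(neurons, dtype=float)
    grid = np.asarray(grid, dtype=float)
    k = min(k, len(data))
    sample = data[rng.choice(len(data), size=k, replace=False)]
    hits = 0
    for x in sample:
        d = ((neurons - x) ** 2).sum(axis=1)
        best, second = np.argsort(d)[:2]
        # Neighbours differ by at most one step along each map axis.
        if np.abs(grid[best] - grid[second]).max() <= 1:
            hits += 1
    return hits / k

# A 3x1 map with one feature: once well ordered, once folded.
grid3 = [[0, 0], [1, 0], [2, 0]]
points = [[0.1], [0.4], [0.9]]
acc_ordered = topo_accuracy(points, [[0.0], [1.0], [2.0]], grid3)
acc_folded = topo_accuracy(points, [[0.0], [10.0], [1.0]], grid3)
print(acc_ordered, acc_folded)
```

In the folded map, neurons that are similar in feature space sit two cells apart, so every sampled point counts as a topographic error.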
8. Compute and display the starburst representation of clusters
m.starburst()
Parameters:
- explicit - controls the shape of the connected components
- smoothing - controls the smoothing level of the umat (None, 0, or > 0)
- merge_clusters - a switch that controls if the starburst clusters are merged together
- merge_range - a range that is used as a percentage of a certain distance in the code to determine whether components are closer to their centroids or centroids closer to each other.
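The umat mentioned above is the unified distance matrix (U-matrix) that the starburst is drawn on. As a rough sketch, each cell can be computed as the mean distance between a neuron and its immediate grid neighbours. The row layout `index = x + y * xdim` and the 8-cell neighbourhood are assumptions, and no smoothing is applied.

```python
import numpy as np

def umatrix(neurons, xdim, ydim):
    """Mean distance from each neuron to its grid neighbours, shape (ydim, xdim)."""
    w = np.asarray(neurons, dtype=float).reshape(ydim, xdim, -1)
    umat = np.zeros((ydim, xdim))
    for y in range(ydim):
        for x in range(xdim):
            dists = []
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    # Skip the cell itself and anything off the map.
                    if (dx or dy) and 0 <= ny < ydim and 0 <= nx < xdim:
                        dists.append(np.linalg.norm(w[y, x] - w[ny, nx]))
            umat[y, x] = np.mean(dists) if dists else 0.0
    return umat

# A 4x1 map holding two tight clusters of neurons.
umat = umatrix([[0.0], [0.0], [10.0], [10.0]], xdim=4, ydim=1)
print(umat)
```

Low values mark the interior of a cluster and high values mark the ridge between clusters, which is the boundary the starburst lines do not cross.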
9. Plot that shows the marginal probability distribution of the neurons and data
m.marginal(0)
m.marginal(1)
m.marginal(2)
m.marginal(3)
Parameters:
- marginal - the name or index of a training data frame dimension
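The idea behind the marginal plot is to compare, for one dimension, how the training data and the neurons are distributed. The sketch below uses normalised histograms over shared bins as a simple stand-in for the density curves; the function name and the choice of histograms are assumptions, not popsom's plotting code.

```python
import numpy as np

def marginal_densities(data_col, neuron_col, bins=10):
    """Histogram densities of data and neurons over the same bin edges."""
    data_col = np.asarray(data_col, dtype=float)
    neuron_col = np.asarray(neuron_col, dtype=float)
    lo = min(data_col.min(), neuron_col.min())
    hi = max(data_col.max(), neuron_col.max())
    edges = np.linspace(lo, hi, bins + 1)     # shared bins for both curves
    d_hist, _ = np.histogram(data_col, bins=edges, density=True)
    n_hist, _ = np.histogram(neuron_col, bins=edges, density=True)
    return edges, d_hist, n_hist

edges, d_hist, n_hist = marginal_densities(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 3, 5, 7], bins=5)
print(d_hist, n_hist)
```

If the two densities have a similar shape, the neurons model that dimension of the data well.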
10. Print the association of labels with map elements
m.projection()
labels x y
0 0 9 2
1 0 8 0
2 0 9 1
3 0 8 0
4 0 9 2
5 0 8 4
6 0 8 1
.. ... .. ..
141 2 0 2
142 2 2 4
143 2 0 1
144 2 0 4
145 2 0 2
146 2 2 2
147 2 1 2
148 2 0 4
149 2 2 4
Return:
- a dataframe containing the projection onto the map for each observation.
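Conceptually, the projection assigns each observation the map coordinates of its best matching neuron. A minimal NumPy sketch, assuming neurons are stored one per row with `index = x + y * xdim` (an assumed layout, not confirmed for popsom):

```python
import numpy as np

def project(data, neurons, xdim):
    """Return an (n_obs, 2) array of (x, y) map coordinates per observation."""
    data = np.asarray(data, dtype=float)
    neurons = np.asarray(neurons, dtype=float)
    # Squared distance between every observation and every neuron.
    d2 = ((data[:, None, :] - neurons[None, :, :]) ** 2).sum(axis=2)
    bmu = d2.argmin(axis=1)                  # index of best matching neuron
    return np.column_stack((bmu % xdim, bmu // xdim))

# A 2x2 map whose neurons sit at the corners of the unit square.
neurons = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
coords = project([[0.9, 0.1], [0.1, 0.8], [1.0, 1.0]], neurons, xdim=2)
print(coords)
```

Pairing these coordinates with the labels gives a table of the same shape as the output shown above.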
11. Return the contents of a neuron at (x,y) on the map as a vector
m.neuron(6,3)
array([ 5.21176518, 2.61068045, 3.63423014, 1.18464818])
Parameters:
- x - map x-coordinate of neuron.
- y - map y-coordinate of neuron.
Return:
- a vector representing the neuron.
Yuan, Li, "Implementation of Self-Organizing Maps with Python" (2018).