peterwittek / somoclu

Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters
https://peterwittek.github.io/somoclu/
MIT License

Training and initialization parameters for extremely large data #79

Closed: passionato closed this issue 7 years ago

passionato commented 7 years ago

I have set up somoclu in Python, and it is up and running fine on small datasets. However, I am testing it with a dataset of 500K rows (observations) and 8,000 columns (dimensions). Somoclu has been running for more than a week now with no results! What is the best approach to adopt in such big-data cases? How are the initial parameters fixed and determined?

peterwittek commented 7 years ago

It does not sound too big. That is about 15 GB at single precision. Throw a bunch of GPUs at it and increase the verbosity level to see how far it progresses.

The bottleneck is seldom the data. The size of the SOM is much more important.
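For reference, a minimal sketch of what that looks like in the Python wrapper, assuming a CUDA-enabled build of somoclu; the grid size here is illustrative, and data stands in for your own array:

import somoclu

# Assumes somoclu was compiled with CUDA support: kerneltype=1 selects
# the dense GPU kernel (0 is the default CPU kernel), and verbose=2
# prints progress so you can see how far training gets.
som = somoclu.Somoclu(50, 50, kerneltype=1, verbose=2)
som.train(data, epochs=10)  # data: float32 array, e.g. 500K x 8000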

passionato commented 7 years ago

So how big should the SOM be for this, and in what dimensions? You are talking about the grid size, I suppose, i.e. n_columns and n_rows? What is the recommended method to calculate the size for a given dataset? How about the number of epochs? How much effect does that have?

peterwittek commented 7 years ago

When I start training from scratch, I usually go for ten epochs. To my knowledge, there are no rules for picking the grid size. If n_columns*n_rows < |dataset|, you get a SOM; if it is the other way around, you get an emergent SOM (ESOM). Training time increases linearly with the total number of nodes (n_columns*n_rows).
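To make the scaling concrete, a back-of-the-envelope sketch (the grid sizes are illustrative, not recommendations):

# Illustrative only: compare node counts against the dataset size.
n_samples = 500_000

for n_rows, n_columns in [(20, 20), (50, 50), (100, 100)]:
    nodes = n_rows * n_columns
    kind = "SOM" if nodes < n_samples else "ESOM"
    # Training time grows linearly with the node count, so a 100x100
    # grid costs roughly 25x as much per epoch as a 20x20 grid.
    print(f"{n_rows}x{n_columns}: {nodes} nodes -> {kind}")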

passionato commented 7 years ago

So I have a fundamental question, sorry for being a novice here.

I have a dataset of 1M observations and several thousand columns (dimensions). Somoclu runs as part of a larger Python program and does its job where it is supposed to. How can I set up parallel computing for the somoclu part with this big dataset? How long do you estimate it will take? How can I know that somoclu is running at its best and parallelizing across all my cores?

I need an example to get it going: a sample script, or an answer adapted to a 1M-row dataset. I would be very grateful to you.

My machine info is:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               1200.000
BogoMIPS:              7400.22
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              10240K
NUMA node0 CPU(s):     0-7

Here is a first attempt at the code:

import somoclu

# my_data is assumed to be a 2D float32 NumPy array of shape
# (n_observations, n_dimensions).
n_rows, n_columns = 20, 20
som = somoclu.Somoclu(n_columns, n_rows, maptype="planar",
                      gridtype="rectangular", neighborhood="gaussian",
                      initialization="pca", std_coeff=0.5, verbose=2)
som.train(my_data, epochs=10, scalecooling="exponential")
labels = list(range(my_data.shape[0]))  # one label per observation
som.view_umatrix(bestmatches=True, labels=labels, filename="umatrix")
som_state = som.get_surface_state()
my_bmus = som.get_bmus(som_state)

and my VGA is:

$ lspci -vnnn | perl -lne 'print if /^\d+\:.+(\[\S+\:\S+\])/' | grep VGA
07:04.0 VGA compatible controller [0300]: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 [102b:0532] (rev 0a) (prog-if 00 [VGA controller])
peterwittek commented 7 years ago

This is fine. You do not have a CUDA GPU, so you can forget about that. Run the code and check with top how many cores Somoclu uses. CPU utilization should be very close to the number of logical cores (including hyperthreaded ones) times 100%.
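One way to pin the thread count, assuming the default CPU kernel is OpenMP-based and honors OMP_NUM_THREADS:

import os

# Assumption: somoclu's CPU kernel uses OpenMP, so OMP_NUM_THREADS caps
# the thread count. Set it before importing somoclu so the OpenMP
# runtime picks it up at load time.
os.environ["OMP_NUM_THREADS"] = "8"  # 8 logical cores on this machine

import somoclu  # then train as usual and watch `top`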

peterwittek commented 7 years ago

Actually, you ask for PCA initialization. That is done via scikit-learn, and it is most likely serial code. With the amount of data you have, the SVD can take forever.
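If the SVD is the bottleneck, one workaround is to switch to random initialization, which skips the PCA step entirely; a minimal sketch, keeping the other parameters from the script above:

import somoclu

# initialization="random" avoids the serial SVD that "pca" triggers via
# scikit-learn; it may need a few more epochs to converge well.
som = somoclu.Somoclu(20, 20, maptype="planar", gridtype="rectangular",
                      neighborhood="gaussian", initialization="random",
                      verbose=2)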