peterwittek / somoclu

Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters
https://peterwittek.github.io/somoclu/
MIT License

Training SOM with GPU is slower than using CPU #89

Closed espg closed 6 years ago

espg commented 7 years ago

Running with GPU:

import somoclu

n_rows, n_columns = 200, 300
som = somoclu.Somoclu(n_columns, n_rows, compactsupport=False, kerneltype=1, verbose=2)
som.train(data=dataDCT)

Yields:

Time for epoch 1: 479.6      10% [======                                            ]
Time for epoch 2: 479.3      20% [===========                                       ]
Time for epoch 3: 480.1      30% [================                                  ]
Time for epoch 4: 481.2      40% [=====================                             ]
Time for epoch 5: 482.9      50% [==========================                        ]
Time for epoch 6: 571.4      60% [===============================                   ]
Time for epoch 7: 1058      70% [====================================              ]
Time for epoch 8: 1263      80% [=========================================         ]
Time for epoch 9: 835.5      90% [==============================================    ]
Time for epoch 10: 481.4     100% [===================================================]

The GPU is a Tesla P100 with 16 GB of RAM:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 0000:07:00.0     Off |                    0 |
| N/A   34C    P0    41W / 300W |   1705MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 0000:84:00.0     Off |                    0 |
| N/A   29C    P0    31W / 300W |      2MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     16609    C   /usr/bin/python3                              1703MiB |
+-----------------------------------------------------------------------------+

There are two Tesla P100s, but during the run we only used the first one (1.7 GB of GPU RAM usage).

Running on the same dataset, but without the GPU:

n_rows, n_columns = 200, 300
som = somoclu.Somoclu(n_columns, n_rows, compactsupport=False, kerneltype=0, verbose=2)
som.train(data=dataDCT)

Yields:

Time for epoch 1: 233.7      10% [======                                            ]
Time for epoch 2: 228.6      20% [===========                                       ]
Time for epoch 3: 231.7      30% [================                                  ]
Time for epoch 4: 230.1      40% [=====================                             ]
Time for epoch 5: 229.1      50% [==========================                        ]
Time for epoch 6: 306.4      60% [===============================                   ]
Time for epoch 7: 636.2      70% [====================================              ]
Time for epoch 8: 766.2      80% [=========================================         ]
Time for epoch 9: 498.1      90% [==============================================    ]
Time for epoch 10: 227.9     100% [===================================================]

...which is about twice as fast as using the GPU. Both runs pretty much hammer the CPUs to 100% (CPU scaling works well: 28 CPUs give a 6x speedup compared to 4 CPUs).

Thoughts?

peterwittek commented 7 years ago

Odd. Somoclu's GPU kernel was written in the Fermi era, but it relies on cuBLAS, so it should be reasonably nimble on a Pascal GPU. For multi-GPU, your only option is the command-line version compiled with MPI.

You might have to saturate the GPU more. Try bigger maps or more data. Since you have 16 GB of GPU RAM, try to max it out and then compare with the CPU.
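
A minimal sketch of such a saturation test with the Python interface (synthetic data and placeholder sizes, meant to be grown until GPU memory is nearly full; substitute the real dataset for the random array):

import numpy as np
import somoclu

# Placeholder sizes: increase these until GPU memory is nearly full.
n_samples, n_features = 50000, 5041
n_rows, n_columns = 200, 300

data = np.random.rand(n_samples, n_features).astype(np.float32)

# kerneltype=0 is the dense CPU kernel, kerneltype=1 the dense GPU kernel.
for kernel in (0, 1):
    som = somoclu.Somoclu(n_columns, n_rows, compactsupport=False,
                          kerneltype=kernel, verbose=2)
    som.train(data=data)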

espg commented 7 years ago

The above benchmark was run on 10,000 sample vectors, each with 5041 features. I've set up a larger task that has 212,534 samples of the same 5041-feature length; I'll also bump the map to either (2000, 3000) or (1000, 1500). I expect it'll take a couple of days to run it with and without the GPU, but I'll post training time results after they finish and see if the GPU helps on the larger problem set.

espg commented 7 years ago

So...I can't get the GPU option to work at all on a larger dataset without running into memory problems. I get the following two errors:

terminate called after throwing an instance of 'thrust::system::detail::bad_alloc' what(): std::bad_alloc: out of memory

and:

terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc

The first error happens when the map is too big... I tried to find a layout that would max out the GPU but still fit in memory.

The second error happens even on small maps, i.e., 200 by 300. If I run a 200 by 300 neuron map on the CPU, I can see that it settles into 5 GB of memory usage after a few minutes... but it does spike to about 20 GB first before settling down. In other words, it looks like there is a bottleneck between setting up the SOM model and training the SOM. I'd expect to be able to use the GPU to update the map at small map sizes, but it looks like there's some calculation other than the neuron updates that's being passed to the GPU instead of the CPU and crashing due to lack of memory.

I'm not a CUDA programmer, but it looks like there might be a fix for this case using unified memory:

https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/

It looks like malloc() or new calls can be transparently replaced with the cudaMallocManaged() allocation method, which allows paging to/from system memory...

peterwittek commented 7 years ago

Oddly enough, the GPU version uses more memory, and this problem surfaces when you have a relatively low-dimensional space (e.g., a few hundred or a few thousand dimensions) but many training instances. The problem is that a matrix of size #training instances times #neurons is calculated in a cuBLAS call. So the problem you are facing might be related to this.
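
As a rough back-of-the-envelope check (assuming that matrix is held in single precision on the device; the helper below is purely illustrative): a 200 by 300 map has 60,000 neurons, so the 10,000-sample run needs about 2.4 GB for this matrix, while the full 212,534-sample dataset would need roughly 51 GB, far beyond the 16 GB on the card.

# Rough size of the (#training instances x #neurons) matrix,
# assuming single-precision floats; illustrative helper only.
def matrix_gb(n_instances, n_rows, n_cols, bytes_per_float=4):
    return n_instances * n_rows * n_cols * bytes_per_float / 1e9

print(matrix_gb(10000, 200, 300))    # ~2.4 GB: fits on a 16 GB P100
print(matrix_gb(212534, 200, 300))   # ~51 GB: far beyond 16 GB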

Back in 2010, when the GPU kernel was originally written, there was no managed memory. If you want to look into this, by all means do; I would love to see an improvement.

espg commented 7 years ago

I've taken a look at the CUDA dense kernel, and I think I can figure out enough to phase out most of the Thrust calls for allocating memory and rewrite it to run better on more modern hardware. I am a bit concerned that doing so will fix the memory issues above... but not address the GPU training more slowly than the dense CPU kernel.

Do you have any suggestions for testing the setup as is to figure out if the GPUs will actually yield an improvement? I know that you mentioned saturating the GPU more... but there are a few ways to do that. I could increase the SOM map size. I could increase the feature vector length (i.e., 5041 --> 40,000). I could increase the training set (something between the 10K I ran before and the 212K that is my full data set). Ideally, I'd like to know that performance using the GPU will scale in all of the above cases... but perhaps I can only expect it to scale with map size or feature size?

Do you know or have a reference for the time complexity of the algorithm in terms of map size vs training size vs training dimensionality?

peterwittek commented 7 years ago

Thanks. Your safest bet is increasing the dimensionality of the feature space. I would be very surprised if the GPU was slower.

The complexity is easy to estimate. It is quadratic in the number of neurons and linear in the number of data instances (finding a BMU is quadratic, plus you have a matrix multiplication). The dense case is also linear in the number of dimensions (from the matrix multiplication).
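
Reading that estimate literally gives a toy per-epoch cost model (all constant factors dropped, so only relative changes between runs are meaningful; this is a sketch of the scaling above, not a profile of the actual kernels):

# Toy cost model from the scaling described above: quadratic in the number
# of neurons, linear in the number of data instances, and (dense case)
# linear in the number of dimensions via the matrix multiplication.
def relative_cost(n_instances, n_neurons, n_dims):
    return n_instances * (n_neurons ** 2 + n_neurons * n_dims)

base = relative_cost(10000, 200 * 300, 5041)
print(relative_cost(10000, 200 * 300, 40000) / base)    # longer feature vectors
print(relative_cost(212534, 200 * 300, 5041) / base)    # more training data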

Retiring Thrust would be great. We have had issue #12 open on this since 2015. We did some preliminary work on it in the cuda_opt branch.

noofaq commented 5 years ago

I am currently experiencing a very similar issue.

I am using the latest Amazon DL AMI with somoclu installed following the documentation (installation with CUDA).

A test with kerneltype=0 takes around 40 s per epoch; kerneltype=1 results in ~180 s. I notice that both runs use all available CPU cores; when called with kerneltype=1, nvidia-smi registers a connected process with some memory reserved (but it is rather small, around 0.5 GB), though utilization is always 0%. To me it looks like the GPU is not used for training at all. The data have ~1000 features and there are ~50k data rows.

Is there any way to find the reason for such issues?

chapleau commented 4 years ago

A bit late to the party, but I am having the same issue with the current master branch, though only when using the Python interface. With the binary command-line version, I do see some speedup from using the dense GPU kernel (roughly 30%). I didn't have much time to dig into it, but I thought it could be a good hint if someone is willing to take a look.

chapleau commented 4 years ago

Ah, got it: the compiler options are different. This did the trick for me (in setup.py):

extra_compile_args = ['-fopenmp','-ffast-math', '-march=native']

and for nvcc:

'--compiler-options', '-fPIC -fopenmp -O3 -ffast-math -march=native'
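
For context, a minimal sketch of how such host-compiler flags attach to a setuptools Extension (the module and source names below are placeholders; the real setup.py is more involved, since it also has to drive nvcc with the '--compiler-options' string above):

from setuptools import setup, Extension

# Placeholder names; only the flag wiring is the point of this sketch.
ext = Extension(
    'somoclu._somoclu_wrap',                  # placeholder module name
    sources=['somoclu/somoclu_wrap.cxx'],     # placeholder source list
    extra_compile_args=['-fopenmp', '-ffast-math', '-march=native'],
    extra_link_args=['-fopenmp'],
)

setup(name='somoclu', ext_modules=[ext])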