Odd. Somoclu's GPU kernel was written in the Fermi era, but it relies on CuBLAS, so it should be reasonably nimble on a Pascal GPU. For multi-GPU, your only option is the command-line version compiled with MPI.
You might have to saturate the GPU more. Try bigger maps or more data. Since you have 16 GB of GPU RAM, try to max it out and then compare with the CPU.
The above benchmark was run on 10,000 sample vectors, each of length 5041 features. I've set up a larger task that has 212,534 samples with the same 5041-feature vector length; I'll bump the map to either (2000, 3000) or (1000, 1500) too. I expect it'll take a couple of days to run it with and without the GPU, but I'll post training-time results after they finish and see if the GPU helps on the larger problem set.
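As a rough sanity check (my own back-of-the-envelope numbers, assuming a 32-bit float codebook that has to live entirely in GPU memory), the codebook alone at those map sizes would be

$$2000 \times 3000 \times 5041 \times 4~\text{bytes} \approx 121~\text{GB} \quad\text{or}\quad 1000 \times 1500 \times 5041 \times 4~\text{bytes} \approx 30~\text{GB},$$

either of which is already past the 16 GB on the card, so a lot depends on how somoclu actually stages data on the GPU.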
So...I can't get the GPU option to work at all on a larger dataset without running into memory problems. I get the following two errors:
terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
  what():  std::bad_alloc: out of memory
and:
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
The first error happens when the map is too big... I tried to get a layout that would max out the GPU but still be able to fit in memory.
The second error happens even on small maps, e.g. 200 by 300. If I run a 200 by 300 neuron map on the CPU, I can see that it settles into 5 GB of memory usage after a few minutes... but it does spike to about 20 GB first before settling down. In other words, it looks like there is a bottleneck between setting up the SOM model and training the SOM. I'd expect to be able to use the GPU to update the map at small map sizes, but it looks like there's some calculation other than the neuron updates that's being passed to the GPU instead of the CPU and crashing due to lack of memory.
I'm not a CUDA programmer, but it looks like there might be a fix for this case using unified memory:
https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/
It looks like there's a transparent drop-in replacement of malloc() or new calls with a new cudaMallocManaged() allocation method that will allow paging to/from system memory...
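For what it's worth, here's a minimal, self-contained sketch (not somoclu's actual code; the buffer name, sizes, and the hint call are just illustrative) of what swapping an explicit device allocation for cudaMallocManaged() could look like, so that an oversized buffer can spill into system RAM on a Pascal card:

```cpp
// Minimal sketch, not somoclu's actual code: swap an explicit device
// allocation for CUDA unified (managed) memory so an oversized buffer can
// page between GPU and system RAM on Pascal-class hardware.
// nVectors/nNeurons below are just the numbers from this thread.
#include <cstdio>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                         cudaGetErrorString(err), __FILE__, __LINE__);    \
            return 1;                                                     \
        }                                                                 \
    } while (0)

int main() {
    const size_t nVectors = 212534;        // training instances
    const size_t nNeurons = 200 * 300;     // map size
    const size_t bytes = nVectors * nNeurons * sizeof(float);  // ~51 GB

    float *activation = nullptr;
    // Instead of cudaMalloc()/new/thrust::device_vector, request managed
    // memory; the driver migrates pages on demand, so the buffer may exceed
    // the 16 GB of device RAM as long as there is enough system RAM.
    CUDA_CHECK(cudaMallocManaged(&activation, bytes));

    // Optional hint: prefer keeping the pages on the current GPU; the return
    // value is ignored on hardware that does not support the advice.
    int device = 0;
    CUDA_CHECK(cudaGetDevice(&device));
    cudaMemAdvise(activation, bytes, cudaMemAdviseSetPreferredLocation, device);

    // ... the cuBLAS call that fills the nVectors x nNeurons matrix would go here ...

    CUDA_CHECK(cudaFree(activation));
    std::printf("allocated and freed %zu bytes of managed memory\n", bytes);
    return 0;
}
```

The catch is that pages then migrate between host and device on demand, so an oversubscribed buffer will run much slower than one that fits entirely in the 16 GB of device memory.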
Oddly enough, the GPU version uses more memory, and this problem surfaces when you have a relatively low-dimensional space (e.g. a few hundred or a few thousand dimensions), but many training instances. The problem is that a matrix of size #training instances × #neurons is calculated in a CuBLAS call. So the problem you are facing might be related to this.
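To put numbers on that for the "small map" case above (my arithmetic, assuming single-precision floats), a 212,534 × (200 × 300) matrix alone is

$$212{,}534 \times 60{,}000 \times 4~\text{bytes} \approx 5.1 \times 10^{10}~\text{bytes} \approx 51~\text{GB},$$

far beyond the 16 GB on a single P100, so the std::bad_alloc even at small map sizes would be consistent with this intermediate matrix.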
Back in 2010 when the GPU kernel was originally written, there was no managed memory. If you want to look into this, by all means, I would love to see an improvement.
I've taken a look at the CUDA dense kernel, and I think I can figure out enough to phase out most of the Thrust calls for allocating memory and rewrite it to run better on more modern hardware. I am a bit concerned that doing that will fix the memory issues above... but not address the GPU training being slower than the dense CPU kernel.
Do you have any suggestions for testing the setup as is, to figure out whether the GPUs will actually yield an improvement? I know that you mentioned saturating the GPU more... but there are a few ways to do that. I could increase the SOM map size. I could increase the feature vector length (e.g., 5041 --> 40,000). I could increase the training set (something between the 10K I ran before and the 212K that is my full data set). Ideally, I'd like to know that performance using the GPU will scale in all of the above cases... but perhaps I can only expect it to scale with map size or feature size?
Do you know or have a reference for the time complexity of the algorithm in terms of map size vs training size vs training dimensionality?
Thanks. Your safest bet is increasing the dimensionality of the feature space. I would be very surprised if the GPU was slower.
The complexity is easy to estimate. It is quadratic in the number of neurons and linear in the number of data instances (finding a BMU is quadratic, plus you have a matrix multiplication). The dense case is also linear in the number of dimensions (from the matrix multiplication).
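Spelling that out (my notation, reading the statement above literally, with $n$ data instances, $m$ neurons, and $d$ dimensions), the per-epoch cost of the dense kernel is roughly

$$T_{\text{epoch}} = \mathcal{O}\!\left(n \cdot m^{2} + n \cdot m \cdot d\right),$$

i.e. linear in $n$ and $d$ but quadratic in $m$, so growing the map inflates the BMU term, while growing the dimensionality mostly grows the (GPU-friendly) matrix-multiplication term.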
Retiring Thrust would be great. We have had issue #12 open on this since 2015. We did some preliminary work on it in the cuda_opt branch.
I am currently experiencing a very similar issue.
I am using the latest Amazon DL AMI with somoclu installed following the documentation (installation with CUDA).
A test with kerneltype=0 takes around 40 s per epoch; kerneltype=1 takes ~180 s. I notice that both runs use all available CPU cores; when called with kerneltype=1, nvidia-smi registers a connected process with some memory reserved (though it is rather small, about 0.5 GB), but utilization is always 0%. To me it looks like the GPU is not used for training at all. The data have ~1000 features and there are ~50k rows.
Is there any way to find the reason for such issues?
A bit late to the party, but I am having the same issue with the current master branch, though only when using the Python interface. With the binary command-line version, I do see some speedup from using the dense GPU kernel (roughly 30%). I didn't have much time to dig into it, but I thought it could be a good hint if someone is willing to take a look.
Ah, got it, the compiler options are different. This did the trick for me (in setup.py):
extra_compile_args = ['-fopenmp','-ffast-math', '-march=native']
and for nvcc:
'--compiler-options', '-fPIC -fopenmp -O3 -ffast-math -march=native'
Running with GPU:
Yields:
The GPU is a Tesla P100 with 16 GB of RAM:
There are two Tesla P100s, but during the run we only used the first one (1.7 GB of GPU RAM used).
Running on the same dataset, but without the GPU:
Yields:
...which is about twice as fast as using the GPU. Both runs pretty much hammer the CPUs at 100% (CPU scaling works great --> 28 CPUs give a 6x speedup compared to 4 CPUs).
Thoughts?