numenta / nupic-legacy

Numenta Platform for Intelligent Computing is an implementation of Hierarchical Temporal Memory (HTM), a theory of intelligence based strictly on the neuroscience of the neocortex.
http://numenta.org/
GNU Affero General Public License v3.0

GPGPU support #1067

Closed: rhyolight closed this issue 10 years ago

rhyolight commented 10 years ago

From: https://issues.numenta.org/browse/NPC-258

Evaluate if/how GPGPU programming techniques (openCL) can be used to speed up the CLA.

rhyolight commented 10 years ago

Comments

Frank Astier said:

We've tried GPU, with the GPUs on the MBPs. At the time it didn't work. You'll find that the bottleneck is transferring vectors to the GPU memory. Once the vectors are loaded in GPU memory, computations there can be efficient, but we couldn't get over that transfer bottleneck. It's been observed in other applications. I don't know what the status of that would be today. I'm guessing you would need a beefy GPU card - probably not worth it on a laptop MBP.

@breznak said:

Frank, I'd be interested in what technology/specific card and/or use case you ran these tests with. I agree that getting the data to the GPU is a bottleneck; I wonder how modern APUs (GPUs+CPUs on one chip) would behave. And with today's 2GB GPUs being quite common, maybe this could still be worthwhile for somebody looking for speed.

Frank said:

It was the card on the MBP 4 years ago. We didn't keep the use cases. We didn't look at exotic hardware because nupic needed to be available on OTC hardware.

@h2suzuki said:

I am very interested in this topic. Once the C++ implementation becomes clean, I definitely want to get involved here. I only know CUDA a little, but I believe today's GPUs tend to provide many features to hide the high latency of memory transfer, like streams, pinned memory, etc. I also strongly feel that we need special considerations in the design to take full advantage of a massively parallel but far-from-the-CPU computing resource like a GPU: e.g. avoiding pointers and lock contention, data parallelism, warps, caches, etc. What platform have you tried before? IMHO, Thrust is probably the best thing to try out as the initial step, after a good amount of design-strategy discussion. http://thrust.github.io/
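
To make the latency-hiding idea above concrete, here is a minimal sketch (not NuPIC code; the kernel, sizes, and names are placeholders) of how pinned host memory and CUDA streams let chunked transfers overlap with computation:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for whatever per-element work the CLA would do.
__global__ void work(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
  const int n = 1 << 22, numStreams = 4, chunk = n / numStreams;
  float *hIn, *hOut, *dIn, *dOut;
  cudaMallocHost(&hIn,  n * sizeof(float));   // pinned (page-locked) host buffers:
  cudaMallocHost(&hOut, n * sizeof(float));   // required for truly asynchronous copies
  cudaMalloc(&dIn,  n * sizeof(float));
  cudaMalloc(&dOut, n * sizeof(float));
  for (int i = 0; i < n; ++i) hIn[i] = 1.0f;

  cudaStream_t streams[numStreams];
  for (int s = 0; s < numStreams; ++s) cudaStreamCreate(&streams[s]);

  // Each stream copies its chunk in, runs the kernel, and copies the result back;
  // transfers in one stream can overlap kernels running in another.
  for (int s = 0; s < numStreams; ++s) {
    int off = s * chunk;
    cudaMemcpyAsync(dIn + off, hIn + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    work<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(dIn + off, dOut + off, chunk);
    cudaMemcpyAsync(hOut + off, dOut + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
  }
  cudaDeviceSynchronize();

  for (int s = 0; s < numStreams; ++s) cudaStreamDestroy(streams[s]);
  cudaFreeHost(hIn); cudaFreeHost(hOut); cudaFree(dIn); cudaFree(dOut);
  return 0;
}
```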

Frank said:

Hideaki -

Which parts of the C++ implementation do you find not clean? I wrote the maths libraries (SparseMatrix, SparseTensor...). IMHO, if the algos still use the sparse matrices, another issue I forgot to mention is that sparsity can kill the advantage of GPUs: the critical factor with sparse matrices is that memory accesses jump all over the place when you do a matrix/vector (or matrix/matrix) multiplication, and on a CPU that's bad for the cache lines. GPUs are probably not so different (?). So, IMHO, for GPUs there are two difficulties: moving the data to the GPU memory fast enough, and then seeing how the sparsity affects caching.

The sparse matrix implementations in the maths library do not have contention issues. Pointer usage is minimal (the non-zeros can all be stored contiguously). Making matrix/vector multiplication parallel is trivial and needs no coordination between threads: divide the matrix into blocks of rows. Cache-line invalidation is a harder problem IMHO (see above).
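
For readers not familiar with the layout being discussed, here is a minimal CSR-style matrix/vector multiply (a generic sketch, not NuPIC's SparseMatrix): the gather through `cols[k]` is the cache-unfriendly jumping described above, and the independent outer row loop is what row-blocking splits across workers.

```cpp
#include <cstddef>
#include <vector>

// y = A * x for a CSR-stored sparse matrix A.
std::vector<float> spmvCSR(const std::vector<std::size_t>& rowPtr, // size rows+1; row r owns non-zeros [rowPtr[r], rowPtr[r+1])
                           const std::vector<int>&         cols,   // column index of each non-zero
                           const std::vector<float>&       vals,   // value of each non-zero
                           const std::vector<float>&       x) {
  std::vector<float> y(rowPtr.size() - 1, 0.0f);
  for (std::size_t r = 0; r + 1 < rowPtr.size(); ++r) {  // rows are independent: trivially split into blocks
    float acc = 0.0f;
    for (std::size_t k = rowPtr[r]; k < rowPtr[r + 1]; ++k)
      acc += vals[k] * x[cols[k]];                       // gather: irregular reads of x thrash cache lines
    y[r] = acc;
  }
  return y;
}
```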

At the time we tried GPUs, we didn't have GPU cards, but there were GPUs accessible via CUDA on our MBPs. That's what we used (and of course, that was not much of a GPU).

For the maths library, I don't know if Thrust would be useful: the sparse matrix doesn't really need the STL, for example, and some bits of it were even coded in x86_64 assembly.

@h2suzuki said:

Frank-san, I'm sorry for any confusion caused. I thought refactoring of the SP was ongoing, which I wanted to mention. It seems the refactoring has already completed (NPC-226), so I'm a little outdated in that sense. I should look into the code more deeply now.

For the design discussion, in fact, I don't have any solid opinion yet. I'm only wondering whether a simple array (rather than a sparse matrix) might work better for this kind of device. For example: the input bits are transferred to the GPU through its constant memory. The region is set up within the GPU by the GPU itself. Each column gets a thread (and warps) in the GPU to calculate its overlap (no loop is best), probably with texture memory for good cache-line behaviour. SP inhibition can be done through fixed-time distributed prefix sorting, or bitonic sorting (Thrust). A GPU SDK sample can sort 32M elements in a second on my Windows desktop with a GTX 560, which is now two generations old. The indices of the active columns can be stored in shared memory to compute the TP, and so forth.

I am not particularly talking about your maths libraries when I mention pointers. I'm just saying that, for example, a simple linked list is not good for GPU kinds of things, but do we have many linked lists in the TP? (No?) IMHO, in order to take full advantage of a GPU, we need tens of thousands of items to compute at the same time. Otherwise the CPU will be faster, or the gain from the GPU will easily be offset by memory-transfer overhead. You know... Still, even after all efforts are made, it might end up with no speedup compared to the last experiment. I can't really tell, because I don't have any data yet.
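
A rough sketch of the "one thread per column" overlap idea above might look like the kernel below. All names and the flat CSR-style synapse layout are hypothetical, not NuPIC structures.

```cpp
// Each column's connected proximal synapses are stored contiguously; synOffsets
// is a row-pointer array (size numColumns+1) into synInputIdx.
// Launched e.g. as: computeOverlaps<<<(numColumns + 255) / 256, 256>>>(...)
__global__ void computeOverlaps(const unsigned char* input,       // input bit vector (0/1)
                                const int*           synOffsets,  // per-column start into synInputIdx
                                const int*           synInputIdx, // input index of each connected synapse
                                int*                 overlaps,    // one overlap score per column
                                int                  numColumns) {
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (col >= numColumns) return;
  int sum = 0;
  for (int k = synOffsets[col]; k < synOffsets[col + 1]; ++k)
    sum += input[synInputIdx[k]];   // count active connected synapses
  overlaps[col] = sum;
}
```

Inhibition could then pick the top overlaps, for example by sorting (overlap, column) pairs with a GPU sort such as thrust::sort_by_key, along the lines of the bitonic-sort suggestion above.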

Frank said:

Hideaki (-kun? ;-)

I am not sure what is the current state of the SP/TP, I haven't looked at that code in a long time.

For the sparse matrix stored as a dense matrix, there could be two problems: 1) real sparsity is around < 1% non-zeros, so multiplying by zero most of the time might remove the advantage of a GPU; 2) the dense form of the matrix might not even fit in the GPU memory, if it's really big.
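
To put rough numbers on point 2, with a hypothetical 65,536 x 65,536 float matrix: dense storage is about 4.3 billion entries x 4 bytes, roughly 17 GB, far beyond the GPUs discussed here, while at 1% non-zeros a CSR encoding needs about 43 million x (4-byte value + 4-byte column index), roughly 0.35 GB, plus a small row-pointer array.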

But that's only about sparse matrices; again, I'm not sure what is currently in the SP/TP. I don't think they use any linked lists, maps, or sets - these were all too expensive at runtime.

Can you point me to the SP/TP code you are looking at? It might jog my memory :-)

PS: are you in the US? in Japan? I lived in Japan for many years :-)

@h2suzuki said:

Haha. My boss sometimes called me with -kun in the office, so I'll probably have a feeling of tension when I'm called in such a way. ;-)

And, I understand your point.

The current SP/TP code can be found at the following location. I guess other people can provide better information than me.

https://github.com/numenta/nupic/tree/master/nta/algorithms

  • SP: FDRSpatial.hpp (or FDRCSpatial.hpp?)
  • TP: Cells4.{hpp,cpp}

To get GPGPU support done, I'm thinking of the following tasks as prerequisite.

  • Porting the code to the Windows platform, or getting another PC that I can install Linux on.
    • I use RHEL extensively for my job, so I might choose RHEL as my distro of choice.
  • Putting in many probes to collect performance statistics.
  • Reviewing the code to design the GPGPU support.
    • Should only a portion of the computation be done on the GPU? All of it? Or should the GPU and CPU both be used in parallel?
    • Is any architecture change required to support the GPU?
    • What platform should we use? CUDA, OpenCL, Thrust, cuSPARSE, a hybrid of those, etc.?

Then, I can start experiments for this issue. I believe this is a big task.

PS: I was in India for the last month, but now I'm in Japan. (y)

Frank said:

I think I wrote at least the original FDRSpatial and Cells4 - although it looks like they have been changed somewhat... I am wondering whether that code is too old to look at, and whether it is still used in the CLA or not.

I like your plan. RHEL is ok, but IMHO it tends to be a little bit behind in terms of tools (maybe just an impression I have).

I won't call you "-kun" - although it could almost have been justified because I'm kind of your senpai here - but we don't need to do that :-) Whereabouts in Japan are you? I lived in Tokyo the whole time I was there.

@h2suzuki said:

So it was Tokyo! Actually, I also live along the Chuo Line right now. Your Japanese is good, too. :-)

It looks like CMake is coming into the codebase soon. I'll try to run the CLA on Windows if nobody else is going to try.

For GPGPU acceleration, after pondering for some time, I noticed that my C# implementation of the CLA is actually close to CSR format. I have a number of columns. Each column has a variable-length list of connected synapses (just like someone mentioned on the ML). Each synapse on a column's proximal segment stores its index into the input array.

So, computing overlap is basically a prefix scan over all of these variable-length lists.

In order to get the best result out of parallelism, all lists should have equal length. However, the SP changes the lengths of the connected-synapse lists through learning. To resolve this issue, I searched the net and found the following.

Segmented scan aggregates all the variable-length lists, then partitions the data into equal-sized pieces to obtain the best parallelism.

Currently, my C# implementation takes 15-20 ms on 1024 columns with 630 connected synapses, without compiler optimization. If anybody has not tried segmented scan on the SP, I will implement it in my C# code and report the result.

My desktop has one quad-core CPU with HT. At least we can see how promising segmented scan is when computing overlap.
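
For what it's worth, when the per-column overlap is just a per-segment sum, the segmented idea above maps naturally onto Thrust's reduce_by_key (a segmented reduction rather than a full scan): every connected synapse carries its column id as a segment key. A minimal sketch, with illustrative names and assuming the keys for each column are stored consecutively:

```cpp
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Sum the input bit seen by each connected synapse, grouped by column.
// colKey lists each synapse's column id, with equal keys stored consecutively.
void overlapByKey(const thrust::device_vector<int>& colKey,
                  const thrust::device_vector<int>& inputBit,    // 0/1 per synapse
                  thrust::device_vector<int>&       outColumns,  // column ids with >= 1 synapse
                  thrust::device_vector<int>&       outOverlaps) // overlap per such column
{
  outColumns.resize(colKey.size());
  outOverlaps.resize(colKey.size());
  auto ends = thrust::reduce_by_key(colKey.begin(), colKey.end(), inputBit.begin(),
                                    outColumns.begin(), outOverlaps.begin());
  outColumns.resize(ends.first - outColumns.begin());
  outOverlaps.resize(ends.second - outOverlaps.begin());
}
```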

h2suzuki commented 10 years ago

The following thread includes segmented scan as its core idea (in the form of the Thrust library): http://lists.numenta.org/pipermail/nupic_lists.numenta.org/2014-May/003842.html

zorba-the-geek commented 10 years ago

Stumbled across this interesting project: ViennaCL is a free open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs.

They have CUDA and OpenCL backends. Recently a python wrapper library was released: https://pypi.python.org/pypi/pyviennacl

I don't know how applicable this is; I haven't done GPU or high-performance computing.

EDIT: I ran into this from this SE answer: http://scicomp.stackexchange.com/questions/10354/cuda-vs-opencl-as-of-late-2013
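
For reference, a sparse matrix-vector product in ViennaCL looks roughly like the sketch below, based on its documented compressed_matrix/prod interface; treat the exact calls as approximate rather than verified against a particular version, and the toy matrix as illustrative only.

```cpp
#include <map>
#include <vector>
#include <viennacl/compressed_matrix.hpp>
#include <viennacl/vector.hpp>
#include <viennacl/linalg/prod.hpp>

int main() {
  const std::size_t n = 1024;
  // Host-side sparse matrix: one std::map of (column -> value) per row.
  std::vector<std::map<unsigned int, float>> hostA(n);
  std::vector<float> hostX(n, 1.0f);
  for (unsigned int i = 0; i < n; ++i) hostA[i][i] = 2.0f;   // toy diagonal matrix

  viennacl::compressed_matrix<float> A(n, n);
  viennacl::vector<float> x(n), y(n);
  viennacl::copy(hostA, A);          // transfer the matrix to the OpenCL/CUDA device
  viennacl::copy(hostX, x);

  y = viennacl::linalg::prod(A, x);  // sparse matrix-vector product on the device

  std::vector<float> hostY(n);
  viennacl::copy(y, hostY);          // transfer the result back to the host
  return 0;
}
```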

breznak commented 10 years ago

moved to https://github.com/numenta/nupic.core/issues/193