ioam / topographica

A general-purpose neural simulator focusing on topographic maps.
topographica.org
BSD 3-Clause "New" or "Revised" License

Sparse GPU Topographica implementation and tests (cleaned up commit version) #621

Closed Tasignotas closed 9 years ago

Tasignotas commented 9 years ago

These are the changes I've been working on for my bachelor's project: a GPU-based version of Topographica with sparse projections, making the simulations several times faster.

A detailed description of the architecture of GPU Topographica, together with design justifications, implementation issues, and benchmarking results, can be found at http://homepages.inf.ed.ac.uk/s1137931/thesis.pdf.

This branch has a cleaned commit history, with all of the changes split into 4 commits.

jbednar commented 9 years ago

Perfect; thanks so much!

jlstevens commented 9 years ago

Fantastic!

I am looking forward to making use of your work. If I could get an 8X speed up on TCAL I would be absolutely delighted - having a simulation take 6 hours instead of 2 days would really make a huge difference to me!

Tasignotas commented 9 years ago

Glad to see it finally merged. It would be great if you could let me know what speedups you manage to achieve.

mjabri commented 8 years ago

Hi everybody, I am looking at the sparse/GPU implementation and have managed to run some tests, though with the following caveats:

1. The scikits.cuda.cusparse module seems to have changed and some functions are missing (in the version that installs with pip). So I found another implementation (https://github.com/grlee77/python-cuda-cffi) which seems to work. But the test results are not that encouraging.

2. I have run the gcal_sparse.ty model on the CPU (4 physical cores) and on the GPU (a GeForce 980m), and here is what I get:

On GPU (GeForce 980m), after first running 15 by hand:

```
topo_t000015.00_c1>>> %time topo.sim.run(10000)
CPU times: user 7min 33s, sys: 726 ms, total: 7min 34s
Wall time: 2min 32s
```

On CPU (4 cores):

```
topo_t001015.00_c2>>> %time topo.sim.run(10000)
CPU times: user 21min 51s, sys: 778 ms, total: 21min 52s
Wall time: 2min 46s
```

So even though the CPU cores are only utilized at about a quarter of their capacity in GPU mode, the performance advantage of the GPU over the CPU is essentially non-existent in wall-clock terms. Has anybody observed this?

3. Looking at the sparse implementation of CF/Projection, it seems the signatures of the functions (response_fn, learning_fn, ...) are different from CFProjection's, which means one cannot easily switch between a non-sparse implementation and a sparse/GPU one. I could hack the signatures to make them callable, but that would be a horrible hack. Any suggestions? A rough sketch of the kind of shim I mean is below.
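
To be concrete, this is the sort of adapter I have in mind; the argument names and the dummy projection object are purely made up for illustration, and the real signatures would of course have to be taken from the Topographica source:

```python
import numpy as np

# Illustrative shim only: the argument lists and DummyProjection below are
# placeholders, NOT the real CFProjection / sparse-projection signatures.

def dense_style_response(input_activity, weights, activity, strength):
    """Stand-in for a dense-style response function: activity = strength * W.x"""
    activity[:] = strength * weights.dot(input_activity)

class SparseStyleAdapter:
    """Expose a dense-style callable through a 'pass the projection object'
    calling convention, so the calling code does not need to change."""
    def __init__(self, dense_fn):
        self.dense_fn = dense_fn

    def __call__(self, proj, strength=1.0):
        # Unpack the (hypothetical) buffers held by the projection into the
        # positional arguments the dense function expects.
        self.dense_fn(proj.input_buffer, proj.weights, proj.activity, strength)

class DummyProjection:
    """Minimal stand-in for a projection holding its buffers."""
    def __init__(self, n_in, n_out):
        self.input_buffer = np.random.rand(n_in)
        self.weights = np.random.rand(n_out, n_in)
        self.activity = np.zeros(n_out)

proj = DummyProjection(16, 8)
SparseStyleAdapter(dense_style_response)(proj, strength=0.5)
print(proj.activity)  # 8 response values
```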

Tasignotas commented 8 years ago

Hi @mjabri,

Sorry for taking so long to reply. What cortex density value are you using for the V1 sheets? As described in the conclusions of my thesis, there is almost no advantage in using the GPU when the model has a relatively low cortical density, but it may run several times faster than the CPU simulation when a high cortical density (like 162) is selected.
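
For example, assuming gcal_sparse.ty exposes cortex_density as a global parameter in the same way gcal.ty does (please double-check that in the script, and adjust the path and flags to your setup), the density can be raised from the command line before timing a run:

```
$ ./topographica -i -p cortex_density=162 examples/gcal_sparse.ty
topo_t000000.00_c1>>> %time topo.sim.run(10000)
```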

Please let me know your findings, and if the GPU implementation is still not faster for you under high cortical density values, we can try to investigate further.

Ignotas

mjabri commented 8 years ago

Thanks Ignotas, I will run with cortex_density of 162 and share the results.

mjabri commented 8 years ago

Indeed, as Ignotas mentioned above, there is a BIG difference when the V1 density is larger. I tried 162 and here are the results (note that the GPU is a GeForce 980m and the CPU an i7-4860HQ). I ran twice for each:

CORTEX DENSITY 162.0

GPU (two separate runs):

```
topo_t000010.00_c4>>> %time topo.sim.run(10000)
CPU times: user 28min 44s, sys: 3min 36s, total: 32min 21s
Wall time: 26min 57s

topo_t000010.00_c3>>> %time topo.sim.run(10000)
CPU times: user 29min 6s, sys: 3min 15s, total: 32min 21s
Wall time: 26min 58s
```

CPU (two separate runs):

```
topo_t000010.00_c4>>> %time topo.sim.run(10000)
CPU times: user 23h 5min 30s, sys: 9.67 s, total: 23h 5min 40s
Wall time: 2h 53min 27s

topo_t000001.00_c1>>> %time topo.sim.run(10000)
CPU times: user 23h 9min 58s, sys: 9.58 s, total: 23h 10min 7s
Wall time: 2h 53min 59s
```
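
In wall-clock terms that is roughly 173 minutes versus 27 minutes, i.e. about a 6.4x speed-up from the GPU at this density (and roughly 23 hours versus 32 minutes of total CPU time, since the pure-CPU run keeps all four cores busy).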

As I still cannot display projections/CFs, I looked at the activities of the GPU and CPU runs side by side, and they looked identical to my eyes.

I haven't looked at the GPU implementation closely, so I don't really understand where the GPU benefits are coming from, whether they are specific to GCAL, where there are many zeros (to Philipp's point) in the artificially generated input patterns, and whether these benefits would still exist in the case of natural images.

I still have a problem displaying projections (the KeyError problem mentioned above), so it would be good to know whether others have this KeyError problem or whether it is specific to me! I tried on two systems, one VM (which cannot run the GPU, but still shows the problem in CPU mode) and one physical machine, and they both show the same KeyError. If this issue can be resolved I could then spend more effort on the sparse/GPU API...

Thanks

Marwan

philippjfr commented 8 years ago

The benefits are almost certainly just down to the memory bandwidth of the GPU versus the CPU. For small densities or areas the CPU can probably fit a lot of each operation in the CPU cache and doesn't have to constantly wait for new chunks of memory to be transferred. In larger models the CPU is probably starved for data and spends a lot of time waiting, while the GPU can fit much larger chunks in its memory and process them. So I don't think it has anything to do with the sparsity in this case.

Actual sparsity is probably just another way the GPU implementation can outperform the CPU one, because unlike the CPU version it uses sparse arrays. I also think the GPU performance gains should be independent of the input patterns (although sparser patterns probably do have some effect).
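
To put a rough number on that (a toy illustration only, with made-up sizes that have nothing to do with Topographica's actual data layout): a connection-field-like weight matrix with about 1% non-zeros is roughly 50x smaller in CSR form than as a dense float32 array, which is the kind of difference that decides whether the working set stays in fast memory:

```python
import numpy as np
from scipy import sparse

# Toy example: 10000 units, each connected to ~100 of 10000 inputs
# (sizes are made up, not taken from any Topographica model).
rng = np.random.default_rng(0)
n_units, n_inputs, fanin = 10_000, 10_000, 100

rows = np.repeat(np.arange(n_units), fanin)
cols = rng.integers(0, n_inputs, size=n_units * fanin)
vals = rng.random(n_units * fanin, dtype=np.float32)

W = sparse.csr_matrix((vals, (rows, cols)), shape=(n_units, n_inputs))

dense_bytes = n_units * n_inputs * 4  # float32 dense storage
sparse_bytes = W.data.nbytes + W.indices.nbytes + W.indptr.nbytes

print(f"dense:  {dense_bytes / 1e6:.0f} MB")   # ~400 MB
print(f"sparse: {sparse_bytes / 1e6:.0f} MB")  # ~8 MB
```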

I'll start looking into the GPU implementation myself later this week for my own work, so hopefully I'll have some news on your KeyError issue.

mjabri commented 8 years ago

OK. By the way, the KeyError 'Afferent' I am getting occurs not only with gcal_sparse.ty but also with gcal.ty. Also, tiny.ty seems to show projections fine.
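
In case it helps pin this down, a quick check at the Topographica prompt is to list which projection names each sheet actually has; this sketch assumes topo.sim.objects() returns the simulation's event processors and that projection sheets expose a projections() dict, so please adjust for your version:

```python
# List the projection names registered on every sheet, to see whether
# 'Afferent' is actually present under that name.
import topo

for name, obj in topo.sim.objects().items():
    get_projs = getattr(obj, 'projections', None)
    if callable(get_projs):
        print(name, sorted(get_projs().keys()))
```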

philippjfr commented 8 years ago

Very odd that they work fine for tiny but not for gcal. As I said, I'll look into it.