htm-community / htm.core

Actively developed Hierarchical Temporal Memory (HTM) community fork (continuation) of NuPIC. Implementation for C++ and Python
http://numenta.org
GNU Affero General Public License v3.0

Parallel execution, Multithreading #255

Open breznak opened 5 years ago

breznak commented 5 years ago

Aim: run algorithms in parallel as much as possible, with maximal efficiency.

This will be an umbrella issue based on the following idea:

I see 3 different levels of parallelisation that could be implemented:

EDIT:

TODO:

ctrl-z-9000-times commented 5 years ago

Thanks for starting a tracker issue for this subject!

A computer has a fixed number of hardware threads, so it does not make sense to parallelize everything at every level. We should focus on 1 or 2 high-impact areas.

dkeeney commented 5 years ago

I agree, and NetworkAPI Regions seem to be a major candidate, mostly because there is very little interaction between them except for the data buffers.

breznak commented 5 years ago

so it does not make sense to parallelize everything at every level.

True, I was considering this too. On the other hand, you can now easily get a (Ryzen) CPU with 16/32+ threads, and there is also specialized HW, clusters, ...

I think not being able to parallelize enough will be a disadvantage of the low-level #214 approach.

For the latter two, I'd suggest something like:

So if you have a large network, you'll get the best utilization by giving all threads to the simultaneously running regions. But in a case like the MNIST example (my motivation), we have only one algo/region (SP), so one could try to give the threads to that instead.

Now, giving it a priority would make the process seamless:

and NetworkAPI Regions seem to be a major candidate, mostly because there is very little interaction between them except for the data buffers.

Definitely, it would probably also be the first thing we achieve. But see the MNIST case, and something like Region(preferredNumThreads=N) for slow workloads.

dkeeney commented 5 years ago

All of the algorithms are basically compute bound. This means that priorities will not get things done faster, but spreading the work over multiple threads would help. My machine has 6 cores (12 logical threads) and they could all be put to work. But even there, the best I could expect is around 10 times as fast.

Ultimately the biggest advantage would be to reduce the amount of work that needs to be done. Replace loops with other ways of organizing data so loops are not needed. Reduce the number of times that data must be copied.

breznak commented 5 years ago

Yes, but even a speedup linear in the number of threads would be a nice win.

Replace loops with other ways of organizing data so loops are not needed. Reduce the number of times that data must be copied.

Yep, that goes in parallel (sic) with this, #3. It would be nice if we could vectorize the for loops; that was my intention with SDR_t, but I'm not sure SDR helps with that (at least it prepares us for sparse data structures).

dkeeney commented 5 years ago

hmmm, 'vectorizing' the for loops does not get rid of the loops. They are just harder to find. What I mean is using things like map and set objects in places where it makes sense, to avoid having to iterate. Go from O(n) to O(log n) types of things. But I don't know, it might require a whole new way of looking at the algorithms to make that work.