htm-community / htm.core

Actively developed Hierarchical Temporal Memory (HTM) community fork (continuation) of NuPIC. Implementation for C++ and Python
http://numenta.org
GNU Affero General Public License v3.0

Parallel execution, Multithreading #255

Open breznak opened 5 years ago

breznak commented 5 years ago

Aim: run algorithms in parallel as much as possible, with maximal efficiency.

This will be an umbrella issue based on the following idea:

I see 3 different levels of parallelisation that could be implemented:

EDIT:

TODO:

ctrl-z-9000-times commented 5 years ago

Thanks for starting a tracker issue for this subject!

A computer has a fixed number of hardware threads, so it does not make sense to parallelize everything at every level. We should focus on 1 or 2 high-impact areas.

dkeeney commented 5 years ago

I agree, and NetworkAPI Regions seem to be a major candidate, mostly because there is very little interaction between them except for the data buffers.

breznak commented 5 years ago

so it does not make sense to parallelize everything at every level.

True, I was considering this too. On the other hand, you can now easily get a (Ryzen) CPU with 16/32+ threads, and there is also specialized HW, clusters, ...

I think not being able to parallelize enough will be a disadvantage of the low-level #214 approach.

For the latter two, I'd suggest something like:

So if you have a large network, you'll get the best utilization by giving all threads to the simultaneously running regions. But in a case like the MNIST example (my motivation), we have only one algo/region (SP), so one could try to give the threads to that instead.

Now, giving it a priority would make the process seamless:

and NetworkAPI Regions seem to be a major candidate, mostly because there is very little interaction between them except for the data buffers.

Definitely, it would probably also be the first thing we achieve. But see the MNIST case, and something like Region(preferredNumThreads=N) for slow workloads.

dkeeney commented 5 years ago

All of the algorithms are basically compute bound. This means that priorities will not get things done faster, but spreading the work over multiple threads would help. My machine has 6 cores (12 logical threads) and they could all be put to work. But even there, the best I could expect is around 10 times as fast.

Ultimately the biggest advantage would be to reduce the amount of work that needs to be done. Replace loops with other ways of organizing data so loops are not needed. Reduce the number of times that data must be copied.

breznak commented 5 years ago

Yes, but even a speedup linear in the number of threads would be a nice win.

Replace loops with other ways of organizing data so loops are not needed. Reduce the number of times that data must be copied.

Yep, that goes in parallel (sic) with this, #3. It would be nice if we could vectorize the for loops; that was my intention with SDR_t, but I'm not sure SDR helps with that (at least it prepares us for sparse data structures).

dkeeney commented 5 years ago

hmmm, 'vectorizing' the for loops does not get rid of the loops. They are just harder to find. What I mean is using things like map and set objects in places where it makes sense, to avoid having to iterate. Go from O(n) to O(log n) types of things. But I don't know, it might require a whole new way of looking at the algorithms to make that work.