Use optimized lingebra math libraries

subutai commented 10 years ago

This super issue plans workflow for speed optimizations by using a specialized library.

Benefits:

SPEED!
fewer our-manually optimized (=hacked) code; cleanup
bugs/improvements delegated to lib's upstream
better portability
use of parallel cores (openMP), special CPU instructions (SSE,..), GPGPU backends

Requirements:

usability
- suitable licence
- platform support (Linux/Mac/Win; x86_64)
- convenient installation/bundling with nupic
functionality
- SSE instructions
- GPGPU backend support (CUDA, openCL)
- parallelism support (openMP)
- sparse matrices
programming
- clean & lean API
- active development
- (opt) bindings to other languages we use (Python)

Workflow:

[ ] decide on library implementation to use
create profiling/benchmark tools
- [ ] I have some python SP/TP profiling tools that can use the c++ implementation and roughly show methods that are bottlenecks (sp_large.py tp_large.py)
- [x] proper c++ profiling setups
[ ] hello world usecase using the chosen lib
focus on Temporal pooler - the current bottleneck
- [ ] replace ASM code with the library calls
- [ ] find another for-loop candidates for vectorization
- [ ] another function calls that can be sped up using the library?
[ ] Optimize Connections for Temporal memory
[ ] Optimize SparseMatrix classes (cleanup, mem reduction)
Optimize other (less significant parts)
- [ ] Optimize Spatial pooler
Misc

breznak commented 10 years ago

is this still an issue? (given the optimizations were not that big?) Before I was suggesting multiplatform eigen library, but not sure if we should bother at this time.

breznak commented 9 years ago

relevant: #193 #151

breznak commented 9 years ago

@subutai would you mind if I reword the issue a bit? former description:

subutai commented on Feb 20, 2014
See issue #27. We'd like to possibly add it back in later so tracking it here. Some related web pages:

https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man7/vecLib.7.html

Before adding it back in we should verify this really gives a performance improvement in real cases. This is doubtful.

breznak commented 8 years ago

When optimizing some critical parts of C++ code, this is pretty neat tool! : http://gcc.godbolt.org/#{%22version%22%3A3%2C%22filterAsm%22%3A{%22labels%22%3Atrue%2C%22directives%22%3Atrue%2C%22commentOnly%22%3Atrue}%2C%22compilers%22%3A[{%22sourcez%22%3A%22C4TwDgpgJhBmAEUD2BXARgGwvAbhAxsEgE7wD6ZAhsMMQJZorAQXwAUbehJZAznQC8IbAMwAmAJRSA3AChk6LIiTA2%2BJADtewXASLEAZPEoAadVp1d9RtBNkBvWee27upfPAC8xgFRo5xBDAKMQa7PgA2gAMALoA1JEAjDEScWoRYvGRIilyAL5AAAA%3D%22%2C%22compiler%22%3A%22g530%22%2C%22options%22%3A%22-Os%20-mavx%22}]}

FYI @oxtopus

rhyolight commented 8 years ago

Please review this issue

This issue needs to be reviewed by the original author or another contributor for applicability to the current codebase. The issue might be obsolete or need updating to match current standards and practices. If the issue is out of date, please close. Otherwise please leave a comment to justify its continuing existence. It may be closed in the future if no further activity is noted.

breznak commented 8 years ago

This is still valid, although noone is at the time working on the porting to lingebra libraries. I think it should stay open to monitor optimization progress and results. E.g. the PRs from @mrcslws speeding up TM could be referenced here for record.

rhyolight commented 8 years ago

Ok, so the issue is still valid, but it is also defined very broadly. It's labeled type:optimization so I'll track it that way, but I think the ticket description needs to be simplified. It's too long and complicated, and too many subjects and TODO items. We need to try to keep our issues simpler and smaller. This could turn into a super issue, but honestly I would rather break it up even farther. Something to think about @subutai.

subutai commented 8 years ago

@rhyolight Agreed. The issue is indeed pretty big right now. I think a good first step is to replace the use of sparse matrices in the python spatial pooler, python KNN classifier, and/or optimize the existing C++ SpatialPooler (which is not currently too optimized).

breznak commented 8 years ago

I think a good first step is to replace the use of sparse matrices in the python spatial pooler, python KNN classifier, and/or optimize the existing C++ SpatialPooler (which is not currently too optimized).

@subutai shouldn't the effort focus on the big-impact first? Aka the biggest bottlenecks, which is still TM/TP?

Another point, I think all optimizations would be slower in global scope first due to the needed conversions; so a good first move would be to implement an object (which we can then easily exchange when experimenting with optimizations) for a dense and sparse vector, goal of #948

You all will have to please forgive me for my novice understanding of the code (I'm still learning it... slowly), but I wanted to understand what kinds of calculations are being made within nupic that could require a library like Eigen or Armadillo or MKL or OpenBLAS or whatever. Is there massive matrix multiplication going on? Vector multiplication? Even if someone could just point me to proper class/function/file so I could get a better handle on it, I think I could offer up some help with this.

@jshahbazi Sorry, I missed your call, if you are still interested, we certainly would! The logic and operations are in algrithms/Connestions.hpp (for TemporalMemory) and in math/{Sparse,Dense}Matrix (for SpatialPooler).

The operations (someone please correct me): vector AND, searching N-highest entries, indexing and updating weights, ... @scottpurdy @mrcslws @subutai ?

The code can be benchmarked (globally, for a typical use) using #890 . Also please weight in on #948

subutai commented 8 years ago

shouldn't the effort focus on the big-impact first? Aka the biggest bottlenecks, which is still TM/TP?

The TM is actually not the biggest bottleneck right now. After changes by @mrcslws it is a pretty small part of the overall profile.

breznak commented 8 years ago

The TM is actually not the biggest bottleneck right now. ...

@subutai not really, it still is (even the code complexity compared to SP is higher)

Please see https://github.com/numenta/nupic/pull/3131 for my benchmarks:

fastest SP (c++ "2D" SP): 0.040 s/call
fastest TM/TP (cpp TP): 0.040 s/call
- fastest TM: 0.158 s/call

The old SP problem I've discovered with 1D vs 2D inputs: https://github.com/numenta/nupic.core/issues/380 Problem with TM speed: https://github.com/numenta/nupic.core/pull/890#issuecomment-219260326

breznak commented 8 years ago

We need to try to keep our issues simpler and smaller. This could turn into a super issue, but honestly I would rather break it up even farther

@rhyolight this IS a super issue with links to sub-issues where possible/active

breznak commented 8 years ago

Added https://github.com/numenta/nupic.core/issues/967 as a proposal that would halve the computation time easily.

subutai commented 8 years ago

not really, it still is (even the code complexity compared to SP is higher)

@breznak I will let @mrcslws comment on this. According to Marcus, when you run hotgym, the new TM is a small percentage of the overall profile. Marcus - am I mis-remembering?

I took a quick look at #3131 and sp_profile. I don't remember seeing this script before but it looks like the SP parameters are quite off in sp_profile. Why is potentialRadius only 3? It should be much larger to form good SDRs. Same with numActiveColumnsPerInhArea, etc. etc. I think the parameters should be set to realistic numbers and the profile re-run with those numbers.

mrcslws commented 8 years ago

I commented on https://github.com/numenta/nupic/pull/3131#issuecomment-221298390.

numenta / nupic.core-legacy

Use optimized lingebra math libraries #28

Please review this issue