remove ASM code with no benefits

breznak commented 9 years ago

During my experiments I figured the hand-tuned ASM code (in ArrayAlgo.hpp, ...) actually has no benefit over pure c++ code, as

compilers progressed over the time
we use compiler optimizations now

the benefits of this change are:

speed ~0%
portability++
clean code++

Reference: #236 #382

I tested my on:

gcc 4.8 + LTO + O3, clang 3.5, gcc 4.8 defaults
with/without ASM code (-DNTA_ASM)
simple TP.compute() loop, CLA model with SP+TP+anomaly detection on ~20k records
Results:

LINUX:

full : https://gist.github.com/breznak/4eb73459b05c03ebbc1c

gcc 4.8, optimized (-O3, -flto):
- * ASM: 20000 191.310 0.010 191.310 0.010 {_algorithms.Cells4_compute}
- * NO ASM: 20000 200.184 0.010 200.184 0.010 {_algorithms.Cells4_compute}
clang 3.5 (-O3):
- * ASM: 20000 225.019 0.011 225.019 0.011 {_algorithms.Cells4_compute}
- * NO ASM: 20000 222.895 0.011 222.895 0.011 {_algorithms.Cells4_compute}
gcc 4.8 (defaults, -O2, used for binary releases):
- * ASM: 20000 232.067 0.012 232.067 0.012 {_algorithms.Cells4_compute}
- * NO ASM: 20000 226.108 0.011 226.108 0.011 {_algorithms.Cells4_compute}
  OSX

(on OS X 10.10 by @rhyolight :+1: )

results of python -m cProfile --sort cumtime tp_large.py

remove_asm (52.253 seconds)
master (53.436 seconds)

2nd test: SP+TP+anomaly model on 20k records (note: different model and data than I used on linux, see discussion. But that does not matter.)

remove_asm (157.513 seconds)
master (159.679 seconds)
Windows

(tested on Win7, 64bit by @rcrowder :+1: )

sp_tp_profile branch code took 16.754 seconds
remove_asm branch code took 16.6329 seconds

scottpurdy commented 9 years ago

@breznak - so you are making this judgement based on testing on a single platform? Can you say what the platform you ran this was?

breznak commented 9 years ago

@scottpurdy true that, I missed windows.

I tested on Linux64 (ubuntu 14.10)
asked @rhyolight or anybody if they can test on Darwin64
@david-ragazzi @rcrowder can you please test on Windows64?

To test you need:

compile nupic.core with and without this branch
compile nupic with given (local) nupic.core
compare results from python -m cProfile --sort cumtime https://github.com/breznak/nupic/blob/bench_tools/tp_large.py
compare results from python -m cProfile --sort cumtime $NUPIC/scripts/runOpfExperiment.py /path/model/with/large/data (I used dataset with ~20k records, model with SP,TP,2048 cols)

exact times do not matter that much, difference between the branches is important. Note the results are only approximate, as we use just 1 run, and system load can affect that too.

rhyolight commented 9 years ago

asked @rhyolight or anybody if they can test on Darwin64

I think I missed that request. I'll try to do it today in the background.

rhyolight commented 9 years ago

(I used dataset with ~20k records, model with SP,TP,2048 cols)

Can I use the same data file? Or is there one in the repo that is sufficient?

rhyolight commented 9 years ago

OS X 10.10 results of python -m cProfile --sort cumtime tp_large.py

remove_asm (52.253 seconds)
master (53.436 seconds)

breznak commented 9 years ago

:+1: great, thank you @rhyolight ! I'm not sure I can share that data, but any file will do, eg I created http://uloz.to/xrFj9Kkr/data-csv with ~21k records. If you can try that too, please.

rhyolight commented 9 years ago

@breznak Just making sure I do this right... I'll need to swarm over that file to get a description.py, right? Then run_opf_experiment.py on the description.py?

breznak commented 9 years ago

@rhyolight no need for swarming, you can use any description.py, results are not so important, we just need to crunch the data.

You can use description.py provided at http://uloz.to/x4dumoVd/description-py it assumes:

/tmp/data.csv (from the prev link)
/tmp/swarm/description.py

to run python -m cProfile --sort cumtime $NUPIC/scripts/run_opf_experiment.py /tmp/swarm/

scottpurdy commented 9 years ago

There was obviously a lot of time spent on the asm code so there must have been a big benefit. Perhaps we should try updating it for 64 bit to see if we get a big improvement? It looks like the checks are specifically for 32 bit which is why you wouldn't see a difference.

I agree with the sentiment that inline assembler is really messy though so would be ok dropping it even without testing a 64 bit version if that is the consensus (curious what @subutai thinks).

breznak commented 9 years ago

the ASM code is for both 32/64, https://github.com/numenta/nupic.core/pull/383/files#diff-5fc4121d5c281de47186c9f30e425bdbL319 (and one more) ...I would have expected big difference as well, but there shows to be almost none, I think the compilers have advanced rapidly.

rhyolight commented 9 years ago

Thanks a bunch, @breznak. Here are the results of the 2nd test:

remove_asm (157.513 seconds)
master (159.679 seconds)

rcrowder commented 9 years ago

@breznak https://github.com/breznak Although we have the core library running on Windows. Use of advanced language features prevents the current build from working with Python (2.7 and 3.x). I could translate the tp_large Python to C++ for comparison? Running OPF is a while away.

On Tue, Mar 3, 2015 at 2:23 AM, Matthew Taylor notifications@github.com wrote:

Thanks a bunch, @breznak https://github.com/breznak. Here are the results of the 2nd test:

remove_asm https://gist.github.com/rhyolight/e9326cf694d9309411f1 (157.513 seconds)

master https://gist.github.com/rhyolight/3a1ffa02cc02ebc1099b (159.679 seconds)

— Reply to this email directly or view it on GitHub https://github.com/numenta/nupic.core/issues/381#issuecomment-76874784.

breznak commented 9 years ago

@rcrowder I already started translating tp_large to c++, it would be awesome if you could help me on https://github.com/numenta/nupic.core/pull/388 - the tp.compute() is causing a segfault and I dont know why. Ah, I see about the Win-python nupic part, I was hoping to run OPF natively soon. Hope you guys crack it like you did with nupic.core! :+1: :rice_cracker:

breznak commented 9 years ago

Here are the results of the 2nd test:

@rhyolight awsome! thank you very much for testing it out, Matt!

breznak commented 9 years ago

Tested with https://github.com/numenta/nupic.core/pull/388 on Linux64, full optimizations and speed difference is none (slightly in favor of the cleanup).

@rhyolight @scottpurdy @subutai -- the performance seems the same + the portability+cleanup, are there any objections to this PR?

breznak commented 9 years ago

@rcrowder could you do me one more favor? :pray: to test on Windows this PR vs. master on test from https://github.com/numenta/nupic.core/pull/388

rcrowder commented 9 years ago

@breznak I added a nupic::Timer and ran the 64 bit Release version of the test. Run under 64 bit Win7, a single core, and peaked at approx 65 megabytes.

sp_tp_profile branch code took 16.754 seconds
remove_asm branch code took 16.6329 seconds

rhyolight commented 9 years ago

It looks like there is no performance degredation across the board when removing ASM. Nice! :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner: :runner:

scottpurdy commented 9 years ago

Good work guys. But it is hard for me to tell if we are actually checking the time of the code in question. How exactly does the assembler code get called in your experiments? Did you actually verify that it gets called?

breznak commented 9 years ago

I added a nupic::Timer and ran the 64 bit Release version of the test.

@rcrowder great, thanks a bunch! :+1: Btw, if you want to make a PR to the sp_tp_profile branch with your Timer code, it will be welcome

breznak commented 9 years ago

But it is hard for me to tell if we are actually checking the time of the code in question. How exactly does the assembler code get called in your experiments? Did you actually verify that it gets called?

@scottpurdy good question indeed! :grey_question:

for a quick overview, the code that imports that class:

$ grep -R 'ArrayAlgo.hpp' ../../src/
../../src/nupic/math/StlIo.hpp:#include <nupic/math/ArrayAlgo.hpp>
../../src/nupic/math/SparseBinaryMatrix.hpp:#include <nupic/math/ArrayAlgo.hpp>
../../src/nupic/math/NearestNeighbor.hpp:#include <nupic/math/ArrayAlgo.hpp>
../../src/nupic/math/SparseTensor.hpp:#include <nupic/math/ArrayAlgo.hpp>
../../src/nupic/math/SparseMatrix.hpp:#include <nupic/math/ArrayAlgo.hpp>
../../src/nupic/algorithms/Segment.hpp:#include <nupic/math/ArrayAlgo.hpp> // is_sorted
../../src/nupic/algorithms/Segment.cpp:#include <nupic/math/ArrayAlgo.hpp> // is_in
../../src/nupic/algorithms/Svm.hpp:#include <nupic/math/ArrayAlgo.hpp> // for int checkSSE()
../../src/nupic/algorithms/Cells4.cpp:#include <nupic/math/ArrayAlgo.hpp> // is_in
../../src/test/unit/math/MathsTest.cpp:#include <nupic/math/ArrayAlgo.hpp>
../../src/test/unit/math/IndexUnitTest.cpp:#include <nupic/math/ArrayAlgo.hpp>
../../src/test/unit/math/SparseMatrix01UnitTest.cpp:#include <nupic/math/ArrayAlgo.hpp>

it looks like it's used for Sparse*Matrix & slightly in TP (Cells4, Segment), these classes (SP,TP) are tested in the benchmarks we used (tp_large.py, CLAModel run(), sp_tp_profile example from the other PR

usefulness: this is the most heavy-weight optimized code in our codebase, and still, if its removal shows no degradation in speed of the two most costly(in time) operations (TP.compute() & SP.compute()) then what's the point of having it? Either it's not (or marginally) used, or the optimizations are not so big.

In summary, imho still holds: ~0% speed, ++portability & cleanness.

rcrowder commented 9 years ago

@breznak Rather than pull a PR off of this PR, I've submitted the following change - https://github.com/rcrowder/nupic.core/commit/45be0ffaf701eded5ad5d763a6de693bcb074406 The additional fives lines are; 32, 44-45, and 91-92

scottpurdy commented 9 years ago

@breznak - You have shown that the file is imported, but not that the ASM code is actually used! There are several ways to do this. You could, for instance, put a logging statement inside the preprocessor conditional. Then if you see that in the logging output then you know you are executing the code in question. But it could be that either the function isn't being called or the preprocessor conditional isn't being evaluated as you expect.

breznak commented 9 years ago

Ok, I can test with that. My argumentation was, if the code were not used in this scenario, I can't see much more where it could be useful (not in SP/TP compute() , so where?)

breznak commented 9 years ago

@scottpurdy the code was not used by anything, so I removed the methods (along with a bunch of others, but I'd like to keep this PR about the ASM, so I'm willing to revert the last commits), please have a look.

scottpurdy commented 9 years ago

I'm not prepared to approve removing the assembler without hearing what @subutai thinks. It would be one thing to show that the assembler isn't actually faster but if we are instead making the case that it isn't use and thus can be removed then I don't think that is sufficient without considering if it will be useful to have in the future or what the intention of it was. Perhaps we should be using the code and then there would be a difference in the benchmarks.

rhyolight commented 9 years ago

@subutai Can you chime in here?

subutai commented 9 years ago

@breznak Sorry for late response - was traveling a lot the last two weeks. I'm a bit confused what PR I should look at. Which is the latest one with the ASM removal? And which one has the actual methods removed? Thanks.

breznak commented 9 years ago

Hi @subutai ! no problem :) Please see #383 , it's the only one PR on this matter, with the ASM removed.

rhyolight commented 9 years ago

@breznak So we have both @subutai and @scottpurdy hinting that this optimization may be premature. Can we put it on the back burner and try again after releasing 1.0?

breznak commented 9 years ago

yep, sure!

rhyolight commented 9 years ago

@breznak Thanks for understanding.

rhyolight commented 8 years ago

Issue closed due to inactivity

This issue has had no activity for a long time period, so it has been closed. If you believe it was closed in error, re-open the issue and leave a comment stating a justification for it.

numenta / nupic.core-legacy

remove ASM code with no benefits #381

Results:

LINUX:

OSX

Windows

Issue closed due to inactivity