#9 Fix python benchmarks

vsbogd commented 6 years ago

Python benchmarks are implemented using Guile implementation as reference. Changes in Python bindings API are applied.

Link https://github.com/opencog/benchmark/pull/12/files?w=1 ignores spaces and makes code alignment is more relevant.

Python results:

$ ./atomspace/atomspace/atomspace_bm -A -c
OpenCog Atomspace Benchmark - Version 1.0.1

Random generator: MT19937
Random seed: 1527164403

Benchmarking Python's noop method 400000 times ..........
0.023060 seconds elapsed (17346170.39 per second)
Sum clock() time for all requests: 131 (0.000131 seconds, 3.05344e+09 requests per second)
------------------------------
Benchmarking Python's getType method 400000 times ..........
7.586225 seconds elapsed (52727.15 per second)
Sum clock() time for all requests: 7004312 (7.00431 seconds, 57107.7 requests per second)
------------------------------
Benchmarking Python's getTruthValue method 400000 times ..........
8.982693 seconds elapsed (44530.08 per second)
Sum clock() time for all requests: 8596201 (8.5962 seconds, 46532.2 requests per second)
------------------------------
Benchmarking Python's setTruthValue method 400000 times ..........
10.471440 seconds elapsed (38199.14 per second)
Sum clock() time for all requests: 9861954 (9.86195 seconds, 40559.9 requests per second)
------------------------------
Benchmarking Python's pointerCast method 400000 times ..........
0.190754 seconds elapsed (2096942.31 per second)
Sum clock() time for all requests: 0 (0 seconds, inf requests per second)
------------------------------
Benchmarking Python's getIncomingSet method 400000 times ..........
8.424081 seconds elapsed (47482.92 per second)
Sum clock() time for all requests: 8034402 (8.0344 seconds, 49785.9 requests per second)
------------------------------
Benchmarking Python's getIncomingSetSize method 400000 times ..........
0.187442 seconds elapsed (2133992.72 per second)
Sum clock() time for all requests: 0 (0 seconds, inf requests per second)
------------------------------
Benchmarking Python's getOutgoingSet method 400000 times ..........
7.629716 seconds elapsed (52426.59 per second)
Sum clock() time for all requests: 7233621 (7.23362 seconds, 55297.3 requests per second)
------------------------------
Benchmarking Python's addNode method 400000 times ..........
13.503772 seconds elapsed (29621.35 per second)
Sum clock() time for all requests: 13077400 (13.0774 seconds, 30587.1 requests per second)
------------------------------
Benchmarking Python's addLink method 400000 times ..........
22.010627 seconds elapsed (18173.04 per second)
Sum clock() time for all requests: 21142408 (21.1424 seconds, 18919.3 requests per second)
------------------------------
Benchmarking Python's removeAtom method 194000 times ...........
4.535461 seconds elapsed (42774.04 per second)
Sum clock() time for all requests: 3858507 (3.85851 seconds, 50278.5 requests per second)
------------------------------
Benchmarking Python's getHandlesByType method 400000 times ..........
2.239153 seconds elapsed (178638.98 per second)
Sum clock() time for all requests: 4474066000 (4474.07 seconds, 89.4041 requests per second)
------------------------------
Benchmarking Python's push_back method 400000 times ..........
0.000836 seconds elapsed (478528693.67 per second)
Sum clock() time for all requests: 0 (0 seconds, inf requests per second)
------------------------------
Benchmarking Python's push_back_reserve method 400000 times ..........
0.000828 seconds elapsed (483214746.54 per second)
Sum clock() time for all requests: 0 (0 seconds, inf requests per second)
------------------------------
Benchmarking Python's emplace_back method 400000 times ..........
0.000830 seconds elapsed (481965412.24 per second)
Sum clock() time for all requests: 0 (0 seconds, inf requests per second)
------------------------------
Benchmarking Python's emplace_back_reserve method 400000 times ..........
0.000840 seconds elapsed (476219585.58 per  #second)
Sum clock() time for all requests: 0 (0 seconds, inf requests per second)
------------------------------
Benchmarking Python's reserve method 400000 times ..........
0.000837 seconds elapsed (477847223.01 per second)
Sum clock() time for all requests: 0 (0 seconds, inf requests per second)
------------------------------

C++ results:

$ ./atomspace/atomspace/atomspace_bm -A
OpenCog Atomspace Benchmark - Version 1.0.1

Random generator: MT19937
Random seed: 1527164611

Benchmarking AtomSpace's noop method 400000 times ..........
0.022912 seconds elapsed (17458081.17 per second)
Sum clock() time for all requests: 136 (0.000136 seconds, 2.94118e+09 requests per second)
------------------------------
Benchmarking AtomSpace's getType method 400000 times ..........
0.175667 seconds elapsed (2277034.91 per second)
Sum clock() time for all requests: 2931 (0.002931 seconds, 1.36472e+08 requests per second)
------------------------------
Benchmarking AtomSpace's getTruthValue method 400000 times ..........
0.236796 seconds elapsed (1689216.72 per second)
Sum clock() time for all requests: 59137 (0.059137 seconds, 6.76396e+06 requests per second)
------------------------------
Benchmarking AtomSpace's setTruthValue method 400000 times ..........
0.606672 seconds elapsed (659334.81 per second)
Sum clock() time for all requests: 401896 (0.401896 seconds, 995282 requests per second)
------------------------------
Benchmarking AtomSpace's pointerCast method 400000 times ..........
0.212977 seconds elapsed (1878137.67 per second)
Sum clock() time for all requests: 24430 (0.02443 seconds, 1.63733e+07 requests per second)
------------------------------
Benchmarking AtomSpace's getIncomingSet method 400000 times ..........
0.367629 seconds elapsed (1088053.29 per second)
Sum clock() time for all requests: 176175 (0.176175 seconds, 2.27047e+06 requests per second)
------------------------------
Benchmarking AtomSpace's getIncomingSetSize method 400000 times ..........
0.295474 seconds elapsed (1353756.77 per second)
Sum clock() time for all requests: 81679 (0.081679 seconds, 4.89722e+06 requests per second)
------------------------------
Benchmarking AtomSpace's getOutgoingSet method 400000 times ..........
0.184256 seconds elapsed (2170891.77 per second)
Sum clock() time for all requests: 4991 (0.004991 seconds, 8.01443e+07 requests per second)
------------------------------
Benchmarking AtomSpace's addNode method 400000 times ..........
2.152410 seconds elapsed (185838.22 per second)
Sum clock() time for all requests: 1866968 (1.86697 seconds, 214251 requests per second)
------------------------------
Benchmarking AtomSpace's addLink method 400000 times ..........
4.436075 seconds elapsed (90169.80 per second)
Sum clock() time for all requests: 3860118 (3.86012 seconds, 103624 requests per second)
------------------------------
Benchmarking AtomSpace's removeAtom method 194000 times ...........
0.729076 seconds elapsed (266090.26 per second)
Sum clock() time for all requests: 389898 (0.389898 seconds, 497566 requests per second)
------------------------------
Benchmarking AtomSpace's getHandlesByType method 400000 times ..........
1.210773 seconds elapsed (330367.46 per second)
Sum clock() time for all requests: 2257008000 (2257.01 seconds, 177.226 requests per second)
------------------------------
Benchmarking AtomSpace's push_back method 400000 times ..........
0.000837 seconds elapsed (477847223.01 per second)
Sum clock() time for all requests: 0 (0 seconds, inf requests per second)
------------------------------
Benchmarking AtomSpace's push_back_reserve method 400000 times ..........
0.000843 seconds elapsed (474468778.28 per second)
Sum clock() time for all requests: 0 (0 seconds, inf requests per second)
------------------------------
Benchmarking AtomSpace's emplace_back method 400000 times ..........
0.000864 seconds elapsed (462947461.37 per second)
Sum clock() time for all requests: 0 (0 seconds, inf requests per second)
------------------------------
Benchmarking AtomSpace's emplace_back_reserve method 400000 times ..........
0.000894 seconds elapsed (447511763.14 per second)
Sum clock() time for all requests: 0 (0 seconds, inf requests per second)
------------------------------
Benchmarking AtomSpace's reserve method 400000 times ..........
0.000886 seconds elapsed (451485898.82 per second)
Sum clock() time for all requests: 0 (0 seconds, inf requests per second)
------------------------------

linas commented 6 years ago

Looks perfect to me!

FWIW, your C++ performance figures are comparable to mine; I have a slower CPU than you but probably a much bigger cache. So, your performance results for "small things" that easily fit into cache are better than mine: getTV, pointerCast, getOutgoingSet. I get better figures when things don't fit in your cache, but do fit in mine: addNode addLink removeAtom (viz, when comparing to the numbers at the bottom of diary.txt)

(BTW, if you ever feel the urge to keep a diary, don't edit diary.txt but instead start a new one. I should probably put my name on top of the existing one, since the results are specific to cpu, version numbers, etc.)

linas commented 6 years ago

The python numbers are pretty much exactly as expected, and are presumably within a factor of 2x of what you're measuring for guile.

vsbogd commented 6 years ago

Great, thank you for analysis!

Yes I should keep my performance results to use them as baseline while changing benchmarks or code. I will add my dairy as separate file according your suggestion.

I though about dairy.txt and about a way to fill it automatically reflecting test system configuration. I think it would be great to have dairy format which is grepable by some machine id or system parameters or both and suitable for automatic performance changes analysis.

linas commented 6 years ago

The benchmark is extremely noisy -- if you run it five times in a row, you will get numbers that can differ by a lot. I do not really understand why it is so noisy. The diary.txt numbers are usually best-out-of-three or best-out-of-five.

Once upon a time I used to get better numbers if I also ran some other CPU-intensive process at the same time, rather than benchmarking on an idle system. I never fully understod that, either.

Thus, "fully automated" reporting is problematic .. first, the noise washes out a lot of small changes, and anyway - one has to understand why things changed. Certain necessary changes usually slow things down; other performance tunings usually speed things up.

I also see a HUGE variation of numbers that reflect cache-boundary, cache-contention, and cache-address-aliasing issues (and ditto for the TLB's) -- altering sizes, alignments, forcing aliasing can result in performance changes of 2x better or worse. This makes if very hard to get consistent performance measurements, and to consistently evaluate the result of code-changes. A lot of the mystery changes (numbers that change, but should not have) are almost surely due to these cache and TLB issues.

linas commented 6 years ago

Cache: for example, I'm playing with your new changes, right now. I made a "minor" change, that should affect only link insertion -- I reserved a size for std::vector because reserving works faster than just push_back.

atom remove performance should not change at all. Yet it jumped from an average of 350K (average of three runs) to 400K (average of three runs). The best of the old three runs was 351779 the worst of the new three runs is 393171 -- a huge change for something where the C++ code is unchanged. However, the code-alignment in the shared library changes -- suggesting that there is either an I-cache aliasing issue, or an I-TLB aliasing issue (viz, mine are 4-way set-associative so if the measurement loop accidentally causes 5 pages to alias to the same address, then there will always be one page eviction per loop! This kind of stuff has been driving me nuts; I see no practical way of controlling for it.)

opencog / benchmark

#9 Fix python benchmarks #12