gorgonia / tensor

package tensor provides efficient and generic n-dimensional arrays in Go that are useful for machine learning and deep learning purposes
Apache License 2.0
362 stars 49 forks source link

Optimization on the Repeat operation #43

Closed owulveryck closed 5 years ago

owulveryck commented 5 years ago

The denseRepeat method of the standard engine is calling copyDenseSlice method in two embedded for loops. This method is slicing the source and destination at each pass, even if the start and end indices of the source did not change.

This PR is optimizing this by avoiding the call to copyDenseSlice if possible and avoiding to create a slice array if it's not needed. The Memcpy method is called directly from the denseRepeat method to avoid any time loss by passing values or references to the function. On top of being less greedy with the memory, this change also reduces the object creation and by consequence the pressure on the GC.

This commit includes a couple of benchmarks made from the tests.

The comparison gives the following results:

benchmark                     old ns/op     new ns/op     delta
BenchmarkDense_Repeat0-4      1771          1718          -2.99%
BenchmarkDense_Repeat1-4      1744          1667          -4.42%
BenchmarkDense_Repeat2-4      3045          2381          -21.81%
BenchmarkDense_Repeat3-4      4722          2241          -52.54%
BenchmarkDense_Repeat4-4      1993          1712          -14.10%
BenchmarkDense_Repeat5-4      2685          2159          -19.59%
BenchmarkDense_Repeat6-4      3333          2367          -28.98%
BenchmarkDense_Repeat7-4      2538          1926          -24.11%
BenchmarkDense_Repeat8-4      2906          1969          -32.24%
BenchmarkDense_Repeat9-4      2538          1940          -23.56%
BenchmarkDense_Repeat10-4     2296          1942          -15.42%
BenchmarkDense_Repeat11-4     3224          2153          -33.22%
BenchmarkDense_Repeat12-4     2651          2186          -17.54%
BenchmarkDense_Repeat13-4     1904          1723          -9.51%
BenchmarkDense_Repeat14-4     1984          1707          -13.96%
BenchmarkDense_Repeat15-4     3890          2679          -31.13%
BenchmarkDense_Repeat16-4     8598          4897          -43.04%
BenchmarkDense_Repeat17-4     3373          3486          +3.35%
BenchmarkDense_Repeat18-4     3493          3075          -11.97%

benchmark                     old allocs     new allocs     delta
BenchmarkDense_Repeat0-4      17             15             -11.76%
BenchmarkDense_Repeat1-4      17             15             -11.76%
BenchmarkDense_Repeat2-4      24             20             -16.67%
BenchmarkDense_Repeat3-4      24             20             -16.67%
BenchmarkDense_Repeat4-4      18             16             -11.11%
BenchmarkDense_Repeat5-4      24             20             -16.67%
BenchmarkDense_Repeat6-4      24             20             -16.67%
BenchmarkDense_Repeat7-4      23             19             -17.39%
BenchmarkDense_Repeat8-4      23             19             -17.39%
BenchmarkDense_Repeat9-4      23             19             -17.39%
BenchmarkDense_Repeat10-4     20             19             -5.00%
BenchmarkDense_Repeat11-4     23             21             -8.70%
BenchmarkDense_Repeat12-4     23             21             -8.70%
BenchmarkDense_Repeat13-4     17             16             -5.88%
BenchmarkDense_Repeat14-4     17             16             -5.88%
BenchmarkDense_Repeat15-4     27             25             -7.41%
BenchmarkDense_Repeat16-4     59             47             -20.34%
BenchmarkDense_Repeat17-4     12             12             +0.00%
BenchmarkDense_Repeat18-4     10             10             +0.00%

benchmark                     old bytes     new bytes     delta
BenchmarkDense_Repeat0-4      872           744           -14.68%
BenchmarkDense_Repeat1-4      896           768           -14.29%
BenchmarkDense_Repeat2-4      1320          1064          -19.39%
BenchmarkDense_Repeat3-4      1360          1104          -18.82%
BenchmarkDense_Repeat4-4      968           840           -13.22%
BenchmarkDense_Repeat5-4      1336          1080          -19.16%
BenchmarkDense_Repeat6-4      1360          1104          -18.82%
BenchmarkDense_Repeat7-4      1264          1008          -20.25%
BenchmarkDense_Repeat8-4      1288          1032          -19.88%
BenchmarkDense_Repeat9-4      1304          1048          -19.63%
BenchmarkDense_Repeat10-4     1160          1096          -5.52%
BenchmarkDense_Repeat11-4     1328          1200          -9.64%
BenchmarkDense_Repeat12-4     1344          1216          -9.52%
BenchmarkDense_Repeat13-4     960           896           -6.67%
BenchmarkDense_Repeat14-4     960           896           -6.67%
BenchmarkDense_Repeat15-4     1728          1600          -7.41%
BenchmarkDense_Repeat16-4     3832          3064          -20.04%
BenchmarkDense_Repeat17-4     848           848           +0.00%
BenchmarkDense_Repeat18-4     736           736           +0.00%
coveralls commented 5 years ago

Coverage Status

Coverage increased (+0.07%) to 72.838% when pulling e106ac3537266b1d21d2fa44cafae6950ec0d4fb on test-optim-repeat into 06d5c67d95ed8e521d1d493d3c05495373b9d670 on master.

chewxy commented 5 years ago

This looks amazing. That's a crazy diff

owulveryck commented 5 years ago
chewxy commented 5 years ago

I'll fix up the fastCopy bit and we're good to merge @owulveryck ... this is phenomenal;

chewxy commented 5 years ago

@owulveryck I have pushed new code to separate out fastCopy into a separate method. Wondering if you could run your benchmarks? My benchmark results are as follows (bear in mind my computer is slow)

benchmark                                                                     old ns/op     new ns/op     delta
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-8                                2873          2705          -5.85%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-8                                2928          2933          +0.17%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-8                                4420          3830          -13.35%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-8                                4058          3739          -7.86%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-8                                3440          3165          -7.99%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-8                                4482          3895          -13.10%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-8                                4310          3933          -8.75%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-8                              3650          3572          -2.14%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-8                              3687          3371          -8.57%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-8                              3573          3313          -7.28%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-8     3543          3615          +2.03%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-8          4190          4000          -4.53%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-8          4502          4217          -6.33%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-8          3325          3395          +2.11%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-8          3586          3446          -3.90%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-8      5245          4451          -15.14%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-8                          10253         8329          -18.77%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-8                 4161          3783          -9.08%
BenchmarkDenseRepeat/Nonexistent_axis-8                                       3648          3455          -5.29%

benchmark                                                                     old allocs     new allocs     delta
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-8                                17             15             -11.76%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-8                                17             15             -11.76%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-8                                24             20             -16.67%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-8                                24             20             -16.67%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-8                                18             16             -11.11%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-8                                24             20             -16.67%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-8                                24             20             -16.67%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-8                              23             19             -17.39%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-8                              23             19             -17.39%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-8                              23             19             -17.39%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-8     20             19             -5.00%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-8          23             21             -8.70%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-8          23             21             -8.70%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-8          17             16             -5.88%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-8          17             16             -5.88%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-8      27             25             -7.41%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-8                          59             47             -20.34%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-8                 12             12             +0.00%
BenchmarkDenseRepeat/Nonexistent_axis-8                                       10             10             +0.00%

benchmark                                                                     old bytes     new bytes     delta
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-8                                872           744           -14.68%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-8                                896           768           -14.29%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-8                                1320          1064          -19.39%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-8                                1360          1104          -18.82%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-8                                968           840           -13.22%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-8                                1336          1080          -19.16%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-8                                1360          1104          -18.82%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-8                              1264          1008          -20.25%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-8                              1288          1032          -19.88%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-8                              1304          1048          -19.63%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-8     1160          1096          -5.52%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-8          1328          1200          -9.64%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-8          1344          1216          -9.52%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-8          960           896           -6.67%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-8          960           896           -6.67%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-8      1728          1600          -7.41%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-8                          3833          3064          -20.06%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-8                 848           848           +0.00%
BenchmarkDenseRepeat/Nonexistent_axis-8                                       736           736           +0.00%
chewxy commented 5 years ago

@owulveryck if you like the results you see from the benchmarks please squash and merge. Thank you so much

owulveryck commented 5 years ago

I like this new interface !

Here is the result of my bench (with benchcmp):

benchmark                                                                     old ns/op     new ns/op     delta
BenchmarkDense_Mul_Unsafe-4                                                   14441         13450         -6.86%
BenchmarkDense_ContiguousSliced_Mul_Unsafe-4                                  13373         13373         +0.00%
BenchmarkDense_NonContiguousSliced_Mul_Unsafe-4                               445395        512722        +15.12%
BenchmarkDense_Transpose-4                                                    222218        206483        -7.08%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-4                                1936          1681          -13.17%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-4                                1877          1436          -23.49%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-4                                2726          2136          -21.64%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-4                                3059          2304          -24.68%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-4                                2499          1680          -32.77%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-4                                3188          2124          -33.38%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-4                                3201          2073          -35.24%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-4                              2652          2866          +8.07%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-4                              2460          3715          +51.02%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-4                              2998          2265          -24.45%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-4     2405          2027          -15.72%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-4          3647          2559          -29.83%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-4          4537          2240          -50.63%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-4          6135          1789          -70.84%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-4          3798          1882          -50.45%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-4      6804          3065          -54.95%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-4                          15772         4831          -69.37%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-4                 4608          4003          -13.13%
BenchmarkDenseRepeat/Nonexistent_axis-4                                       4310          3632          -15.73%

benchmark                                                                     old allocs     new allocs     delta
BenchmarkDense_Mul_Unsafe-4                                                   1              1              +0.00%
BenchmarkDense_ContiguousSliced_Mul_Unsafe-4                                  1              1              +0.00%
BenchmarkDense_NonContiguousSliced_Mul_Unsafe-4                               5              5              +0.00%
BenchmarkDense_Transpose-4                                                    11             11             +0.00%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-4                                17             15             -11.76%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-4                                17             15             -11.76%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-4                                24             20             -16.67%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-4                                24             20             -16.67%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-4                                18             16             -11.11%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-4                                24             20             -16.67%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-4                                24             20             -16.67%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-4                              23             19             -17.39%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-4                              23             19             -17.39%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-4                              23             19             -17.39%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-4     20             19             -5.00%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-4          23             21             -8.70%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-4          23             21             -8.70%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-4          17             16             -5.88%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-4          17             16             -5.88%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-4      27             25             -7.41%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-4                          59             47             -20.34%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-4                 12             12             +0.00%
BenchmarkDenseRepeat/Nonexistent_axis-4                                       10             10             +0.00%

benchmark                                                                     old bytes     new bytes     delta
BenchmarkDense_Mul_Unsafe-4                                                   11            9             -18.18%
BenchmarkDense_ContiguousSliced_Mul_Unsafe-4                                  10            12            +20.00%
BenchmarkDense_NonContiguousSliced_Mul_Unsafe-4                               396           396           +0.00%
BenchmarkDense_Transpose-4                                                    20982         20979         -0.01%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-4                                872           744           -14.68%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-4                                896           768           -14.29%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-4                                1320          1064          -19.39%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-4                                1360          1104          -18.82%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-4                                968           840           -13.22%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-4                                1336          1080          -19.16%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-4                                1360          1104          -18.82%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-4                              1264          1008          -20.25%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-4                              1288          1032          -19.88%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-4                              1304          1048          -19.63%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-4     1160          1096          -5.52%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-4          1328          1200          -9.64%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-4          1344          1216          -9.52%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-4          960           896           -6.67%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-4          960           896           -6.67%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-4      1728          1600          -7.41%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-4                          3832          3064          -20.04%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-4                 848           848           +0.00%
BenchmarkDenseRepeat/Nonexistent_axis-4                                       736           736           +0.00%