Closed owulveryck closed 5 years ago
This looks amazing. That's a crazy diff
I'll fix up the fastCopy
bit and we're good to merge @owulveryck ... this is phenomenal;
@owulveryck I have pushed new code to separate out fastCopy
into a separate method. Wondering if you could run your benchmarks? My benchmark results are as follows (bear in mind my computer is slow)
benchmark old ns/op new ns/op delta
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-8 2873 2705 -5.85%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-8 2928 2933 +0.17%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-8 4420 3830 -13.35%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-8 4058 3739 -7.86%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-8 3440 3165 -7.99%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-8 4482 3895 -13.10%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-8 4310 3933 -8.75%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-8 3650 3572 -2.14%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-8 3687 3371 -8.57%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-8 3573 3313 -7.28%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-8 3543 3615 +2.03%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-8 4190 4000 -4.53%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-8 4502 4217 -6.33%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-8 3325 3395 +2.11%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-8 3586 3446 -3.90%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-8 5245 4451 -15.14%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-8 10253 8329 -18.77%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-8 4161 3783 -9.08%
BenchmarkDenseRepeat/Nonexistent_axis-8 3648 3455 -5.29%
benchmark old allocs new allocs delta
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-8 17 15 -11.76%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-8 17 15 -11.76%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-8 24 20 -16.67%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-8 24 20 -16.67%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-8 18 16 -11.11%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-8 24 20 -16.67%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-8 24 20 -16.67%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-8 23 19 -17.39%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-8 23 19 -17.39%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-8 23 19 -17.39%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-8 20 19 -5.00%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-8 23 21 -8.70%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-8 23 21 -8.70%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-8 17 16 -5.88%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-8 17 16 -5.88%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-8 27 25 -7.41%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-8 59 47 -20.34%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-8 12 12 +0.00%
BenchmarkDenseRepeat/Nonexistent_axis-8 10 10 +0.00%
benchmark old bytes new bytes delta
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-8 872 744 -14.68%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-8 896 768 -14.29%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-8 1320 1064 -19.39%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-8 1360 1104 -18.82%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-8 968 840 -13.22%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-8 1336 1080 -19.16%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-8 1360 1104 -18.82%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-8 1264 1008 -20.25%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-8 1288 1032 -19.88%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-8 1304 1048 -19.63%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-8 1160 1096 -5.52%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-8 1328 1200 -9.64%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-8 1344 1216 -9.52%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-8 960 896 -6.67%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-8 960 896 -6.67%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-8 1728 1600 -7.41%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-8 3833 3064 -20.06%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-8 848 848 +0.00%
BenchmarkDenseRepeat/Nonexistent_axis-8 736 736 +0.00%
@owulveryck if you like the results you see from the benchmarks please squash and merge. Thank you so much
I like this new interface !
Here is the result of my bench (with benchcmp
):
benchmark old ns/op new ns/op delta
BenchmarkDense_Mul_Unsafe-4 14441 13450 -6.86%
BenchmarkDense_ContiguousSliced_Mul_Unsafe-4 13373 13373 +0.00%
BenchmarkDense_NonContiguousSliced_Mul_Unsafe-4 445395 512722 +15.12%
BenchmarkDense_Transpose-4 222218 206483 -7.08%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-4 1936 1681 -13.17%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-4 1877 1436 -23.49%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-4 2726 2136 -21.64%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-4 3059 2304 -24.68%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-4 2499 1680 -32.77%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-4 3188 2124 -33.38%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-4 3201 2073 -35.24%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-4 2652 2866 +8.07%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-4 2460 3715 +51.02%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-4 2998 2265 -24.45%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-4 2405 2027 -15.72%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-4 3647 2559 -29.83%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-4 4537 2240 -50.63%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-4 6135 1789 -70.84%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-4 3798 1882 -50.45%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-4 6804 3065 -54.95%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-4 15772 4831 -69.37%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-4 4608 4003 -13.13%
BenchmarkDenseRepeat/Nonexistent_axis-4 4310 3632 -15.73%
benchmark old allocs new allocs delta
BenchmarkDense_Mul_Unsafe-4 1 1 +0.00%
BenchmarkDense_ContiguousSliced_Mul_Unsafe-4 1 1 +0.00%
BenchmarkDense_NonContiguousSliced_Mul_Unsafe-4 5 5 +0.00%
BenchmarkDense_Transpose-4 11 11 +0.00%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-4 17 15 -11.76%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-4 17 15 -11.76%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-4 24 20 -16.67%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-4 24 20 -16.67%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-4 18 16 -11.11%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-4 24 20 -16.67%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-4 24 20 -16.67%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-4 23 19 -17.39%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-4 23 19 -17.39%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-4 23 19 -17.39%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-4 20 19 -5.00%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-4 23 21 -8.70%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-4 23 21 -8.70%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-4 17 16 -5.88%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-4 17 16 -5.88%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-4 27 25 -7.41%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-4 59 47 -20.34%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-4 12 12 +0.00%
BenchmarkDenseRepeat/Nonexistent_axis-4 10 10 +0.00%
benchmark old bytes new bytes delta
BenchmarkDense_Mul_Unsafe-4 11 9 -18.18%
BenchmarkDense_ContiguousSliced_Mul_Unsafe-4 10 12 +20.00%
BenchmarkDense_NonContiguousSliced_Mul_Unsafe-4 396 396 +0.00%
BenchmarkDense_Transpose-4 20982 20979 -0.01%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-4 872 744 -14.68%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-4 896 768 -14.29%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-4 1320 1064 -19.39%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-4 1360 1104 -18.82%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-4 968 840 -13.22%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-4 1336 1080 -19.16%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-4 1360 1104 -18.82%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-4 1264 1008 -20.25%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-4 1288 1032 -19.88%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-4 1304 1048 -19.63%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-4 1160 1096 -5.52%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-4 1328 1200 -9.64%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-4 1344 1216 -9.52%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-4 960 896 -6.67%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-4 960 896 -6.67%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-4 1728 1600 -7.41%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-4 3832 3064 -20.04%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-4 848 848 +0.00%
BenchmarkDenseRepeat/Nonexistent_axis-4 736 736 +0.00%
The
denseRepeat
method of the standard engine is callingcopyDenseSlice
method in two embedded for loops. This method is slicing the source and destination at each pass, even if the start and end indices of the source did not change.This PR is optimizing this by avoiding the call to copyDenseSlice if possible and avoiding to create a slice array if it's not needed. The
Memcpy
method is called directly from thedenseRepeat
method to avoid any time loss by passing values or references to the function. On top of being less greedy with the memory, this change also reduces the object creation and by consequence the pressure on the GC.This commit includes a couple of benchmarks made from the tests.
The comparison gives the following results: