gonum / matrix

Matrix packages for the Go language [DEPRECATED]

mat64: reduce number of comparisons for bounds checks #398

Closed by kortschak 7 years ago

kortschak commented 7 years ago

@btracey @vladimir-ch Please take a look.

Benchmarks to justify the change are in preparation.

Approach prompted by comment by gri in the tables CL currently in review.
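
The thread does not spell out the approach, so the following is only a sketch of one common way to reduce the number of comparisons in a bounds check: fold the two signed tests (`i < 0 || i >= n`) into a single unsigned comparison. The function names `inBoundsTwo` and `inBoundsOne` are hypothetical and are not taken from mat64; the sketch also assumes the length `n` is never negative.

```go
package main

import "fmt"

// inBoundsTwo is the conventional check: two signed comparisons.
// (Hypothetical name, for illustration only.)
func inBoundsTwo(i, n int) bool {
	return 0 <= i && i < n
}

// inBoundsOne folds both tests into one unsigned comparison.
// A negative i converts to a very large uint, so it fails the
// same test as an index that is too large. Assumes n >= 0.
// (Hypothetical name, for illustration only.)
func inBoundsOne(i, n int) bool {
	return uint(i) < uint(n)
}

func main() {
	const n = 3
	// Print both results side by side for a few indices,
	// including one below and one above the valid range.
	for _, i := range []int{-1, 0, 2, 3} {
		fmt.Println(i, inBoundsTwo(i, n), inBoundsOne(i, n))
	}
}
```

Both functions should agree for every index as long as `n` is non-negative; whether the single comparison is measurably faster depends on the compiler and the surrounding code, which is what the benchmarks are meant to show.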

kortschak commented 7 years ago

There's a lot of noise.

name                               old time/op  new time/op  delta
CholeskySmall-4                    2.40µs ± 4%  2.50µs ± 3%   +4.34%    (p=0.002 n=9+9)
CholeskyMedium-4                    421µs ± 2%   419µs ± 2%     ~       (p=0.222 n=9+9)
CholeskyLarge-4                     154ms ± 2%   154ms ± 2%     ~     (p=0.684 n=10+10)
MulDense100Half-4                   276µs ± 7%   301µs ±10%   +9.06%  (p=0.000 n=10+10)
MulDense100Tenth-4                  105µs ± 6%   114µs ±13%   +8.24%   (p=0.002 n=9+10)
MulDense1000Half-4                  130ms ± 0%   130ms ± 1%     ~     (p=0.579 n=10+10)
MulDense1000Tenth-4                43.6ms ± 1%  43.8ms ± 1%   +0.52%   (p=0.022 n=10+9)
MulDense1000Hundredth-4            18.7ms ± 2%  18.8ms ± 3%     ~     (p=0.971 n=10+10)
MulDense1000Thousandth-4           13.5ms ± 4%  13.2ms ± 2%   -2.14%  (p=0.015 n=10+10)
PreMulDense100Half-4                274µs ± 5%   280µs ±12%     ~     (p=0.579 n=10+10)
PreMulDense100Tenth-4              82.1µs ± 9%  84.9µs ±11%     ~     (p=0.436 n=10+10)
PreMulDense1000Half-4               128ms ± 1%   128ms ± 1%     ~      (p=0.780 n=10+9)
PreMulDense1000Tenth-4             42.7ms ± 2%  42.8ms ± 1%     ~      (p=0.720 n=9+10)
PreMulDense1000Hundredth-4         17.6ms ± 3%  17.5ms ± 1%     ~     (p=0.796 n=10+10)
PreMulDense1000Thousandth-4        12.0ms ± 2%  11.9ms ± 1%     ~       (p=0.730 n=9+9)
Row10-4                            64.9ns ± 1%  64.7ns ± 0%     ~     (p=0.086 n=10+10)
Row100-4                           84.5ns ± 1%  83.3ns ± 0%   -1.44%  (p=0.000 n=10+10)
Row1000-4                           252ns ± 0%   256ns ± 0%   +1.80%   (p=0.000 n=9+10)
Exp10-4                            34.3µs ± 0%  34.3µs ± 0%     ~      (p=0.108 n=10+9)
Exp100-4                           6.24ms ± 8%  6.40ms ± 2%     ~      (p=0.211 n=10+9)
Exp1000-4                           2.49s ± 4%   2.46s ± 3%     ~     (p=0.190 n=10+10)
Pow10_3-4                          4.62µs ± 0%  4.60µs ± 0%   -0.52%   (p=0.000 n=9+10)
Pow100_3-4                          837µs ±11%   844µs ± 7%     ~     (p=0.796 n=10+10)
Pow1000_3-4                         449ms ±12%   413ms ± 2%   -8.01%    (p=0.006 n=9+8)
Pow10_4-4                          6.53µs ± 0%  6.50µs ± 0%   -0.42%    (p=0.000 n=9+9)
Pow100_4-4                         1.49ms ± 4%  1.26ms ± 5%  -15.08%  (p=0.000 n=10+10)
Pow1000_4-4                         636ms ± 4%   620ms ± 3%     ~       (p=0.050 n=9+9)
Pow10_5-4                          6.56µs ± 1%  6.50µs ± 0%   -0.84%   (p=0.000 n=9+10)
Pow100_5-4                         1.46ms ± 3%  1.28ms ±12%  -12.85%    (p=0.000 n=9+9)
Pow1000_5-4                         639ms ± 3%   623ms ± 3%   -2.49%   (p=0.028 n=10+9)
Pow10_6-4                          8.60µs ± 1%  8.52µs ± 0%   -0.99%   (p=0.000 n=9+10)
Pow100_6-4                         1.95ms ± 2%  1.69ms ±11%  -13.36%   (p=0.000 n=9+10)
Pow1000_6-4                         848ms ± 2%   836ms ± 2%   -1.46%    (p=0.027 n=8+9)
Pow10_7-4                          8.57µs ± 3%  8.42µs ± 0%   -1.75%   (p=0.000 n=10+9)
Pow100_7-4                         1.98ms ± 2%  1.64ms ±12%  -17.27%  (p=0.000 n=10+10)
Pow1000_7-4                         850ms ± 3%   837ms ± 4%     ~      (p=0.122 n=8+10)
Pow10_8-4                          10.5µs ± 2%  10.3µs ± 0%   -1.72%   (p=0.001 n=10+9)
Pow100_8-4                         2.49ms ± 4%  2.17ms ± 7%  -12.58%  (p=0.000 n=10+10)
Pow1000_8-4                         1.10s ± 8%   1.04s ± 1%   -5.69%    (p=0.003 n=9+9)
Pow10_9-4                          8.46µs ± 1%  8.48µs ± 1%     ~       (p=0.213 n=9+9)
Pow100_9-4                         1.99ms ± 3%  1.66ms ± 9%  -16.56%  (p=0.000 n=10+10)
Pow1000_9-4                         1.07s ±19%   0.83s ± 1%  -21.84%   (p=0.000 n=10+9)
MulTransDense100Half-4              490µs ± 3%   419µs ± 7%  -14.40%  (p=0.000 n=10+10)
MulTransDense100Tenth-4             490µs ± 2%   423µs ± 4%  -13.76%  (p=0.000 n=10+10)
MulTransDense1000Half-4             242ms ±10%   196ms ± 2%  -18.89%   (p=0.000 n=10+9)
MulTransDense1000Tenth-4            214ms ±12%   197ms ± 1%   -8.10%  (p=0.000 n=10+10)
MulTransDense1000Hundredth-4        214ms ±11%   198ms ± 0%   -7.38%    (p=0.000 n=9+9)
MulTransDense1000Thousandth-4       235ms ±21%   199ms ± 0%  -15.21%   (p=0.000 n=10+7)
MulTransDenseSym100Half-4           459µs ± 3%   413µs ± 4%   -9.85%   (p=0.000 n=10+9)
MulTransDenseSym100Tenth-4          460µs ± 3%   427µs ± 4%   -7.27%  (p=0.000 n=10+10)
MulTransDenseSym1000Half-4          222ms ±19%   200ms ± 2%   -9.97%   (p=0.000 n=10+9)
MulTransDenseSym1000Tenth-4         231ms ±12%   199ms ± 0%  -13.70%   (p=0.000 n=10+8)
MulTransDenseSym1000Hundredth-4     227ms ± 7%   199ms ± 0%  -12.23%    (p=0.000 n=9+8)
MulTransDenseSym1000Thousandth-4    255ms ± 6%   200ms ± 1%  -21.64%    (p=0.000 n=9+9)
InnerSmSm-4                         202ns ± 2%   207ns ± 0%   +2.27%   (p=0.000 n=10+9)
InnerMedMed-4                      5.98µs ± 2%  5.84µs ± 0%   -2.24%   (p=0.000 n=10+8)
InnerLgLg-4                         714µs ±16%   648µs ± 1%   -9.20%   (p=0.000 n=10+9)
InnerLgSm-4                        15.4µs ± 3%  15.3µs ± 0%   -1.03%    (p=0.003 n=9+9)
MarshalDense10-4                    135ns ± 4%   135ns ± 4%     ~     (p=0.864 n=10+10)
MarshalDense100-4                   877ns ± 4%   868ns ± 3%     ~      (p=0.250 n=9+10)
MarshalDense1000-4                 7.90µs ± 4%  7.85µs ± 3%     ~     (p=0.684 n=10+10)
MarshalDense10000-4                72.8µs ± 4%  73.0µs ± 1%     ~     (p=0.869 n=10+10)
UnmarshalDense10-4                  129ns ± 5%   127ns ± 4%     ~     (p=0.091 n=10+10)
UnmarshalDense100-4                 835ns ± 6%   835ns ± 3%     ~     (p=1.000 n=10+10)
UnmarshalDense1000-4               7.20µs ± 2%  7.31µs ± 3%   +1.45%   (p=0.016 n=8+10)
UnmarshalDense10000-4              67.8µs ± 5%  67.7µs ± 2%     ~     (p=0.912 n=10+10)
MarshalToDense10-4                  149ns ± 2%   149ns ± 1%     ~      (p=0.308 n=10+9)
MarshalToDense100-4                1.03µs ± 2%  1.03µs ± 2%     ~      (p=0.706 n=9+10)
MarshalToDense1000-4               9.73µs ± 1%  9.66µs ± 0%   -0.65%  (p=0.000 n=10+10)
MarshalToDense10000-4              96.7µs ± 1%  96.1µs ± 0%   -0.56%    (p=0.002 n=9+9)
UnmarshalFromDense10-4              353ns ± 1%   369ns ± 2%   +4.39%  (p=0.000 n=10+10)
UnmarshalFromDense100-4            2.58µs ± 4%  2.58µs ± 2%     ~     (p=0.927 n=10+10)
UnmarshalFromDense1000-4           24.0µs ± 2%  24.1µs ± 3%     ~     (p=0.353 n=10+10)
UnmarshalFromDense10000-4           237µs ± 4%   237µs ± 3%     ~     (p=0.631 n=10+10)
MarshalVector10-4                   128ns ± 3%   129ns ± 2%     ~     (p=0.261 n=10+10)
MarshalVector100-4                  861ns ± 2%   856ns ± 2%     ~     (p=0.540 n=10+10)
MarshalVector1000-4                8.54µs ± 3%  8.53µs ± 2%     ~      (p=0.968 n=10+9)
MarshalVector10000-4               83.1µs ± 3%  81.1µs ± 3%   -2.44%  (p=0.009 n=10+10)
UnmarshalVector10-4                 121ns ± 6%   121ns ± 2%     ~     (p=0.928 n=10+10)
UnmarshalVector100-4                827ns ± 3%   816ns ± 3%     ~     (p=0.183 n=10+10)
UnmarshalVector1000-4              7.38µs ± 2%  7.30µs ± 1%     ~      (p=0.122 n=10+8)
UnmarshalVector10000-4             67.4µs ± 2%  66.8µs ± 3%     ~      (p=0.139 n=8+10)
MarshalToVector10-4                 135ns ± 2%   147ns ± 1%   +9.32%   (p=0.000 n=10+8)
MarshalToVector100-4                946ns ± 1%   954ns ± 2%     ~      (p=0.055 n=9+10)
MarshalToVector1000-4              8.99µs ± 0%  9.22µs ± 1%   +2.55%    (p=0.000 n=9+9)
MarshalToVector10000-4             89.3µs ± 0%  91.4µs ± 0%   +2.41%   (p=0.000 n=8+10)
UnmarshalFromVector10-4             343ns ± 2%   342ns ± 2%     ~      (p=1.000 n=9+10)
UnmarshalFromVector100-4           2.57µs ± 2%  2.56µs ± 3%     ~     (p=0.739 n=10+10)
UnmarshalFromVector1000-4          24.1µs ± 2%  24.2µs ± 2%     ~     (p=0.247 n=10+10)
UnmarshalFromVector10000-4          239µs ± 2%   236µs ± 1%   -1.22%   (p=0.006 n=10+9)
Pool10by10Uncleared-4              60.3ns ± 0%  59.1ns ± 0%   -2.05%    (p=0.000 n=8+7)
Pool10by10Cleared-4                86.4ns ± 2%  86.0ns ± 1%     ~      (p=0.529 n=10+8)
New10by10-4                         433ns ± 2%   434ns ± 3%     ~     (p=1.000 n=10+10)
Pool100by100Uncleared-4            59.5ns ± 2%  59.4ns ± 1%     ~     (p=0.807 n=10+10)
Pool100by100Cleared-4              2.88µs ± 0%  2.87µs ± 0%   -0.17%  (p=0.015 n=10+10)
New100by100-4                      18.9µs ± 1%  19.2µs ± 1%   +1.44%  (p=0.000 n=10+10)
MulWorkspaceDense100Half-4          379µs ± 8%   434µs ±11%  +14.57%  (p=0.000 n=10+10)
MulWorkspaceDense100Tenth-4         373µs ± 7%   416µs ± 9%  +11.52%  (p=0.000 n=10+10)
MulWorkspaceDense1000Half-4         190ms ± 0%   189ms ± 1%   -0.32%    (p=0.031 n=9+9)
MulWorkspaceDense1000Tenth-4        196ms ± 3%   194ms ± 3%     ~     (p=0.393 n=10+10)
MulWorkspaceDense1000Hundredth-4    215ms ± 9%   212ms ± 4%     ~     (p=0.853 n=10+10)
MulWorkspaceDense1000Thousandth-4  14.4ms ± 6%  14.1ms ±21%     ~     (p=0.190 n=10+10)
AddScaledVec10Inc1-4               42.9ns ± 2%  41.9ns ± 1%   -2.29%  (p=0.001 n=10+10)
AddScaledVec100Inc1-4              88.3ns ± 2%  87.8ns ± 0%     ~     (p=0.315 n=10+10)
AddScaledVec1000Inc1-4              564ns ± 3%   558ns ± 0%     ~     (p=0.198 n=10+10)
AddScaledVec10000Inc1-4            7.85µs ± 0%  7.67µs ± 0%   -2.32%  (p=0.000 n=10+10)
AddScaledVec100000Inc1-4            110µs ± 0%   110µs ± 0%     ~      (p=0.278 n=10+9)
AddScaledVec10Inc2-4               55.1ns ± 0%  55.1ns ± 0%     ~       (p=0.082 n=9+9)
AddScaledVec100Inc2-4               231ns ± 1%   231ns ± 0%     ~      (p=0.959 n=10+9)
AddScaledVec1000Inc2-4             2.44µs ± 0%  2.43µs ± 3%     ~      (p=0.162 n=8+10)
AddScaledVec10000Inc2-4            24.2µs ± 0%  24.2µs ± 0%   -0.12%    (p=0.029 n=9+8)
AddScaledVec100000Inc2-4            246µs ± 0%   246µs ± 2%     ~     (p=0.481 n=10+10)
AddScaledVec10Inc20-4              55.1ns ± 0%  55.5ns ± 2%   +0.76%   (p=0.008 n=8+10)
AddScaledVec100Inc20-4              231ns ± 0%   236ns ± 0%   +2.38%   (p=0.000 n=7+10)
AddScaledVec1000Inc20-4            3.18µs ± 0%  3.34µs ± 1%   +5.02%   (p=0.000 n=8+10)
AddScaledVec10000Inc20-4           49.0µs ± 1%  49.9µs ± 1%   +1.84%  (p=0.000 n=10+10)
AddScaledVec100000Inc20-4          1.84ms ± 2%  1.78ms ± 2%   -2.92%    (p=0.000 n=9+9)
ScaleVec10Inc1-4                   18.8ns ± 1%  19.0ns ± 2%     ~     (p=0.212 n=10+10)
ScaleVec100Inc1-4                  43.1ns ± 3%  41.9ns ± 0%   -2.68%   (p=0.008 n=10+7)
ScaleVec1000Inc1-4                  351ns ± 8%   363ns ± 0%     ~     (p=0.179 n=10+10)
ScaleVec10000Inc1-4                4.53µs ± 2%  4.52µs ± 1%     ~      (p=0.951 n=10+9)
ScaleVec100000Inc1-4               75.8µs ± 3%  75.3µs ± 0%     ~      (p=0.400 n=10+9)
ScaleVec10Inc2-4                   24.4ns ± 3%  24.0ns ± 0%   -1.62%   (p=0.010 n=10+9)
ScaleVec100Inc2-4                   132ns ± 1%   128ns ± 0%   -2.87%    (p=0.000 n=9+9)
ScaleVec1000Inc2-4                 1.23µs ± 2%  1.22µs ± 0%     ~      (p=0.458 n=10+9)
ScaleVec10000Inc2-4                12.5µs ± 1%  12.2µs ± 0%   -2.06%  (p=0.000 n=10+10)
ScaleVec100000Inc2-4                142µs ± 0%   143µs ± 0%   +0.47%    (p=0.000 n=8+9)
ScaleVec10Inc20-4                  23.9ns ± 1%  24.5ns ± 1%   +2.21%    (p=0.000 n=8+9)
ScaleVec100Inc20-4                  130ns ± 3%   132ns ± 3%     ~     (p=0.072 n=10+10)
ScaleVec1000Inc20-4                1.74µs ± 2%  1.73µs ± 1%     ~      (p=0.483 n=10+9)
ScaleVec10000Inc20-4               26.9µs ± 2%  26.9µs ± 5%     ~      (p=0.447 n=9+10)
ScaleVec100000Inc20-4               852µs ± 3%   823µs ± 1%   -3.46%  (p=0.000 n=10+10)
AddVec10Inc1-4                     35.3ns ± 0%  36.3ns ± 1%   +2.80%   (p=0.000 n=7+10)
AddVec100Inc1-4                    83.1ns ± 0%  85.3ns ± 0%   +2.60%    (p=0.000 n=8+8)
AddVec1000Inc1-4                    554ns ± 1%   569ns ± 1%   +2.72%   (p=0.000 n=9+10)
AddVec10000Inc1-4                  8.19µs ± 0%  7.83µs ± 4%   -4.38%    (p=0.000 n=9+9)
AddVec100000Inc1-4                  111µs ± 1%   110µs ± 0%     ~     (p=0.247 n=10+10)
AddVec10Inc2-4                     54.0ns ± 2%  53.1ns ± 0%   -1.76%  (p=0.001 n=10+10)
AddVec100Inc2-4                     229ns ± 1%   228ns ± 0%     ~     (p=0.211 n=10+10)
AddVec1000Inc2-4                   2.44µs ± 1%  2.42µs ± 2%     ~      (p=0.074 n=9+10)
AddVec10000Inc2-4                  24.8µs ± 3%  24.0µs ± 2%   -3.02%  (p=0.000 n=10+10)
AddVec100000Inc2-4                  251µs ± 3%   244µs ± 2%   -2.71%   (p=0.023 n=10+9)
AddVec10Inc20-4                    53.1ns ± 0%  53.1ns ± 0%     ~       (p=0.459 n=9+9)
AddVec100Inc20-4                    230ns ± 1%   230ns ± 2%     ~     (p=0.559 n=10+10)
AddVec1000Inc20-4                  3.11µs ± 0%  3.26µs ± 0%   +5.04%   (p=0.000 n=9+10)
AddVec10000Inc20-4                 49.3µs ± 2%  48.7µs ± 0%   -1.30%   (p=0.017 n=10+9)
AddVec100000Inc20-4                1.85ms ± 2%  1.81ms ± 1%   -2.50%   (p=0.000 n=10+9)
SubVec10Inc1-4                     35.6ns ± 1%  35.3ns ± 0%   -0.58%    (p=0.025 n=8+9)
SubVec100Inc1-4                    94.6ns ± 1%  95.2ns ± 2%   +0.59%    (p=0.026 n=9+9)
SubVec1000Inc1-4                    553ns ± 0%   557ns ± 1%   +0.64%    (p=0.001 n=8+9)
SubVec10000Inc1-4                  8.34µs ± 4%  7.66µs ± 0%   -8.10%  (p=0.000 n=10+10)
SubVec100000Inc1-4                  115µs ± 4%   113µs ± 0%     ~      (p=0.156 n=10+9)
SubVec10Inc2-4                     53.8ns ± 1%  54.7ns ± 5%     ~      (p=0.127 n=8+10)
SubVec100Inc2-4                     253ns ± 1%   252ns ± 0%   -0.55%   (p=0.016 n=10+8)
SubVec1000Inc2-4                   2.45µs ± 2%  2.43µs ± 0%   -0.72%    (p=0.000 n=9+8)
SubVec10000Inc2-4                  24.7µs ± 2%  24.2µs ± 0%   -1.94%   (p=0.000 n=10+9)
SubVec100000Inc2-4                  256µs ± 5%   246µs ± 0%   -3.77%  (p=0.000 n=10+10)
SubVec10Inc20-4                    53.7ns ± 3%  53.0ns ± 0%   -1.28%   (p=0.002 n=10+9)
SubVec100Inc20-4                    256ns ± 2%   252ns ± 0%   -1.45%  (p=0.013 n=10+10)
SubVec1000Inc20-4                  3.10µs ± 0%  3.26µs ± 0%   +4.93%    (p=0.000 n=9+9)
SubVec10000Inc20-4                 48.6µs ± 0%  48.6µs ± 0%   +0.11%    (p=0.002 n=9+9)
SubVec100000Inc20-4                1.82ms ± 0%  1.83ms ± 2%     ~      (p=0.497 n=9+10)

btracey commented 7 years ago

LGTM.

SSA should eventually be faster, but I've seen a lot of variance. See, for example, https://github.com/golang/go/issues/14995