gonum / internal

Internal routines for the gonum project [DEPRECATED]

asm/f64: Updated scal assembly to wide, pipelined loops. #53

Closed: Kunde21 closed this 7 years ago

Kunde21 commented 7 years ago

This depends on the updates in gonum/internal#52.

Cleaned up and widened loops in scal* assembly functions.
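For readers unfamiliar with the term, here is a minimal pure-Go sketch of what "widening" a scaling loop could look like. It is not the assembly from this PR: the function names `scalUnitaryRef` and `scalUnitaryWide` and the unroll factor of 4 are assumptions chosen for illustration only.

```go
package main

import "fmt"

// scalUnitaryRef is a plain reference loop: x[i] *= alpha for every element.
// (Hypothetical name, used only for this illustration.)
func scalUnitaryRef(alpha float64, x []float64) {
	for i := range x {
		x[i] *= alpha
	}
}

// scalUnitaryWide handles four elements per iteration, then finishes the
// remaining 0 to 3 elements in a tail loop. The factor 4 is an assumption,
// not the width used by the actual assembly.
func scalUnitaryWide(alpha float64, x []float64) {
	i := 0
	for ; i+4 <= len(x); i += 4 {
		x[i] *= alpha
		x[i+1] *= alpha
		x[i+2] *= alpha
		x[i+3] *= alpha
	}
	for ; i < len(x); i++ {
		x[i] *= alpha
	}
}

func main() {
	a := []float64{1, 2, 3, 4, 5, 6, 7}
	b := append([]float64(nil), a...)
	scalUnitaryRef(2, a)
	scalUnitaryWide(2, b)
	fmt.Println(a) // [2 4 6 8 10 12 14]
	fmt.Println(b) // [2 4 6 8 10 12 14]
}
```

A wider loop body does more work per branch, but very short inputs spend most of their time in the setup and tail code, so results at small sizes can go either way.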

Benchmarks vs old asm

| name | Old asm time/op | New asm time/op | delta | |
|---|---|---|---|---|
| ScalUnitary/ScalUnitary-1-12 | 2.89ns ± 7% | 2.92ns ±13% | ~ | (p=0.862 n=19+20) |
| ScalUnitary/ScalUnitary-2-12 | 3.03ns ± 1% | 2.77ns ± 3% | −8.49% | (p=0.000 n=17+17) |
| ScalUnitary/ScalUnitary-3-12 | 3.88ns ±12% | 3.07ns ± 1% | −20.74% | (p=0.000 n=20+16) |
| ScalUnitary/ScalUnitary-4-12 | 2.87ns ±13% | 3.41ns ±10% | +18.68% | (p=0.000 n=19+19) |
| ScalUnitary/ScalUnitary-5-12 | 3.48ns ±19% | 3.52ns ± 6% | +1.29% | (p=0.028 n=20+18) |
| ScalUnitary/ScalUnitary-10-12 | 4.75ns ± 3% | 4.39ns ± 5% | −7.56% | (p=0.000 n=16+18) |
| ScalUnitary/ScalUnitary-100-12 | 20.9ns ± 2% | 17.2ns ± 3% | −17.57% | (p=0.000 n=16+17) |
| ScalUnitary/ScalUnitary-500-12 | 79.4ns ± 0% | 71.8ns ± 9% | −9.53% | (p=0.000 n=16+16) |
| ScalUnitary/ScalUnitary-1000-12 | 145ns ± 2% | 142ns ±14% | −2.08% | (p=0.031 n=18+20) |
| ScalUnitary/ScalUnitary-5000-12 | 1.04µs ± 0% | 1.03µs ± 1% | −0.65% | (p=0.000 n=16+16) |
| ScalUnitary/ScalUnitary-10000-12 | 2.07µs ± 0% | 2.05µs ± 0% | −0.82% | (p=0.000 n=16+17) |
| ScalUnitary/ScalUnitary-50000-12 | 15.4µs ± 3% | 15.6µs ± 4% | ~ | (p=0.178 n=16+18) |
| ScalUnitaryTo/ScalUnitaryTo-1-12 | 3.02ns ± 4% | 3.09ns ±14% | ~ | (p=0.840 n=9+9) |
| ScalUnitaryTo/ScalUnitaryTo-2-12 | 3.57ns ± 7% | 3.12ns ± 2% | −12.65% | (p=0.000 n=10+9) |
| ScalUnitaryTo/ScalUnitaryTo-3-12 | 4.07ns ± 1% | 3.57ns ± 1% | −12.23% | (p=0.000 n=8+8) |
| ScalUnitaryTo/ScalUnitaryTo-4-12 | 3.33ns ± 0% | 4.35ns ± 2% | +30.76% | (p=0.000 n=8+9) |
| ScalUnitaryTo/ScalUnitaryTo-5-12 | 4.05ns ± 0% | 4.26ns ± 9% | +5.14% | (p=0.000 n=8+9) |
| ScalUnitaryTo/ScalUnitaryTo-10-12 | 5.38ns ± 8% | 5.23ns ±26% | ~ | (p=0.102 n=10+10) |
| ScalUnitaryTo/ScalUnitaryTo-100-12 | 17.9ns ± 0% | 18.1ns ± 2% | +1.19% | (p=0.001 n=8+8) |
| ScalUnitaryTo/ScalUnitaryTo-500-12 | 76.8ns ±10% | 71.8ns ± 2% | −6.54% | (p=0.000 n=9+9) |
| ScalUnitaryTo/ScalUnitaryTo-1000-12 | 140ns ± 2% | 134ns ± 1% | −4.63% | (p=0.000 n=8+8) |
| ScalUnitaryTo/ScalUnitaryTo-5000-12 | 1.35µs ±15% | 1.27µs ± 1% | −6.07% | (p=0.011 n=10+9) |
| ScalUnitaryTo/ScalUnitaryTo-10000-12 | 2.80µs ±20% | 2.63µs ± 7% | ~ | (p=0.424 n=10+10) |
| ScalUnitaryTo/ScalUnitaryTo-50000-12 | 23.0µs ± 0% | 22.9µs ± 0% | ~ | (p=0.200 n=9+8) |
| ScalInc/ScalInc-1-inc(1)-12 | 2.84ns ± 1% | 2.98ns ± 1% | +4.66% | (p=0.000 n=8+8) |
| ScalInc/ScalInc-2-inc(1)-12 | 3.52ns ± 1% | 3.41ns ± 1% | −2.95% | (p=0.000 n=8+8) |
| ScalInc/ScalInc-2-inc(2)-12 | 3.56ns ± 7% | 3.42ns ± 4% | −4.02% | (p=0.001 n=9+9) |
| ScalInc/ScalInc-2-inc(4)-12 | 3.55ns ± 2% | 3.54ns ±10% | ~ | (p=0.389 n=9+10) |
| ScalInc/ScalInc-2-inc(10)-12 | 3.50ns ± 1% | 3.41ns ± 1% | −2.79% | (p=0.000 n=9+9) |
| ScalInc/ScalInc-3-inc(1)-12 | 4.03ns ± 8% | 3.95ns ±11% | ~ | (p=0.238 n=10+10) |
| ScalInc/ScalInc-3-inc(2)-12 | 3.81ns ± 0% | 3.75ns ± 0% | −1.66% | (p=0.000 n=8+9) |
| ScalInc/ScalInc-3-inc(4)-12 | 3.95ns ±10% | 3.75ns ± 1% | −5.09% | (p=0.000 n=10+8) |
| ScalInc/ScalInc-3-inc(10)-12 | 3.82ns ± 0% | 3.81ns ± 7% | ~ | (p=0.053 n=8+9) |
| ScalInc/ScalInc-4-inc(1)-12 | 4.54ns ± 4% | 4.23ns ± 5% | −6.83% | (p=0.000 n=8+9) |
| ScalInc/ScalInc-4-inc(2)-12 | 4.54ns ± 2% | 4.21ns ± 2% | −7.32% | (p=0.000 n=9+9) |
| ScalInc/ScalInc-4-inc(4)-12 | 4.56ns ± 4% | 4.25ns ± 5% | −6.85% | (p=0.000 n=9+9) |
| ScalInc/ScalInc-4-inc(10)-12 | 4.50ns ± 0% | 4.19ns ± 1% | −6.97% | (p=0.000 n=8+8) |
| ScalInc/ScalInc-5-inc(1)-12 | 5.13ns ± 1% | 4.96ns ± 0% | −3.24% | (p=0.000 n=8+8) |
| ScalInc/ScalInc-5-inc(2)-12 | 5.37ns ±11% | 4.98ns ± 1% | −7.35% | (p=0.000 n=10+8) |
| ScalInc/ScalInc-5-inc(4)-12 | 5.22ns ± 3% | 5.07ns ± 5% | −2.85% | (p=0.038 n=9+9) |
| ScalInc/ScalInc-5-inc(10)-12 | 5.11ns ± 0% | 5.20ns ± 9% | ~ | (p=0.712 n=9+9) |
| ScalInc/ScalInc-10-inc(1)-12 | 8.01ns ± 4% | 8.85ns ±14% | +10.49% | (p=0.005 n=8+10) |
| ScalInc/ScalInc-10-inc(2)-12 | 8.35ns ±22% | 8.10ns ± 1% | ~ | (p=0.243 n=9+9) |
| ScalInc/ScalInc-10-inc(4)-12 | 9.10ns ±14% | 8.13ns ± 4% | ~ | (p=0.479 n=10+9) |
| ScalInc/ScalInc-10-inc(10)-12 | 8.70ns ±16% | 8.12ns ± 0% | ~ | (p=0.501 n=10+8) |
| ScalInc/ScalInc-500-inc(1)-12 | 457ns ±17% | 133ns ± 2% | −70.97% | (p=0.000 n=10+8) |
| ScalInc/ScalInc-500-inc(2)-12 | 435ns ± 3% | 136ns ± 9% | −68.79% | (p=0.000 n=8+9) |
| ScalInc/ScalInc-500-inc(4)-12 | 444ns ±13% | 139ns ±15% | −68.75% | (p=0.000 n=9+10) |
| ScalInc/ScalInc-500-inc(10)-12 | 447ns ± 8% | 152ns ± 1% | −66.12% | (p=0.000 n=9+8) |
| ScalInc/ScalInc-1000-inc(1)-12 | 924ns ±17% | 259ns ± 1% | −72.01% | (p=0.000 n=10+9) |
| ScalInc/ScalInc-1000-inc(2)-12 | 940ns ±12% | 267ns ± 8% | −71.55% | (p=0.000 n=10+10) |
| ScalInc/ScalInc-1000-inc(4)-12 | 946ns ±22% | 285ns ± 9% | −69.87% | (p=0.000 n=10+9) |
| ScalInc/ScalInc-1000-inc(10)-12 | 1.61µs ± 0% | 1.63µs ± 0% | +1.15% | (p=0.000 n=8+9) |
| ScalInc/ScalInc-10000-inc(1)-12 | 8.76µs ± 0% | 3.07µs ± 0% | −64.98% | (p=0.000 n=9+10) |
| ScalInc/ScalInc-10000-inc(2)-12 | 9.32µs ± 9% | 4.46µs ±19% | −52.11% | (p=0.000 n=10+10) |
| ScalInc/ScalInc-10000-inc(4)-12 | 11.1µs ± 4% | 11.1µs ±12% | ~ | (p=0.604 n=9+10) |
| ScalInc/ScalInc-10000-inc(10)-12 | 24.3µs ± 9% | 24.8µs ±17% | ~ | (p=0.725 n=10+10) |
| ScalIncTo/ScalIncTo-1-inc(1)-12 | 3.77ns ± 3% | 3.93ns ±10% | +4.04% | (p=0.003 n=9+10) |
| ScalIncTo/ScalIncTo-2-inc(1)-12 | 4.56ns ± 4% | 4.49ns ± 5% | −1.65% | (p=0.023 n=9+10) |
| ScalIncTo/ScalIncTo-2-inc(2)-12 | 4.52ns ± 1% | 4.43ns ± 0% | −2.09% | (p=0.000 n=9+8) |
| ScalIncTo/ScalIncTo-2-inc(4)-12 | 4.53ns ± 1% | 4.42ns ± 0% | −2.35% | (p=0.000 n=8+9) |
| ScalIncTo/ScalIncTo-2-inc(10)-12 | 4.84ns ±23% | 4.43ns ± 1% | −8.43% | (p=0.000 n=10+8) |
| ScalIncTo/ScalIncTo-3-inc(1)-12 | 4.99ns ± 4% | 4.76ns ± 0% | −4.68% | (p=0.000 n=8+8) |
| ScalIncTo/ScalIncTo-3-inc(2)-12 | 4.97ns ± 1% | 4.77ns ± 1% | −4.19% | (p=0.000 n=9+8) |
| ScalIncTo/ScalIncTo-3-inc(4)-12 | 4.97ns ± 1% | 4.89ns ± 9% | ~ | (p=0.163 n=8+10) |
| ScalIncTo/ScalIncTo-3-inc(10)-12 | 5.08ns ±12% | 4.76ns ± 0% | −6.30% | (p=0.000 n=9+8) |
| ScalIncTo/ScalIncTo-4-inc(1)-12 | 5.76ns ± 4% | 5.19ns ± 5% | −9.87% | (p=0.000 n=9+9) |
| ScalIncTo/ScalIncTo-4-inc(2)-12 | 5.76ns ± 3% | 5.13ns ± 0% | −11.01% | (p=0.000 n=9+9) |
| ScalIncTo/ScalIncTo-4-inc(4)-12 | 6.02ns ±14% | 5.13ns ± 1% | −14.73% | (p=0.000 n=10+8) |
| ScalIncTo/ScalIncTo-4-inc(10)-12 | 5.72ns ± 1% | 5.19ns ± 3% | −9.15% | (p=0.000 n=8+9) |
| ScalIncTo/ScalIncTo-5-inc(1)-12 | 6.08ns ± 1% | 5.60ns ± 0% | −8.01% | (p=0.000 n=8+9) |
| ScalIncTo/ScalIncTo-5-inc(2)-12 | 6.12ns ± 3% | 5.72ns ± 3% | −6.48% | (p=0.000 n=8+10) |
| ScalIncTo/ScalIncTo-5-inc(4)-12 | 6.07ns ± 0% | 5.64ns ± 2% | −7.17% | (p=0.000 n=9+9) |
| ScalIncTo/ScalIncTo-5-inc(10)-12 | 6.07ns ± 0% | 5.67ns ± 2% | −6.67% | (p=0.000 n=8+8) |
| ScalIncTo/ScalIncTo-10-inc(1)-12 | 9.54ns ±16% | 9.26ns ± 0% | ~ | (p=0.496 n=10+8) |
| ScalIncTo/ScalIncTo-10-inc(2)-12 | 9.02ns ± 1% | 10.06ns ±22% | +11.63% | (p=0.000 n=8+10) |
| ScalIncTo/ScalIncTo-10-inc(4)-12 | 9.06ns ± 2% | 9.28ns ± 1% | +2.42% | (p=0.001 n=9+8) |
| ScalIncTo/ScalIncTo-10-inc(10)-12 | 9.63ns ±11% | 9.27ns ± 1% | ~ | (p=0.498 n=10+8) |
| ScalIncTo/ScalIncTo-500-inc(1)-12 | 457ns ±13% | 136ns ± 1% | −70.34% | (p=0.000 n=10+9) |
| ScalIncTo/ScalIncTo-500-inc(2)-12 | 436ns ± 0% | 135ns ± 0% | −69.06% | (p=0.000 n=9+8) |
| ScalIncTo/ScalIncTo-500-inc(4)-12 | 440ns ± 0% | 219ns ± 4% | −50.23% | (p=0.000 n=8+8) |
| ScalIncTo/ScalIncTo-500-inc(10)-12 | 1.00µs ± 0% | 1.00µs ± 0% | −0.24% | (p=0.012 n=9+9) |
| ScalIncTo/ScalIncTo-1000-inc(1)-12 | 875ns ± 0% | 261ns ± 0% | −70.16% | (p=0.000 n=9+8) |
| ScalIncTo/ScalIncTo-1000-inc(2)-12 | 878ns ± 0% | 362ns ± 8% | −58.72% | (p=0.000 n=8+10) |
| ScalIncTo/ScalIncTo-1000-inc(4)-12 | 1.01µs ± 1% | 1.01µs ± 1% | −0.83% | (p=0.000 n=9+8) |
| ScalIncTo/ScalIncTo-1000-inc(10)-12 | 2.09µs ± 9% | 2.03µs ± 4% | −2.89% | (p=0.013 n=10+9) |
| ScalIncTo/ScalIncTo-10000-inc(1)-12 | 8.77µs ± 0% | 3.35µs ± 1% | −61.81% | (p=0.000 n=9+8) |
| ScalIncTo/ScalIncTo-10000-inc(2)-12 | 9.16µs ± 1% | 7.16µs ± 6% | −21.90% | (p=0.000 n=9+10) |
| ScalIncTo/ScalIncTo-10000-inc(4)-12 | 19.0µs ±11% | 18.7µs ± 8% | ~ | (p=0.529 n=10+10) |
| ScalIncTo/ScalIncTo-10000-inc(10)-12 | 38.5µs ± 0% | 38.5µs ± 1% | ~ | (p=0.624 n=8+8) |
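
To make the `ScalInc` rows easier to read, the sketch below shows a strided scale in plain Go. It assumes that a name such as `ScalInc-10000-inc(4)` means 10000 elements visited with a stride of 4; the benchmark source is not shown here, so that reading, the function name `scalIncSketch`, and its `int` parameters are all assumptions rather than the package's actual API.

```go
package main

import "fmt"

// scalIncSketch scales n elements of x by alpha, visiting every incX-th
// element starting at index 0. Hypothetical helper for illustration; the
// caller must ensure (n-1)*incX < len(x).
func scalIncSketch(alpha float64, x []float64, n, incX int) {
	for i, ix := 0, 0; i < n; i, ix = i+1, ix+incX {
		x[ix] *= alpha
	}
}

func main() {
	x := []float64{1, 2, 3, 4, 5, 6, 7, 8}
	// Scale 3 elements with stride 2: indices 0, 2 and 4.
	scalIncSketch(10, x, 3, 2)
	fmt.Println(x) // [10 2 30 4 50 6 7 8]
}
```

With a stride above 1, each element touched sits further from the previous one in memory, so the same element count covers a larger span of the slice.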
 
Kunde21 commented 7 years ago

All of @vladimir-ch's comments have been addressed in the commit on 54.