Performance: old school slicing should generate inline code or do constant-folding

mratsim commented 3 years ago

I was trying to optimize extremely performance sensitive big integer / cryptographic code where somehow I couldn't reach the speed of other Assembly implementations.

It turned out I was initializing an array with this line

`scratchSym`[0 .. `N`-1] = `a_MR`.toOpenArray(0, `N`-1)

Using this line instead, with staticFor being a compile-time loop unroller improved performance by 30%, previously my code took 70 cycles and now it takes 50 cycles.

staticFor i, 0, `N`: # Do NOT use Nim slice/toOpenArray, they are not inlined
    `scratchSym`[i] = `a_MR`[i]

(state-of-the art C++ JIT-ed code takes 55 cycles on my machine).

Profiling:

C code

N_LIB_PRIVATE N_NIMCALL(void, montRed_asm_adx_bmi2__ENBcQkN09aE9aDZ9c67ohtYkg)(NU64* r, tyArray__QpmKhiiBZ9b9bMXYThob0NHQ t, tyArray__4RmROn7lE6QlXejY1MVycQ M, NU64 m0ninv) {
    NU64 hi;
    NU64 lo;
    NU64 rdx;
    tyArray__BoAqkBLQz0ZcxF5EXnmcOA scratch;
    tyObject_HSlice__EE5dzjqoOrHT6HJhIPXAvA T1_;
    tyArray__T9bTwSavBMQqAy6Syjcf55Q tsub;
    nimfr_("montRed_asm_adx_bmi2", "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
    nimln_(126, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
    rdx = m0ninv;
    nimln_(125, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
    nimln_(127, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
    T1_ = dotdot___BokNSDrKN1xmV1nA01G9brAsystem(((NI) 0), ((NI) 5));
    if (((NI) 5)-((NI) 0) != -1 && (((NI) 5)-((NI) 0) < -1 || ((NI) 0) < 0 || ((NI) 0) > 11 || ((NI) 5) < 0 || ((NI) 5) > 11)){ raiseIndexError(); }
    nimln_(125, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
    X5BX5Deq___1R7esKTmlP09aog9cZijvMMg(scratch, T1_, (NU64*)((t)+(((NI) 0))), (((NI) 5))-(((NI) 0))+1);
    nimln_(175, "/home/beta/Programming/Nim/constantine/constantine/primitives/macro_assembler_x86.nim");
    asm volatile(
"# \n"
"imulq %[scratch0], %[rdx]\n"
"# ---- Reduction 0\n"
"xorq %[scratch6], %[scratch6]\n"
"# \n"
"mulxq (%[M]), %[lo], %[hi]\n"
"adcxq %[lo], %[scratch0]\n"
"adoxq %[hi], %[scratch1]\n"
"# \n"
"mulxq 8(%[M]), %[lo], %[hi]\n"
"adcxq %[lo], %[scratch1]\n"
"adoxq %[hi], %[scratch2]\n"
// .............

Araq commented 3 years ago

Slicing via scratchSym[0 .. N-1] is the problem here, not toOpenarray.

mratsim commented 3 years ago

Ah indeed, so this should be inlined: https://github.com/nim-lang/Nim/blob/version-1-4/lib/system.nim#L2528-L2540

https://github.com/nim-lang/Nim/blob/1d8b7aa07ca9989b80dd758d66c7f4ba7dc533f7/lib/system.nim#L2528-L2540

nim-lang / Nim

Performance: old school slicing should generate inline code or do constant-folding #16887