nim-lang / Nim

Nim is a statically typed compiled systems programming language. It combines successful concepts from mature languages like Python, Ada and Modula. Its design focuses on efficiency, expressiveness, and elegance (in that order of priority).
https://nim-lang.org
Other
16.63k stars 1.47k forks source link

Performance: old school slicing should generate inline code or do constant-folding #16887

Open mratsim opened 3 years ago

mratsim commented 3 years ago

I was trying to optimize extremely performance sensitive big integer / cryptographic code where somehow I couldn't reach the speed of other Assembly implementations.

It turned out I was initializing an array with this line

`scratchSym`[0 .. `N`-1] = `a_MR`.toOpenArray(0, `N`-1)

Using this line instead, with staticFor being a compile-time loop unroller improved performance by 30%, previously my code took 70 cycles and now it takes 50 cycles.

staticFor i, 0, `N`: # Do NOT use Nim slice/toOpenArray, they are not inlined
    `scratchSym`[i] = `a_MR`[i]

(state-of-the art C++ JIT-ed code takes 55 cycles on my machine).

Profiling: image

image

C code

N_LIB_PRIVATE N_NIMCALL(void, montRed_asm_adx_bmi2__ENBcQkN09aE9aDZ9c67ohtYkg)(NU64* r, tyArray__QpmKhiiBZ9b9bMXYThob0NHQ t, tyArray__4RmROn7lE6QlXejY1MVycQ M, NU64 m0ninv) {
    NU64 hi;
    NU64 lo;
    NU64 rdx;
    tyArray__BoAqkBLQz0ZcxF5EXnmcOA scratch;
    tyObject_HSlice__EE5dzjqoOrHT6HJhIPXAvA T1_;
    tyArray__T9bTwSavBMQqAy6Syjcf55Q tsub;
    nimfr_("montRed_asm_adx_bmi2", "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
    nimln_(126, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
    rdx = m0ninv;
    nimln_(125, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
    nimln_(127, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
    T1_ = dotdot___BokNSDrKN1xmV1nA01G9brAsystem(((NI) 0), ((NI) 5));
    if (((NI) 5)-((NI) 0) != -1 && (((NI) 5)-((NI) 0) < -1 || ((NI) 0) < 0 || ((NI) 0) > 11 || ((NI) 5) < 0 || ((NI) 5) > 11)){ raiseIndexError(); }
    nimln_(125, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
    X5BX5Deq___1R7esKTmlP09aog9cZijvMMg(scratch, T1_, (NU64*)((t)+(((NI) 0))), (((NI) 5))-(((NI) 0))+1);
    nimln_(175, "/home/beta/Programming/Nim/constantine/constantine/primitives/macro_assembler_x86.nim");
    asm volatile(
"# \n"
"imulq %[scratch0], %[rdx]\n"
"# ---- Reduction 0\n"
"xorq %[scratch6], %[scratch6]\n"
"# \n"
"mulxq (%[M]), %[lo], %[hi]\n"
"adcxq %[lo], %[scratch0]\n"
"adoxq %[hi], %[scratch1]\n"
"# \n"
"mulxq 8(%[M]), %[lo], %[hi]\n"
"adcxq %[lo], %[scratch1]\n"
"adoxq %[hi], %[scratch2]\n"
// .............
Araq commented 3 years ago

Slicing via scratchSym[0 .. N-1] is the problem here, not toOpenarray.

mratsim commented 3 years ago

Ah indeed, so this should be inlined: https://github.com/nim-lang/Nim/blob/version-1-4/lib/system.nim#L2528-L2540

https://github.com/nim-lang/Nim/blob/1d8b7aa07ca9989b80dd758d66c7f4ba7dc533f7/lib/system.nim#L2528-L2540