status-im / nim-stint

Stack-based arbitrary-precision integers - Fast and portable with natural syntax for resource-restricted devices.
Apache License 2.0
83 stars 11 forks source link

Array backend TODO list #129

Open jangko opened 1 year ago

jangko commented 1 year ago

This is a refactoring to bring use an array backend. The actual bit representation will be 100%-bit compatible with the previous one (i.e. casting is possible). And as Stint integers are as if uint64 were extended to uintXXX it should be compatible with the newly exposed iXXX from LLVM/Clang _ExtInt http://blog.llvm.org/2020/04/the-new-clang-extint-feature-provides.html

Status

The status takes into account passing a whole test suite file. Very light changes may be needed if the test suite was using internal proc or representations like swapBytes or UintImpl The test suite is run at both compile-time and run-time. Tests were only done on littleEndian architectures

Overview

This refactor will bring significant speed and code size improvements that were tested in Constantine.

It also uses the `{.push raises: [].} pragmas to enforce error handling

Addition / Substraction

The codegen should be optimal in terms of codesize with a single ADD + ADC + ADC + ADC for uint256 on x86 This uses the intrinsics addcarry_u64 and subborrow_u64 for GCC/Clang/MSVC. With uint128 fallback for ARM/WASM. (TODO, use Clang-only intrinsics __builtin_addc and __builtin_subc)

In particular this solves #87.

Note that it is strongly recommended to use anything but GCC (i.e. Clang, MSVC, ICC) as GCC is the absolute worst compiler for multiprecision arithmetic, see https://gcc.godbolt.org/z/2h768y

#include <stdint.h>
#include <x86intrin.h>

void add256(uint64_t a[4], uint64_t b[4]){
  uint8_t carry = 0;
  for (int i = 0; i < 4; ++i)
    carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
}

GCC

add256:
        movq    (%rsi), %rax
        addq    (%rdi), %rax
        setc    %dl
        movq    %rax, (%rdi)
        movq    8(%rdi), %rax
        addb    $-1, %dl
        adcq    8(%rsi), %rax
        setc    %dl
        movq    %rax, 8(%rdi)
        movq    16(%rdi), %rax
        addb    $-1, %dl
        adcq    16(%rsi), %rax
        setc    %dl
        movq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        addb    $-1, %dl
        adcq    %rax, 24(%rdi)
        ret

Clang

add256:
        movq    (%rsi), %rax
        addq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        adcq    %rax, 24(%rdi)
        retq

Multiplication

Multiplication uses the Comba multiplication technique (a.k.a Product Scanning) that reorder the operations to significantly reduces the number of carries at the price of a temporary buffer of size 3 words and convoluted loop scheduling: https://github.com/status-im/nim-stint/blob/01ad29e7b7989b4e5b3e5ab8e6ad9d07d250667d/stint/private/uint_mul.nim#L28-L40 The benefits of comba multiplication is 30% speed improvement due to better register usage.

Also the code allows mul with high bit truncation, extended precision multiplication (mul(256, 256) -> 512) and also multiplication while only keeping the higher bits so that modular arithmetic can use a more efficient Barret Reduction. This should significantly speed up the EVM.

A specific squaring operation will be added to accelerate exponentiation (~20%)

Modular arithmetic

IO

Serialization from hex will be made significantly faster. Notation: "w" the number of words in the big int, b the number of bytes to read in the hex strings

BigInt multiplication is O(w²) and for each hex bytes read we do a multiplication for O(b * w²) Instead we can:

  1. replace multiplication by a shift for complexity O(b * w)
  2. Do 2 passes to convert hex to bytearray and then bytearray to Stint O(2b)
  3. Do 1 pass O(b)

Solution 2 will be done (but solution 3 can be kept in mind if 2 passes are a bottleneck, which is very unlikely compared to bigint operations)

Future

Relevant criticism of the division implementation in the wild: https://skanthak.homepage.t-online.de/division.html