The author already upgraded the package, following some of my suggestions. But as we see below, there's still room for improvement:
```julia
julia> for T ∈ [modIntType(11), Mod{11}, MiniMod{11}]
           n=10^4; x = rand(T.(0:10),(n,n)); print(sizeof(x)," "); @time x*x; end
100000000 3629.556939 seconds (5 allocations: 95.398 MiB, 0.00% gc time)
800000000 5584.394732 seconds (5 allocations: 762.970 MiB, 0.00% gc time)
100000000 6322.762263 seconds (5 allocations: 95.398 MiB, 0.00% gc time)
```
I am happy to report that Tullio.jl works with my modIntTypes and improves the benchmark:
```julia
for T ∈ [Mod8, Mod16, Mod32, Mod64, Mod128]
    @eval function matmul_tullio(x ::Matrix{$T{m}}, y ::Matrix{$T{m}}) ::Matrix{$T{m}} where m
        Tullio.@tullio z[i,k] := x[i,j] * y[j,k]; return z end; end;
```
```julia
julia> T=modIntType(11); n=10^4; x = rand(T.(0:10),(n,n)); print(sizeof(x)," "); @time matmul_tullio(x,x);
100000000 888.368574 seconds (176 allocations: 95.377 MiB, 0.00% gc time)
```
You could try version 1 of Mods.
A better solution is to add the generic type support in version 1 back to this package.
The `Int32` data type is quite necessary for getting performance. `Int` causes GPU code to fail, because computing `*` and `+` in finite field algebra requires wide operations that convert `Int` to `Int128`, while not all architectures support `Int128`.

Sticking with version 1 of Mods is not a good option, because it means this package will be incompatible with any other package using the latest Mods. @scheinerman We'd better have a generic implementation to make the package flexible to use, and meanwhile a stricter input check to help users avoid trouble.

Finally, a remark on package versioning: we'd better bump the major version after a major type change, @scheinerman. The definition introduced in v2 broke the package GenericTensorNetworks.jl, because we always specify dependency compatibility based on the first non-zero version. Please check: https://pkgdocs.julialang.org/v1/compatibility/
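To make the widening issue concrete, here is a small illustration (the modulus below is made up for the example and is not from the package):

```julia
m = typemax(Int) - 58            # an arbitrary large modulus, for illustration only
a, b = m - 3, m - 7              # two residues close to the modulus
mod(a * b, m)                    # wrong: a*b overflows and wraps around in Int64
mod(widemul(a, b), m)            # correct: widemul(a, b) widens to Int128 before reducing
```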
Please let me know if you need help in reviewing the PRs. :D
@GiggleLiu Is there any difference in performance when using `IntN` vs `UIntN`? Isn't using the latter more convenient (more space to work with)? Are there any problems with `UIntN` on GPUs?
Also, @scheinerman's current implementation contains https://github.com/scheinerman/Mods.jl/blob/60e8741af46a88404bc2a9f98d250002d5910695/src/arithmetic.jl#L5 https://github.com/scheinerman/Mods.jl/blob/60e8741af46a88404bc2a9f98d250002d5910695/src/arithmetic.jl#L24
I suspect that enlarging the dtype for billions of entries in a large `Array` is not good for performance, no?
I have a more difficult, low-level question @GiggleLiu: How could we achieve that `+` and `*` of `Mod` outperform those of `Int`, especially with a smaller modulus? Intuitively, arithmetic modulo e.g. 11 is simpler than full arithmetic on `Int`s. Is there a way to change how bits are carried over in these operations? Or are `Int` operations built into CPUs while `Mod` operations are not, and there's nothing we can do about it?
Hi @GiggleLiu: The widening will only happen if the modulus is 2^31 or larger. And if the modulus is less than 256, do consider my new MiniMods module. Does this help?
> @GiggleLiu Is there any difference in performance when using `IntN` vs `UIntN`? Isn't using the latter more convenient (more space to work with)? Are there any problems with `UIntN` on GPUs?
Not much difference between `UInt` and `Int`. In practice, I find `Int` faster, maybe due to better instruction support? There is no problem with `UInt` on GPUs.
> Also, @scheinerman's current implementation contains https://github.com/scheinerman/Mods.jl/blob/60e8741af46a88404bc2a9f98d250002d5910695/src/arithmetic.jl#L5 https://github.com/scheinerman/Mods.jl/blob/60e8741af46a88404bc2a9f98d250002d5910695/src/arithmetic.jl#L24. I suspect that enlarging the dtype for billions of entries in a large `Array` is not good for performance, no?
Yeah, that does not run on GPU.
```julia
julia> using CUDA, Mods

julia> a = Mod{typemax(Int)}.(CUDA.CuArray(rand(Int, 50)));

julia> a .* a;
ERROR: LLVM error: Undefined external symbol "__divti3"
Stacktrace:
  [1] handle_error(reason::Cstring)
    @ LLVM ~/.julia/packages/LLVM/OLMpi/src/core/context.jl:168
  [2] LLVMTargetMachineEmitToMemoryBuffer(T::LLVM.TargetMachine, M::LLVM.Module, codegen::LLVM.API.LLVMCodeGenFileType, ErrorMessage::Base.RefValue{…}, OutMemBuf::Base.RefValue{…})
    @ LLVM.API ~/.julia/packages/LLVM/OLMpi/lib/15/libLLVM_h.jl:4326
```
> I have a more difficult, low-level question @GiggleLiu: How could we achieve that `+` and `*` of `Mod` outperform those of `Int`, especially with a smaller modulus?
Oh, an `Int` operation only takes 1 CPU clock cycle (the minimum amount of time). You need to design a specialized machine for finite field computing.
> Intuitively, arithmetic modulo e.g. 11 is simpler than full arithmetic on `Int`s. Is there a way to change how bits are carried over in these operations? Or are `Int` operations built into CPUs while `Mod` operations are not, and there's nothing we can do about it?
I do not think there is anything we can do except building a new chip. BTW, floating-point operations can be much faster than `Int` in matrix multiplication due to the special hardware-level support for `fma` and SIMD.
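For reference, Julia exposes that hardware support through standard functions (shown here only as a small illustration):

```julia
fma(2.0, 3.0, 1.0)      # fused multiply-add with a single rounding; one instruction on supporting CPUs
muladd(2.0, 3.0, 1.0)   # like fma, but lets the compiler decide whether fusing is profitable
```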
> Hi @GiggleLiu: The widening will only happen if the modulus is 2^31 or larger. And if the modulus is less than 256, do consider my new MiniMods module. Does this help?
Unfortunately no. I am using finite field algebra + the Chinese remainder theorem for computing numbers much larger than 2^64. Normally, I choose the largest prime numbers as the moduli.
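As a toy illustration of that workflow (a hypothetical standalone helper written for this example, not the Mods.jl API): compute residues modulo several small coprime moduli and recombine them with the Chinese remainder theorem.

```julia
# Hypothetical helper, for illustration only: recombine residues r_i modulo pairwise
# coprime moduli m_i into the unique value modulo prod(m_i).
function crt(residues, moduli)
    M = prod(big.(moduli))              # BigInt, so the recombined value may exceed 2^64
    x = big(0)
    for (r, m) in zip(residues, moduli)
        Mi = M ÷ m
        x += r * Mi * invmod(Mi, m)
    end
    return mod(x, M)
end

crt([mod(1234, 101), mod(1234, 103)], [101, 103])   # recovers 1234, since 1234 < 101*103
```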
OK, there are a couple of improvements I can think of. We can't help widening when multiplying mod `N` numbers when `N > typemax(Int32)`. But I think part of the performance hit might be incessant calls to `mod`. We can also avoid widening on addition when `N < typemax(Int64)/2`.

What I can do is make calls to `mod` contingent on the value lying outside the range `[0,N)`. This means having an `if 0 <= value < N` clause, but that should be quick compared to always doing `mod(val,N)`.

Thoughts?
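A minimal sketch of what that could look like (my own illustration of the idea, not the package's actual code; it assumes `N ≤ typemax(Int64) ÷ 2`, so the sum of two residues cannot overflow):

```julia
# a and b are assumed to lie in [0, N); since N ≤ typemax(Int64) ÷ 2, a + b cannot
# overflow, and a single conditional subtraction replaces the call to mod(a + b, N).
function add_mod(a::Int64, b::Int64, N::Int64)
    s = a + b
    return s < N ? s : s - N
end
```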
I'm happy to report that my types work on GPUs:
```julia
julia> using CUDA

julia> for T ∈ modIntType.([11, typemax(UInt32)])
           n=10^4; x=rand(T.(0:10),(n,n)); println(sizeof(x)," ",T); @time x*x; y=cu(x); CUDA.@time y*y; end
100000000 Mod8{0x0b}
4149.268271 seconds (5 allocations: 95.398 MiB, 0.00% gc time)
  28.927884 seconds (72 CPU allocations: 3.281 KiB) (1 GPU allocation: 95.367 MiB, 0.00% memmgmt time)
800000000 Mod64{0x00000000ffffffff}
3348.985094 seconds (5 allocations: 762.970 MiB, 0.00% gc time)
  54.412503 seconds (72 CPU allocations: 3.281 KiB) (1 GPU allocation: 762.939 MiB, 0.00% memmgmt time)
```
> Oh, an `Int` operation only takes 1 CPU clock cycle (the minimum amount of time). You need to design a specialized machine for finite field computing. I do not think there is anything we can do except building a new chip. BTW, floating-point operations can be much faster than `Int` in matrix multiplication due to the special hardware-level support for `fma` and SIMD.
Too bad. Just out of curiosity, if someone with enough resources were to actually design such a chip, could `+` and `*` be covered for all moduli m, or would there have to be 2^32-1 individual operations defined, like `+_m(x,y)` and `*_m(x,y)` for m=2,...,2^32?
> We can't help widening when multiplying mod `N` numbers when `N > typemax(Int32)`.
You could avoid widening by limiting the size of the modulus in a given `UInt` type, like my `modIntType` does.
> But I think part of the performance hit might be incessant calls to `mod`.
I don't think `mod` is the problem, since I use it and the implementation is fast.
> What I can do is make calls to `mod` contingent on the value lying outside the range `[0,N)`. This means having an `if 0 <= value < N` clause, but that should be quick compared to always doing `mod(val,N)`.
I'm not sure if `if ... else ...` is faster than `val % m`. Best if you try it out and compare.
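One quick way to compare the two (a sketch using BenchmarkTools, with made-up input data; `always_mod` and `branch_mod` are names invented for this example):

```julia
using BenchmarkTools

const m = 11
vals = rand(0:2m-1, 10^6)                       # roughly half the values already lie in [0, m)

always_mod(v) = mod(v, m)                       # unconditional reduction
branch_mod(v) = 0 <= v < m ? v : mod(v, m)      # reduce only when the value is out of range

@btime map(always_mod, $vals);
@btime map(branch_mod, $vals);
```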
I'm opening a separate issue about the things I mentioned in https://github.com/scheinerman/Mods.jl/issues/25, so that my suggestion for improvement is a bit more visible for later reference, and to give some benchmarks.
The problem is the slow computation of matrix products:
My suggestion is to use the following code, which will hopefully be integrated (eventually, in some form) into Mods.jl:
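A minimal sketch of the idea (an illustration only, not the actual `modIntType`/`MiniMod` implementation): store residues in a small unsigned integer and restrict the modulus so that products cannot overflow the storage type.

```julia
# Sketch only: residues are stored in a UInt8, but the modulus m is restricted to at
# most 2^4-1 = 15, so a product of two residues is at most 14*14 = 196 and never
# overflows the UInt8 storage type.
struct Mod8{m} <: Number
    val::UInt8
    function Mod8{m}(x::Integer) where m
        @assert 0 < m ≤ 15 "Mod8 requires a modulus of at most 2^4-1"
        new{m}(UInt8(mod(x, m)))
    end
end

# + and * reduce eagerly, so operands are always < m and no widening is needed.
Base.:+(a::Mod8{m}, b::Mod8{m}) where m = Mod8{m}(a.val + b.val)
Base.:*(a::Mod8{m}, b::Mod8{m}) where m = Mod8{m}(a.val * b.val)
Base.zero(::Type{Mod8{m}}) where m = Mod8{m}(0)

# Mod16, Mod32, Mod64, Mod128 would follow the same pattern with wider storage types.
```

For example, `Mod8{11}(7) * Mod8{11}(9)` gives the residue 8, since 63 mod 11 = 8.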
Notice that no overflow occurs in `+` and `*`, since in `Mod8` the values go from 0,...,2^8-1 but the modulus is at most 2^4-1; in `Mod16` the values go from 0,...,2^16-1 but the modulus is at most 2^8-1; etc.

To compare the efficiency of my implementation to existing ones (and to `Float`s and `Int`s), consider these benchmarks:
As we can see, my implementation is the most performant of the 5 available ones. It still needs 1 h to compute the product of 10K×10K matrices, while the other methods need 2, 5, 14, and 8 h, respectively. It is frustrating, since multiplying `Float`s requires a mere 16 seconds. My hope is that one day, Octavian.jl or Tullio.jl might be used to speed up this calculation.