The author already upgraded the package, following some of my suggestions. But as we see below, there's still room for improvement:
```julia
julia> for T ∈ [modIntType(11), Mod{11}, MiniMod{11}]
           n=10^4; x = rand(T.(0:10),(n,n)); print(sizeof(x)," "); @time x*x; end
100000000 3629.556939 seconds (5 allocations: 95.398 MiB, 0.00% gc time)
800000000 5584.394732 seconds (5 allocations: 762.970 MiB, 0.00% gc time)
100000000 6322.762263 seconds (5 allocations: 95.398 MiB, 0.00% gc time)
```
I am happy to report that Tullio.jl works with my modIntTypes and improves the benchmark:
```julia
for T ∈ [Mod8, Mod16, Mod32, Mod64, Mod128]
    @eval function matmul_tullio(x ::Matrix{$T{m}}, y ::Matrix{$T{m}}) ::Matrix{$T{m}} where m
        Tullio.@tullio z[i,k] := x[i,j] * y[j,k]; return z end; end;
```
```julia
julia> T=modIntType(11); n=10^4; x = rand(T.(0:10),(n,n)); print(sizeof(x)," "); @time matmul_tullio(x,x);
100000000 888.368574 seconds (176 allocations: 95.377 MiB, 0.00% gc time)
```
You could try version 1 of Mods.
A better solution is to add the generic type support in version 1 back to this package.
The `Int32` data type is quite necessary for getting performance. `Int` causes GPU code to fail, because computing `*` and `+` in finite field algebra requires wide operations that convert `Int` to `Int128`, while not all architectures support `Int128`.

Sticking with version 1 of Mods is not a good option, because it means this package will be incompatible with any other package using the latest Mods. @scheinerman We'd better have a generic implementation to make the package flexible to use, and meanwhile a stricter input check to help users avoid trouble.

Finally, a remark on package versioning: we'd better bump the major version after a major type change, @scheinerman. The definition introduced in v2 broke the package GenericTensorNetworks.jl, because we always specify dependency compatibility based on the first non-zero version. Please check: https://pkgdocs.julialang.org/v1/compatibility/
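To make the widening issue concrete, here is a small illustration (the modulus below is made up for the example and is not from the package):

```julia
m = typemax(Int) - 58            # an arbitrary large modulus, for illustration only
a, b = m - 3, m - 7              # two residues close to the modulus
mod(a * b, m)                    # wrong: a*b overflows and wraps around in Int64
mod(widemul(a, b), m)            # correct: widemul(a, b) widens to Int128 before reducing
```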
Please let me know if you need help in reviewing the PRs. :D
@GiggleLiu Is there any difference in performance when using `IntN` vs `UIntN`? Isn't using the latter more convenient (more space to work with)? Are there any problems with `UIntN` on GPUs?
Also, @scheinerman's current implementation contains https://github.com/scheinerman/Mods.jl/blob/60e8741af46a88404bc2a9f98d250002d5910695/src/arithmetic.jl#L5 https://github.com/scheinerman/Mods.jl/blob/60e8741af46a88404bc2a9f98d250002d5910695/src/arithmetic.jl#L24
I suspect that enlarging the dtype for billions of entries in a large `Array` is not good for performance, no?
I have a more difficult, low-level question @GiggleLiu: How could we achieve that `+` and `*` of `Mod` outperform those of `Int`, especially with a smaller modulus? Intuitively, arithmetic modulo e.g. 11 is simpler than full arithmetic on `Int`s. Is there a way to change how bits are carried over in these operations? Or are `Int` operations built into CPUs while `Mod` operations are not, and there's nothing we can do about it?
Hi @GiggleLiu: The widening will only happen if the modulus is 2^31 or larger. And if the modulus is less than 256, do consider my new MiniMods module. Does this help?
> @GiggleLiu Is there any difference in performance when using `IntN` vs `UIntN`? Isn't using the latter more convenient (more space to work with)? Are there any problems with `UIntN` on GPUs?
Not much difference between `UInt` and `Int`. In practice, I find `Int` faster, maybe due to better instruction support? There is no problem with `UInt` on GPUs.
> Also, @scheinerman's current implementation contains https://github.com/scheinerman/Mods.jl/blob/60e8741af46a88404bc2a9f98d250002d5910695/src/arithmetic.jl#L5 https://github.com/scheinerman/Mods.jl/blob/60e8741af46a88404bc2a9f98d250002d5910695/src/arithmetic.jl#L24. I suspect that enlarging the dtype for billions of entries in a large `Array` is not good for performance, no?
Yeah, that does not run on GPU.
```julia
julia> using CUDA, Mods

julia> a = Mod{typemax(Int)}.(CUDA.CuArray(rand(Int, 50)));

julia> a .* a;
ERROR: LLVM error: Undefined external symbol "__divti3"
Stacktrace:
  [1] handle_error(reason::Cstring)
    @ LLVM ~/.julia/packages/LLVM/OLMpi/src/core/context.jl:168
  [2] LLVMTargetMachineEmitToMemoryBuffer(T::LLVM.TargetMachine, M::LLVM.Module, codegen::LLVM.API.LLVMCodeGenFileType, ErrorMessage::Base.RefValue{…}, OutMemBuf::Base.RefValue{…})
    @ LLVM.API ~/.julia/packages/LLVM/OLMpi/lib/15/libLLVM_h.jl:4326
```
> I have a more difficult, low-level question @GiggleLiu: How could we achieve that `+` and `*` of `Mod` outperform those of `Int`, especially with a smaller modulus?
Oh, an `Int` operation only takes 1 CPU clock cycle (the minimum amount of time). You need to design a specialized machine for finite field computing.
> Intuitively, arithmetic modulo e.g. 11 is simpler than full arithmetic on `Int`s. Is there a way to change how bits are carried over in these operations? Or are `Int` operations built into CPUs while `Mod` operations are not, and there's nothing we can do about it?
I do not think there is anything we can do except building a new chip. BTW, floating-point operations can be much faster than `Int` in matrix multiplication due to the special hardware-level support for `fma` and SIMD.
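For reference, Julia exposes that hardware support through standard functions (shown here only as a small illustration):

```julia
fma(2.0, 3.0, 1.0)      # fused multiply-add with a single rounding; one instruction on supporting CPUs
muladd(2.0, 3.0, 1.0)   # like fma, but lets the compiler decide whether fusing is profitable
```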
> Hi @GiggleLiu: The widening will only happen if the modulus is 2^31 or larger. And if the modulus is less than 256, do consider my new MiniMods module. Does this help?
Unfortunately no. I am using finite field algebra + the Chinese remainder theorem for computing numbers much larger than 2^64. Normally, I choose the largest prime numbers as the moduli.
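As a toy illustration of that workflow (a hypothetical standalone helper written for this example, not the Mods.jl API): compute residues modulo several small coprime moduli and recombine them with the Chinese remainder theorem.

```julia
# Hypothetical helper, for illustration only: recombine residues r_i modulo pairwise
# coprime moduli m_i into the unique value modulo prod(m_i).
function crt(residues, moduli)
    M = prod(big.(moduli))              # BigInt, so the recombined value may exceed 2^64
    x = big(0)
    for (r, m) in zip(residues, moduli)
        Mi = M ÷ m
        x += r * Mi * invmod(Mi, m)
    end
    return mod(x, M)
end

crt([mod(1234, 101), mod(1234, 103)], [101, 103])   # recovers 1234, since 1234 < 101*103
```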
OK, there are a couple of improvements I can think of. We can't help widening when multiplying mod `N` numbers when `N > typemax(Int32)`. But I think part of the performance hit might be incessant calls to `mod`. We can also avoid widening on addition when `N < typemax(Int64)/2`.

What I can do is make calls to `mod` contingent on the value lying outside the range `[0,N)`. This means having an `if 0 <= value < N` clause, but that should be quick compared to always doing `mod(val,N)`.

Thoughts?
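A minimal sketch of what that could look like (my own illustration of the idea, not the package's actual code; it assumes `N ≤ typemax(Int64) ÷ 2`, so the sum of two residues cannot overflow):

```julia
# a and b are assumed to lie in [0, N); since N ≤ typemax(Int64) ÷ 2, a + b cannot
# overflow, and a single conditional subtraction replaces the call to mod(a + b, N).
function add_mod(a::Int64, b::Int64, N::Int64)
    s = a + b
    return s < N ? s : s - N
end
```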
I'm happy to report that my types work on GPUs:
```julia
julia> using CUDA

julia> for T ∈ modIntType.([11, typemax(UInt32)])
           n=10^4; x=rand(T.(0:10),(n,n)); println(sizeof(x)," ",T); @time x*x; y=cu(x); CUDA.@time y*y; end
100000000 Mod8{0x0b}
4149.268271 seconds (5 allocations: 95.398 MiB, 0.00% gc time)
  28.927884 seconds (72 CPU allocations: 3.281 KiB) (1 GPU allocation: 95.367 MiB, 0.00% memmgmt time)
800000000 Mod64{0x00000000ffffffff}
3348.985094 seconds (5 allocations: 762.970 MiB, 0.00% gc time)
  54.412503 seconds (72 CPU allocations: 3.281 KiB) (1 GPU allocation: 762.939 MiB, 0.00% memmgmt time)
```
> Oh, an `Int` operation only takes 1 CPU clock cycle (the minimum amount of time). You need to design a specialized machine for finite field computing. I do not think there is anything we can do except building a new chip. BTW, floating-point operations can be much faster than `Int` in matrix multiplication due to the special hardware-level support for `fma` and SIMD.
Too bad. Just out of curiosity, if someone with enough resources were to actually design such a chip, could `+` and `*` be covered for all moduli m, or would there have to be 2^32-1 individual operations defined, like `+_m(x,y)` and `*_m(x,y)` for m=2,...,2^32?
> We can't help widening when multiplying mod `N` numbers when `N > typemax(Int32)`.
You could avoid widening by limiting the size of the modulus in a given `UInt` type, like my `modIntType` does.
> But I think part of the performance hit might be incessant calls to `mod`.
I don't think `mod` is the problem, since I use it and the implementation is fast.
> What I can do is make calls to `mod` contingent on the value lying outside the range `[0,N)`. This means having an `if 0 <= value < N` clause, but that should be quick compared to always doing `mod(val,N)`.
I'm not sure if `if ... else ...` is faster than `val % m`. Best if you try it out and compare.
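One quick way to compare the two (a sketch using BenchmarkTools, with made-up input data; `always_mod` and `branch_mod` are names invented for this example):

```julia
using BenchmarkTools

const m = 11
vals = rand(0:2m-1, 10^6)                       # roughly half the values already lie in [0, m)

always_mod(v) = mod(v, m)                       # unconditional reduction
branch_mod(v) = 0 <= v < m ? v : mod(v, m)      # reduce only when the value is out of range

@btime map(always_mod, $vals);
@btime map(branch_mod, $vals);
```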
I'm opening a separate issue about the things I mentioned in https://github.com/scheinerman/Mods.jl/issues/25, so that my suggestion for improvement is a bit more visible for later reference, and to give some benchmarks.
The problem is the slow computation of matrix products:
My suggestion is to use the following code, which will hopefully be integrated (eventually, in some form) into Mods.jl:
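A minimal sketch of the idea (an illustration only, not the actual `modIntType`/`MiniMod` implementation): store residues in a small unsigned integer and restrict the modulus so that products cannot overflow the storage type.

```julia
# Sketch only: residues are stored in a UInt8, but the modulus m is restricted to at
# most 2^4-1 = 15, so a product of two residues is at most 14*14 = 196 and never
# overflows the UInt8 storage type.
struct Mod8{m} <: Number
    val::UInt8
    function Mod8{m}(x::Integer) where m
        @assert 0 < m ≤ 15 "Mod8 requires a modulus of at most 2^4-1"
        new{m}(UInt8(mod(x, m)))
    end
end

# + and * reduce eagerly, so operands are always < m and no widening is needed.
Base.:+(a::Mod8{m}, b::Mod8{m}) where m = Mod8{m}(a.val + b.val)
Base.:*(a::Mod8{m}, b::Mod8{m}) where m = Mod8{m}(a.val * b.val)
Base.zero(::Type{Mod8{m}}) where m = Mod8{m}(0)

# Mod16, Mod32, Mod64, Mod128 would follow the same pattern with wider storage types.
```

For example, `Mod8{11}(7) * Mod8{11}(9)` gives the residue 8, since 63 mod 11 = 8.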
Notice that no overflow occurs in `+` and `*`, since in `Mod8` the values go from 0,...,2^8-1 but the modulus is at most 2^4-1; in `Mod16` the values go from 0,...,2^16-1 but the modulus is at most 2^8-1; etc.

To compare the efficiency of my implementation to existing ones (and to `Float`s and `Int`s), consider these benchmarks:
As we can see, my implementation is the most performant of the 5 available ones. It still needs 1 h to compute the product of 10K×10K matrices, while the other methods need 2, 5, 14, and 8 h, respectively. It is frustrating, since multiplying `Float`s requires a mere 16 seconds. My hope is that one day, Octavian.jl or Tullio.jl might be used to speed up this calculation.