Investigate slow binary

Multramate commented 5 years ago

Continuation of the conversation in #9 to improve efficiency of binary arithmetic.

Multramate commented 5 years ago

It is curious that you mentioned it now. This weekend I've been finalising patches to bitvec, adding a blazingly fast arithmetic of binary polynomials. Could you possibly give it a try before it is released? One can grab bitvec-HEAD from https://github.com/Bodigrim/bitvec and use Data.Bit.F2Poly.

I gave it a try, and benchmarked over several parameters, namely the leading power of the polynomials involved in the operations, the number of operations involved, and the leading power of the mod polynomial (which is relevant to BinaryField arithmetic). Keep it mind that our current implementation is purely manipulation of integers.

Results show that many multiplications of small polynomials (product [1 .. 10^6]) is initially slower for BinaryField than for F2Poly (by a factor of up to 3), but BinaryField scales better than F2Poly with increasing size of the mod polynomial (the former becomes faster than the latter when the mod polynomial has deg ~ 300 - our use cases vary from deg < 100 to deg > 500). On the other hand, multiplications of large polynomials (deg of polynomial ~ deg of mod polynomial) is significantly faster for F2Poly than for BinaryField (for a deg ~ 500 mod polynomial, it takes 1.2ms for BinaryField but only 0.06ms for F2Poly - great). Unfortunately, these are all multiplications - addition is always much faster for BinaryField (anywhere from a factor of 7 to a factor of 70). Arguably these operations are already very fast (in the nanosecond range), but they are used very often.

Additionally, I found a possible bug - doing fromInteger (0 :: F2Poly) returns a nonsensical Integral value, which also returns arithmetic underflow for Natural.

Bodigrim commented 5 years ago

That’s interesting, thanks for investigation. Could you please point me to the source code of benchmarks?

Multramate commented 5 years ago

I did them locally, but the code is exactly as follows:

module Main where

import Protolude

import Criterion.Main
import Data.Field.Galois -- change source code and rerun

main :: IO ()
main = defaultMain
  [ bench "Binary/sum" $
    nf sum z
  , bench "Binary/product" $
    nf product z
  , bench "Binary/addition" $
    nf (uncurry (+)) (x, y)
  , bench "Binary/multiplication" $
    nf (uncurry (*)) (x, y)
  ]

x :: Binary 0x80000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000425
x = 5547701740781519722635142596328196926911966227939046359094084536440568319715629280500715997622795290565075619575500349856402545234501339956177637642297104032840572435804027

y :: Binary 0x80000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000425
y = 6196555465349691279785996384838127547448114923541797426543534068992493058289537022289475955431649770234097149134032015677741886571782953737774719182732366741493639259919140

z :: [Binary 0x80000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000425]
z = fromInteger <$> [3 .. 10^6]

Bodigrim commented 5 years ago

Well, things like sum [3..10^6] are the very best scenario for Integer: values never can exceed a machine word, so it all boils down to xorI# and almost no boxing.

I made a patch, aiming to bring an addition of long F2Poly on par with Integer. Now both of them use a foreign call to mpn_xor_n.

The degradation of performance for product [3..10^6], when modulo grows, is surprising, especially while at the same time multiplication itself is much faster. Not sure what's going on here.

Multramate commented 5 years ago

Right, thanks. I will have another go over the weekend.

Bodigrim commented 5 years ago

I experimented a bit and it seems that product [3..10^6] is not a reliable benchmark for multiplication itself, because multipliers are so tiny in comparison to modulo, and depends a lot on a subtle interaction between Enum and Num instances. For example, at some point using z = [3..10^6] appeared to be 3x faster than z = fromInteger <$> [3..10^6].

Is product [3..10^6] a relevant computation for your domain? My understanding was that most of the time one multiplies a handful of polynomials, roughly of the same size with modulo.

Multramate commented 5 years ago

Is product [3..10^6] a relevant computation for your domain? My understanding was that most of the time one multiplies a handful of polynomials, roughly of the same size with modulo.

You are right - I added this benchmark just to match the product [1 .. 10^8] I saw somewhere else, and this practically never happens.

Bodigrim commented 4 years ago

Here is an output of Bench.Binary (e322e34) on my machine:

Binary/Addition                  90.10 ns   (89.10 ns .. 91.23 ns)
Binary/Multiplication            2.360 ms   (2.312 ms .. 2.411 ms)
Binary/Inversion                 6.877 ms   (6.777 ms .. 6.996 ms)
Binary/Division                  8.973 ms   (8.809 ms .. 9.190 ms)
Binary/Frobenius endomorphism    1.260 ms   (1.245 ms .. 1.278 ms)
Binary/Square root               722.4 ms   (617.0 ms .. 850.7 ms)

Then I switched Data.Field.Galois.Binary to use F2Poly (from bitvec-1.0.1.1) instead of Natural. I cannot really guarantee that I did not mess up (however, tests are still green), but here are benchmarks:

Binary/Addition                  117.5 ns   (115.2 ns .. 120.1 ns)
Binary/Multiplication            34.89 μs   (34.50 μs .. 35.36 μs)
Binary/Inversion                 304.7 μs   (301.5 μs .. 307.6 μs)
Binary/Division                  357.6 μs   (353.3 μs .. 361.8 μs)
Binary/Frobenius endomorphism    13.55 μs   (13.38 μs .. 13.71 μs)
Binary/Square root               9.462 ms   (9.192 ms .. 9.814 ms)

So addition is slightly slower, multiplication is 60x faster, inversion and division are 20x faster, Frobenius endomorphism is 90x faster, square root is 70x faster.

Multramate commented 4 years ago

Thank you! I will replace the functions here with F2Poly functions very soon.

sdiehl / galois-field

Investigate slow binary #19