Create avx2-hps2048509 directory and implement poly_Rq_mul using avx2

Explanation: A similar strategy to avx2-hrss701 is used for the multiplication. The polynomial multiplication applies Toom-Cook 4-way by splitting both polynomials with 512 coefficients (nearest multiple of 32 to make implementation easier) into polynomials with 4 limbs where each limb is 128 coefficients. These polynomials are then evaluated at 7 points to form 7 128 coefficient polynomials. Multiplying these 7 points with the 7 points of the other polynomial gives us 7 points on the resulting polynomial.

This means we have to do 7 polynomial multiplications of 128 coefficients. To make this easier we apply two instances of Karatsuba so that we have to do 63 multiplications of 32 coefficient polynomials. Viewing this as a matrix means we can transpose and do multiple multiplications in a single AVX2 register. Multiplying 32 coefficient polynomials still takes too long so apply Karatsuba three more times so that we are left with multiplications of 4 coefficient polynomials which we can do using the schoolbook method.

The result of this is that we get 7 points on the resulting polynomial. Since we know what points we evaluated and the result of those evaluations we can find the coefficients of the resulting polynomial (thus finding the resulting polynomial). This can be done by using gaussian elimination or in this case by using the fact that the point evaluation is invertible (due to the fact we chose our points to have an invertible matrix). This step is called interpolation. The resulting polynomial of Toom-Cook has 1024 coefficients but is mapped to 512 coefficients. During all of these steps we account for the fact that we're actually working with 509 coefficients and not 512.

The code can be improved a little bit in some places but is already so incredibly complex that I've left it out (for now).

Results: On an Intel i5-8250u using gcc 8.3.0 the reference poly_Rq_mul takes about 285000 cycles on average whereas the avx2 version takes about 3600 cycles on average.

This is about an 80 times speedup. I expect comparable speedups on other AVX2 capable processors.

jschanck / ntru

Create avx2-hps2048509 directory and implement poly_Rq_mul using avx2 #2