Speedup poly_Rq_inv in avx2-hps2048509

The commit message has a detailed explanation, but it's the same method as used in avx2-hrss701. This pull request brings about a 13 times speedup to poly_Rq_inv, which means that it's now about 105 times faster than the reference function.

Also renames a mask in poly_R2_mul in avx2-hrss701 since it has the wrong name (already fixed in the avx2-hps2048509 version).

This should be one of the last larger optimizations for hps2048509 (poly_lift can still be optimized, if I remember correctly).

jschanck / ntru

Speedup poly_Rq_inv in avx2-hps2048509 #5