Along with the above, this patch also slightly improves the speed of sparse matrix multiplication, bringing over a concept from the regular matrix multiplication: it minimizes the int16_t -> int32_t expansion by adding values together in pairs first. This tested slightly faster and showed no bench variation (checked up to depth 24). It is also unlikely to cause an overflow in Berserk, as its ReLU architecture doesn't clamp values at int8_t max, unlike other engines, which may experience this issue.
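The pairwise trick can be sketched in plain C. This is only a scalar illustration of the idea (the actual patch operates on SIMD vectors, and the function names here are hypothetical):

```c
#include <stdint.h>
#include <stddef.h>

/* Baseline: widen every int16_t element to int32_t before accumulating. */
int32_t sum_widen_each(const int16_t *v, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (int32_t)v[i];       /* one expansion per element */
    return sum;
}

/* Pairwise variant: add neighbours while still in int16_t, then widen,
 * halving the number of int16_t -> int32_t expansions. Correct only as
 * long as each pair sum stays within int16_t range -- which is the
 * overflow concern discussed above. Assumes n is even for brevity. */
int32_t sum_widen_pairs(const int16_t *v, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; i += 2) {
        int16_t pair = (int16_t)(v[i] + v[i + 1]);  /* int16_t addition */
        sum += (int32_t)pair;                        /* one expansion per pair */
    }
    return sum;
}
```

In vectorized code the same idea means doing 16-bit adds before the widening step, so half as many 32-bit lanes need to be produced and accumulated.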
Bench: 4611949
In order to prevent int16_t overflows, this new network (the same as the previous) is quantized with 32 in the input layer.

ELO   | 0.61 +- 1.95 (95%)
SPRT  | 6.0+0.06s Threads=1 Hash=8MB
LLR   | 2.95 (-2.94, 2.94) [-2.50, 0.50]
GAMES | N: 58472 W: 14105 L: 14003 D: 30364
http://chess.grantnet.us/test/33600/
Worth noting there is no improvement going to 4x