nodchip / Stockfish

UCI chess engine
http://www.stockfishchess.com/
GNU General Public License v3.0

[WIP] Correct feature_transformer sgd and do ADA-like lr scaling per feature #282

Closed Sopel97 closed 3 years ago

Sopel97 commented 3 years ago

This PR is mainly for discussion and ideas. It has working code that achieved better results in short training runs.

The first change relates to #276. The second change works on top of the fixed SGD and varies the learning rate for individual features depending on how often each feature occurs. The exact formulae used are only loosely justified and could likely be improved; this is not my field.
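A minimal sketch of what an AdaGrad-like per-feature scaling could look like for a sparse feature transformer, assuming a per-feature accumulator of squared gradients; the names, dimensions, and exact accumulator form are illustrative and are not the PR's actual code:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch. Each training position activates only a few feature
// columns of the transformer. Plain SGD applies the same learning rate to
// every touched column regardless of how often that feature occurs; an
// AdaGrad-like scheme accumulates per-feature gradient statistics and scales
// the step so that rare features receive relatively larger updates.
struct FeatureTransformerSketch {
    static constexpr int kNumFeatures = 41024;  // illustrative input size
    static constexpr int kHalfDims    = 256;    // output dims per perspective

    std::vector<float> weights = std::vector<float>(kNumFeatures * kHalfDims, 0.0f);
    std::vector<float> grad_sq = std::vector<float>(kNumFeatures, 0.0f);  // per-feature accumulator

    // Apply one sparse update: `active` lists the feature indices present in
    // the batch, `grad[i]` holds the gradient for the column of active[i].
    void update(const std::vector<int>& active,
                const std::vector<std::vector<float>>& grad,
                float base_lr, float eps = 1e-8f)
    {
        for (std::size_t i = 0; i < active.size(); ++i) {
            const int f = active[i];

            // Accumulate the squared gradient norm of this feature's column.
            float sq = 0.0f;
            for (int d = 0; d < kHalfDims; ++d)
                sq += grad[i][d] * grad[i][d];
            grad_sq[f] += sq;

            // AdaGrad-like scaling: features seen often (large accumulator)
            // get a smaller effective learning rate, rare features a larger one.
            const float lr = base_lr / (std::sqrt(grad_sq[f]) + eps);
            for (int d = 0; d < kHalfDims; ++d)
                weights[f * kHalfDims + d] -= lr * grad[i][d];
        }
    }
};
```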

When applied to the halfkae5 architecture it shows measurable gains in short training sessions. Here it was trained on 100M positions and tested against the baseline:

Score of tuna vs stockfish: 365 - 302 - 333  [0.531] 1000
Elo difference: 21.92 +/- 17.59

During training one can observe avg_abs_weight for the feature transformer increasing much more quickly, which suggests that learning in this layer is accelerated.
baseline: https://pastebin.com/4RqXPrNG
pr: https://pastebin.com/u1fzDKj8
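For reference, a small sketch of the kind of statistic avg_abs_weight refers to, assuming it is simply the mean absolute value of the feature transformer's weights (the helper name and signature are hypothetical):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: mean absolute value of a weight buffer, the sort of
// quantity reported as avg_abs_weight in the linked training logs.
float avg_abs_weight(const std::vector<float>& weights) {
    if (weights.empty())
        return 0.0f;
    double sum = 0.0;
    for (float w : weights)
        sum += std::fabs(w);
    return static_cast<float>(sum / weights.size());
}
```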