Closed jarbus closed 1 year ago
Try a variance of 1/Ni.
Now using a custom glorot_normal initializer that matches the paper. Also, I was generating parameters by sampling 1 x length glorot_normal vectors, but that yields different metrics than sampling something like sqrt(length) x sqrt(length).
This yields diverse argmax outputs.
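A rough sketch of why the sampling shape matters, assuming the common glorot_normal definition with Var(W) = 2 / (fan_in + fan_out). The helper names and the example `length` are hypothetical, and the custom initializer here may differ in detail:

```python
import math
import random

def glorot_normal_std(fan_in, fan_out):
    # Common Glorot/Xavier normal convention: Var(W) = 2 / (fan_in + fan_out).
    return math.sqrt(2.0 / (fan_in + fan_out))

def sample_glorot_normal(rows, cols, rng):
    # Assumption: rows = fan_out, cols = fan_in (the formula is symmetric in both).
    std = glorot_normal_std(cols, rows)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

length = 1024                 # hypothetical parameter count
side = math.isqrt(length)     # 32, so side * side == length

# Same number of parameters, two different sampling shapes.
std_flat = glorot_normal_std(length, 1)      # 1 x length vector
std_square = glorot_normal_std(side, side)   # sqrt(length) x sqrt(length) matrix

rng = random.Random(0)
flat = sample_glorot_normal(1, length, rng)
square = sample_glorot_normal(side, side, rng)

print(std_flat, std_square, std_square / std_flat)
```

Under that assumption the square shape gives a standard deviation about 4x larger than the flat vector for the same parameter count, which would be enough to shift downstream metrics.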
Model size makes the network less variable w.r.t. inputs: small models change their outputs much more than large models. Large models take the same action every step, whereas small models do not.