The ResNeSt paper says it uses Kaiming normal initialization, but the code uses the PyTorch default, which is Kaiming uniform with a=sqrt(5).
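For context, the PyTorch default (Kaiming uniform with a=sqrt(5), where a is treated as a LeakyReLU negative slope) collapses to the old-style uniform bound of 1/sqrt(fan_in). A minimal sketch of that arithmetic, mirroring the gain formula from torch.nn.init (the fan_in value is illustrative):

```python
import math

def kaiming_uniform_bound(fan_in: int, a: float = 0.0) -> float:
    # gain for leaky_relu with negative slope a, as in torch.nn.init.calculate_gain
    gain = math.sqrt(2.0 / (1.0 + a ** 2))
    # uniform samples are drawn from [-bound, bound]
    return gain * math.sqrt(3.0 / fan_in)

fan_in = 64
torch_default = kaiming_uniform_bound(fan_in, a=math.sqrt(5))
print(torch_default)            # equals 1/sqrt(64) = 0.125
print(1.0 / math.sqrt(fan_in))  # same bound
```

So with a=sqrt(5) the gain sqrt(2/6) exactly cancels the sqrt(3) factor, which is why the default is effectively the historic uniform(-1/sqrt(fan_in), 1/sqrt(fan_in)) scheme rather than a ReLU-tuned one.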
There are a few options going forward:
1. Set all the models to use Kaiming [normal or uniform] initialization, which has been shown to work best with ReLU activations. With this decision, we would probably deviate from the PyTorch default gain (which is tuned for LeakyReLU) to a gain suitable for vanilla ReLU.
2. Set all the models to the initialization described in their respective papers.
3. Provide no default, forcing users to select an initialization, but suggest suitable candidates in the docstring.
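Under option 1, the vanilla-ReLU gain of sqrt(2) gives weights with std = sqrt(2 / fan_in). A quick NumPy sketch of that sampling rule (shapes and seed are illustrative, not from the codebase):

```python
import numpy as np

def kaiming_normal(shape, fan_in, rng, gain=np.sqrt(2.0)):
    # Kaiming normal for vanilla ReLU: std = gain / sqrt(fan_in), gain = sqrt(2)
    std = gain / np.sqrt(fan_in)
    return rng.normal(0.0, std, size=shape)

rng = np.random.default_rng(0)
w = kaiming_normal((512, 512), fan_in=512, rng=rng)
print(w.std())  # close to sqrt(2 / 512) ~= 0.0625
```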
Currently, all layers use the default Flax initialization (lecun_normal for Dense and Conv kernels), while each paper specifies a different strategy.