Closed chenchr closed 5 years ago
We do find that our implementation has some numerical issues. It seems to be less numerically stable than the PyTorch built-in BN, but in our experiments it does not empirically hurt performance.
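As a hedged illustration of where such instability can come from (this is not the repo's actual code): BN variants that accumulate a sum and a sum of squares recover the variance as `E[x^2] - mean^2`, which cancels catastrophically in float32 when the mean is large relative to the spread, whereas a two-pass formula stays accurate.

```python
import torch

torch.manual_seed(0)

# Data with a large mean and a small spread: true variance is about 1e-4.
x = torch.randn(100000, dtype=torch.float32) * 0.01 + 100.0

# One-pass estimate via sum-of-squares (the potentially unstable formula
# implied by a sum_square-style operator).
var_onepass = (x * x).mean() - x.mean() ** 2

# Two-pass estimate (subtract the mean first), which is numerically safer.
var_twopass = ((x - x.mean()) ** 2).mean()

print(var_onepass.item(), var_twopass.item())
```

In float32, the one-pass result is dominated by rounding error at the magnitude of `mean^2`, while the two-pass result stays near the true value; this is one plausible reason a custom BN can diverge from the built-in one only at single precision.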
Closing for now. Feel free to reopen if you have further questions.
Hello. Thanks for the awesome code. I read this code base and the adaptation from zhanghang1989. His code defines two custom operators to save GPU memory. I have combined the CUDA extensions for PyTorch 0.4.1. Although I have passed gradcheck for both the bn operator and the sum_square operator, and compared each operator's output against the output of an imperative PyTorch implementation, I cannot pass the test case provided here... Do you have any suggestions about the numerical stability?
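One thing worth ruling out: `torch.autograd.gradcheck` is meant to be run in double precision, so passing it says little about float32 behaviour, which is exactly where sum-of-squares reductions lose precision. A minimal sketch (the `sum_square` below is a hypothetical stand-in for the custom operator, not the repo's code):

```python
import torch
from torch.autograd import gradcheck

def sum_square(x):
    # Reference sum / sum-of-squares reduction, analogous to the
    # custom sum_square operator discussed in this thread.
    return x.sum(dim=0), (x * x).sum(dim=0)

# gradcheck compares analytic and numerical Jacobians; it needs float64
# inputs with requires_grad=True to be meaningful.
x = torch.randn(8, 4, dtype=torch.float64, requires_grad=True)
ok = gradcheck(sum_square, (x,), eps=1e-6, atol=1e-4)
print(ok)  # -> True
```

If your operator passes a check like this but still fails the float32 test case, the discrepancy is more likely accumulated rounding in the forward reduction than a wrong gradient.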