mapillary / inplace_abn

In-Place Activated BatchNorm for Memory-Optimized Training of DNNs
BSD 3-Clause "New" or "Revised" License

Function 'InPlaceABNBackward' returned nan values in its 1th output #172

Open MendelXu opened 4 years ago

MendelXu commented 4 years ago

I'm training a semantic segmentation model, and I replaced the batch norm layers in the ResNet-101 backbone with InPlaceABNSync. However, it raises the error below:

Function 'InPlaceABNBackward' returned nan values in its 1th output

I have looked at https://github.com/mapillary/inplace_abn/issues/4, but the fix from that issue seems to have disappeared in the newest version. Should I add it again, or is there another solution?

ducksoup commented 4 years ago

@MendelXu the code for #4 has moved here: https://github.com/mapillary/inplace_abn/blob/master/inplace_abn/functions.py#L136

That said, the error you are reporting is probably not connected to it: the 1th output of InPlaceABNBackward mentioned in the error string is the gradient w.r.t. the layer's input, not the gradient w.r.t. its weights as in #4.

Could you by any chance save a snapshot of the layer's inputs and outputs when the error occurs and provide it to us for debugging?
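
One possible way to capture such a snapshot is sketched below with standard PyTorch module hooks. This is only an illustration under stated assumptions, not code from inplace_abn: `SnapshotRecorder` and the file name `nan_snapshot.pt` are hypothetical, `nn.BatchNorm2d` stands in for the InPlaceABNSync layers so the example runs on CPU, and each hooked layer is assumed to take a single tensor and return a single tensor. The input is cloned in a forward pre-hook because an in-place layer overwrites its input buffer during the forward pass.

```python
import torch
import torch.nn as nn


class SnapshotRecorder:
    """Hypothetical helper: keeps CPU copies of the latest input and output of selected layers."""

    def __init__(self, model, layer_types):
        self.snapshots = {}
        self.handles = []
        for name, module in model.named_modules():
            if isinstance(module, layer_types):
                self.handles.append(
                    module.register_forward_pre_hook(self._make_pre_hook(name)))
                self.handles.append(
                    module.register_forward_hook(self._make_post_hook(name)))

    def _make_pre_hook(self, name):
        def hook(module, args):
            # Clone before forward runs: an in-place layer would overwrite this tensor.
            self.snapshots.setdefault(name, {})["input"] = args[0].detach().cpu().clone()
        return hook

    def _make_post_hook(self, name):
        def hook(module, args, output):
            # Assumes the layer returns a single tensor.
            self.snapshots.setdefault(name, {})["output"] = output.detach().cpu().clone()
        return hook

    def save(self, path):
        # Writes a dict {layer_name: {"input": tensor, "output": tensor}}.
        torch.save(self.snapshots, path)

    def remove(self):
        for handle in self.handles:
            handle.remove()


# Stand-in model; in the real setup the hooked type would be the ABN layer class.
model = nn.Sequential(nn.Conv2d(3, 4, 3), nn.BatchNorm2d(4), nn.ReLU())
recorder = SnapshotRecorder(model, (nn.BatchNorm2d,))

x = torch.randn(2, 3, 8, 8)
try:
    # Anomaly detection is what produces the "returned nan values" error.
    with torch.autograd.detect_anomaly():
        loss = model(x).mean()
        loss.backward()
except RuntimeError:
    # Dump the tensors seen in the failing iteration, then re-raise.
    recorder.save("nan_snapshot.pt")
    raise
finally:
    recorder.remove()
```

Because the recorder overwrites its entries on every forward pass, the saved file only contains tensors from the iteration that failed. The copies add memory and time overhead, so hooking only the suspect layers is likely preferable to hooking the whole backbone.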