Can you check that there are no `inf` or `nan` values in the network output `nnet_output`?
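Something along these lines would do it (a minimal sketch, assuming `nnet_output` is a plain PyTorch tensor; the helper name is made up):

```python
import torch

def assert_finite(nnet_output: torch.Tensor, batch_idx: int) -> None:
    # torch.isfinite is False for both inf and nan entries.
    if not torch.isfinite(nnet_output).all():
        num_bad = (~torch.isfinite(nnet_output)).sum().item()
        raise ValueError(
            f"Batch {batch_idx}: nnet_output contains {num_bad} inf/nan values"
        )
```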
There will definitely be inf's or nan's in the nnet output. I think the mis-ordering is most likely just stdout and stderr getting mixed up due to buffering. I seem to remember having this problem with an alimdl at one point (it was generating inf's/nan's), and I'm trying to figure out how I solved it.
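If the mis-ordering really is buffering, one way to rule it out (an assumption about the setup, not something that was actually run) is to make the log streams line-buffered early in the script, e.g.:

```python
import sys

# Make stdout/stderr line-buffered so training and validation log lines
# interleave in the order they were produced (same effect as `python -u`).
sys.stdout.reconfigure(line_buffering=True)
sys.stderr.reconfigure(line_buffering=True)
```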
I think maybe it was the following: if the alimdl has batch-norm and CNN layers, and we are unlucky, the size of the activations can increase geometrically with depth; and if we are using half-precision, there is some chance that this will lead to numerical overflow. The easiest fix is just to test for this and ignore those batches.
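A minimal sketch of that test-and-skip idea in a single-GPU loop (the names `model`, `train_loader`, `compute_loss`, and `optimizer` are placeholders, not the actual training code):

```python
import logging
import torch

def train_one_epoch(model, train_loader, compute_loss, optimizer):
    for batch_idx, batch in enumerate(train_loader):
        nnet_output = model(batch)
        # If half-precision activations overflowed somewhere in the network,
        # nnet_output will contain inf/nan; ignore this batch entirely.
        if not torch.isfinite(nnet_output).all():
            logging.warning("Skipping batch %d: non-finite nnet_output", batch_idx)
            optimizer.zero_grad()
            continue
        loss = compute_loss(nnet_output, batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```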
Cool, that makes sense. I don't think I can skip a batch in multi-GPU training, because one of the processes would then skip the synchronization points in the optimizer/metric aggregation and training would freeze. Instead, I will set the alimdl's nan/inf values to zero before adding its output to nnet_output; hopefully that helps and is unlikely to break things.
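Roughly like this (a sketch only; `ali_output` is a made-up name for the alimdl's output tensor):

```python
import torch

# Replace non-finite entries in the alignment model's output with zero
# before adding it to the main network's output, so a few bad values
# don't poison the loss or force the whole distributed step to be skipped.
ali_output = torch.nan_to_num(ali_output, nan=0.0, posinf=0.0, neginf=0.0)
nnet_output = nnet_output + ali_output
```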
This is very awkward. When I submit the training job to the COE grid (this started happening only after I added the alimdl), I get the following error. The weird thing is that both validation and training seemed to keep running for some time after it, until the error finally "reached" Python...
I wanted to debug it with gdb/cuda-gdb, but when I grabbed an interactive session, the training just worked and got past that point. I tried several times with different numbers of GPUs: it never happens in an interactive session and always happens in the submitted job. I am out of ideas, but I'm posting it here in case it rings a bell for somebody.