666zz666 opened this issue 5 years ago
I haven't tried fp16 in PyTorch. Do you think it's due to a type mismatch, fp32 vs. fp16? It would be great if you could add a try-except in the forward method of the batch norm class, so we can first check whether any exceptions are being thrown there.
Thanks for your help. First, I am using two GPUs. Second, I added a try-except in the forward method of the _SynchronizedBatchNorm class (batchnorm.py). Then I located the error step by step:
if self._parallel_id == 0:
    try:
        mean, inv_std = self._sync_master.run_master(
            _ChildMessage(input_sum, input_ssum, sum_size))
        # run_master() (comm.py) then calls:
        # results = self._master_callback(intermediates)
    except IOError: print('An error occurred trying to read the file.')
    except ValueError: print('Non-numeric data found in the file.')
    except ImportError: print('No module found.')
    except EOFError: print('Why did you do an EOF on me?')
    except KeyboardInterrupt: print('You cancelled the operation.')
    except Exception: print('An error occurred.')
Can you give detailed information about the "error"?
For example, you may directly wrap the whole function body of forward()
with a try-catch statement:
try:
    # original function body of forward()
    ...
except:
    import traceback
    traceback.print_exc()
Traceback (most recent call last):
  File "/mnt/data-2/data/cnnmulti/cnn_multi/sync_batchnorm/batchnorm.py", line 68, in forward
    mean, inv_std = self._sync_master.run_master(_ChildMessage(input_sum, input_ssum, sum_size))
  File "/mnt/data-2/data/cnnmulti/cnn_multi/sync_batchnorm/comm.py", line 125, in run_master
    results = self._master_callback(intermediates)
  File "/mnt/data-2/data/cnnmulti/cnn_multi/sync_batchnorm/batchnorm.py", line 108, in _data_parallel_master
    mean, inv_std = self._compute_meanstd(sum, ssum, sum_size)
  File "/mnt/data-2/data/cnnmulti/cnn_multi/sync_batchnorm/batchnorm.py", line 122, in _compute_meanstd
    mean = sum / size
RuntimeError: value cannot be converted to type at::Half without overflow: 528392
It seems that some values exceed the maximum value of fp16 (half precision can only represent values up to 65504, and the reported 528392 is well beyond that). I guess it's the size? Can you double check?
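For reference, here is a minimal sketch of the failing pattern and one possible workaround. The tensor shape and variable names below are illustrative, not the repo's actual _compute_meanstd code; the idea is just to do the statistics arithmetic in fp32 and cast back afterwards.

import torch

# Illustrative only: per-channel sums in fp16 and a large element count,
# mirroring the `mean = sum / size` line in _compute_meanstd.
channel_sum = torch.randn(64, dtype=torch.half)
size = 528392  # sum_size across GPUs; exceeds the fp16 maximum of 65504

# Failing pattern on the PyTorch version shown in the traceback above:
# mean = channel_sum / size
# -> RuntimeError: value cannot be converted to type at::Half without overflow

# Possible workaround: divide in fp32, then cast back to the input dtype.
mean = (channel_sum.float() / size).to(channel_sum.dtype)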
I am not an expert on this; is there any solution? I think this should be a general problem for fp16 training.
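One commonly used workaround (just a sketch; the wrapper below is hypothetical and has not been tested with this repo's SyncBN) is to keep the batch norm computation in fp32 while the rest of the network runs in fp16, casting at the layer boundary:

import torch.nn as nn

class FP32BatchNorm(nn.Module):
    # Hypothetical wrapper: keeps the wrapped (Sync)BatchNorm in fp32 and
    # casts the fp16 activations up before the BN and back down after it.
    def __init__(self, bn):
        super().__init__()
        self.bn = bn.float()  # parameters and running stats stay in fp32

    def forward(self, x):
        return self.bn(x.float()).to(x.dtype)

# Usage sketch: call model.half() first, then wrap/replace each BN layer, e.g.
# bn = FP32BatchNorm(SynchronizedBatchNorm2d(num_features))

This way the mean/variance reductions and counts like 528392 never have to be represented in half precision.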
When I use fp16 (16-bit float) and multi-GPU training, the code hangs in SyncBN (comm.py).