vacancy / Synchronized-BatchNorm-PyTorch

Synchronized Batch Normalization implementation in PyTorch.
MIT License

about fp16 #22

Open 666zz666 opened 5 years ago

666zz666 commented 5 years ago

When I use fp16 (16-bit float) and multi-GPU training, the code hangs waiting in SyncBN (comm.py).

vacancy commented 5 years ago

I haven't tried fp16 in PyTorch. Do you think it's due to a type mismatch between fp32 and fp16? It would be great if you could help by adding a try-except to the forward method of the batch norm class, so we can first check whether any exceptions are being thrown there.

666zz666 commented 5 years ago

Thanks for your help. First, I am using two GPUs. Second, I added a try-except to the forward method of the _SynchronizedBatchNorm class (batchnorm.py). Then I located the error step by step. The failing calls are:

1. batchnorm.py:

    if self._parallel_id == 0:
        mean, inv_std = self._sync_master.run_master(_ChildMessage(input_sum, input_ssum, sum_size))

2. comm.py:

    results = self._master_callback(intermediates)

The error printed was 'An error occurred.', i.e., only the final bare except matched.

My try-except looks like this:

    try:
        # original body of forward()
    except IOError:
        print('An error occurred trying to read the file.')
    except ValueError:
        print('Non-numeric data found in the file.')
    except ImportError:
        print('No module found.')
    except EOFError:
        print('Why did you do an EOF on me?')
    except KeyboardInterrupt:
        print('You cancelled the operation.')
    except:
        print('An error occurred.')

vacancy commented 5 years ago

Can you give detailed information about the "error"?

For example, you may directly wrap the whole function body of forward() with a try-except statement:

    try:
        # original code of forward()
    except:
        import traceback
        traceback.print_exc()

666zz666 commented 5 years ago

Here is the detailed information:

    Traceback (most recent call last):
      File "/mnt/data-2/data/cnnmulti/cnn_multi/sync_batchnorm/batchnorm.py", line 68, in forward
        mean, inv_std = self._sync_master.run_master(_ChildMessage(input_sum, input_ssum, sum_size))
      File "/mnt/data-2/data/cnnmulti/cnn_multi/sync_batchnorm/comm.py", line 125, in run_master
        results = self._master_callback(intermediates)
      File "/mnt/data-2/data/cnnmulti/cnn_multi/sync_batchnorm/batchnorm.py", line 108, in _data_parallel_master
        mean, inv_std = self._compute_meanstd(sum, ssum, sum_size)
      File "/mnt/data-2/data/cnnmulti/cnn_multi/sync_batchnorm/batchnorm.py", line 122, in _compute_meanstd
        mean = sum / size
    RuntimeError: value cannot be converted to type at::Half without overflow: 528392
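(For context: 528392 is presumably the pooled element count `sum_size`, and it exceeds fp16's largest finite value, 65504. A minimal illustration, independent of this repo:)

    import torch

    # fp16 (half) can represent finite values only up to 65504.
    print(torch.finfo(torch.float16).max)    # 65504.0

    # 528392, the `size` in the traceback above, does not fit in fp16;
    # casting it overflows to inf.
    print(torch.tensor(528392.0).half())     # tensor(inf, dtype=torch.float16)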

vacancy commented 5 years ago

It seems that some value in the tensors exceeds the maximum representable value of fp16 ... I guess it's the `size`? Can you double-check?

I am not an expert on this: is there a known solution? This seems like a general problem for fp16 training.
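A common workaround for this class of problem (a sketch, not code from this repository) is to do the statistics arithmetic in fp32 and cast the results back to the input dtype; mixed-precision recipes such as NVIDIA Apex likewise keep batch norm in fp32. The helper below is hypothetical, loosely mirroring the mean/inv-std computation in batchnorm.py:

    import torch

    def compute_mean_inv_std_fp32(sum_, ssum, size, eps=1e-5):
        """Hypothetical helper: do the reduction arithmetic in fp32 so that
        neither `size` (e.g. 528392) nor the intermediate sums overflow fp16,
        then cast the statistics back to the input dtype."""
        sum32 = sum_.float()                   # upcast per-GPU partial sums
        ssum32 = ssum.float()
        mean = sum32 / size                    # safe in fp32
        var = ssum32 / size - mean * mean      # E[x^2] - E[x]^2
        inv_std = torch.rsqrt(var.clamp(min=0) + eps)
        return mean.to(sum_.dtype), inv_std.to(sum_.dtype)

Note that a full fix would also accumulate the partial sums themselves in fp32 (e.g. via input.float().sum(...)), since ssum can overflow fp16 well before the division does.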