dtype mismatch when after using auto mixed precision

shamafiras commented 2 years ago

🐛 Describe the bug

I am training model that includes 2 fully connected networks. So far I have not seen any issue with them, and I wanted to use a mixed precision training to accelerate the process. I made the required changes to the code but strangely, after several epochs of successful training and loss curve behavious, I get the following error:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2021.1.2\plugins\python-ce\helpers\pydev\pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2021.1.2\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "<path to my code>.py", line 2146, in <module>
    train()
  File "<path to my code>.py", line 2007, in train
    scaler.scale(loss_total).backward()
  File "C:\Users\shama\Anaconda3\envs\pytorch\lib\site-packages\torch\_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Users\shama\Anaconda3\envs\pytorch\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: masked_scatter: expected self and source to have same dtypes but gotHalf and Float

I wonder why would that happen in the middle of training ( after 20% of the training )

The code is huge, but here's its basic structure:

 model = Network()
 # different params trained with different learning rate, which is set in the training loop
 optimizer = pt.optim.Adam([
      {'params': paramsA, 'lr': 0},
      {'params': paramsB, 'lr': 0},
      {'params': paramsC, 'lr': 0} ])
 writer = SummaryWriter(..)
 for epoch in range(start_epoch, args.epochs+1):
    epoch_loss_total = 0
    epoch_mse = 0

    model.train()

    for i, sample in enumerate(dataloader_train):
      # print("step: {}".format(i))
      setLearningRate(optimizer, epoch)
      optimizer.zero_grad()

      sel = selectPixels()   

      gt = sample ['image']
      gt = gt.view(gt.shape[0], gt.shape[1], gt.shape[2] * gt.shape[3])
      gt = gt[:, :, sel, None].cuda()

      with autocast(enabled=args.mixed_precision):

        output, in_range_selection = model(sample, sel)
        gt = gt[:, :, in_range_selection,:]

        mse = pt.mean((output - gt) ** 2)

        loss_total = mse

        loss2 = 0
        if SomeCondition1():
          loss2= args.l2_weight * pt.mean((pt.sigmoid(model.feature)))        

        loss_total = loss_total + loss2

        # another loss
        loss_total = loss_total + args.l3_weight * func(gt, output)

        # for loss statistics
        epoch_loss_total += loss_total
        epoch_mse += mse

      # this section is out of autocast
      scaler.scale(loss_total).backward()

      scaler.step(optimizer)

      scaler.update()

      step += 1
    # this section is out of samples loop 
    if someCondition2():
      with pt.no_grad():
        out_image = evaluateResults1(model)

        writer.add_image('images/res', out_image , 2), epoch)        
        pt.cuda.empty_cache()

    if someCondition3():
      dumpResults(model)
      pt.cuda.empty_cache()

    # report mean value
    mean = pt.mean(model.mlp)
    writer.add_scalar('loss/mean', mean, epoch)

    if someCondition4():
      checkpoint(ckpt, model, optimizer, epoch+1)

Versions

Collecting environment information... PyTorch version: 1.9.0 Is debug build: False CUDA used to build PyTorch: 11.1 ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Home GCC version: Could not collect Clang version: Could not collect CMake version: Could not collect Libc version: N/A

Python version: 3.8.10 (default, May 19 2021, 13:12:57) [MSC v.1916 64 bit (AMD64)] (64-bit runtime) Python platform: Windows-10-10.0.19044-SP0 Is CUDA available: True CUDA runtime version: 11.3.109 GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3070 Nvidia driver version: 472.39 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True

Versions of relevant libraries: [pip3] numpy==1.20.2 [pip3] torch==1.9.0 [pip3] torch-tb-profiler==0.2.1 [pip3] torchvision==0.10.0 [conda] blas 1.0 mkl [conda] cudatoolkit 11.1.1 heb2d755_7 conda-forge [conda] mkl 2021.2.0 haa95532_296 [conda] mkl-service 2.3.0 py38h2bbff1b_1 [conda] mkl_fft 1.3.0 py38h277e83a_2 [conda] mkl_random 1.2.1 py38hf11a4ad_2 [conda] numpy 1.20.2 py38ha4e8547_0 [conda] numpy-base 1.20.2 py38hc2deb75_0 [conda] pytorch 1.9.0 py3.8_cuda11.1_cudnn8_0 pytorch [conda] torch-tb-profiler 0.2.1 pypi_0 pypi [conda] torchvision 0.10.0 py38_cu111 pytorch

cc @mcarilli @ptrblck

ptrblck commented 2 years ago

I wonder why would that happen in the middle of training ( after 20% of the training )

I don't know why the error should be raised after a few successful iterations and can't speculate on the root cause. If you are seeing the same issue using the latest stable or nightly release, please ping me once you can provide a minimal, executable code snippet so that we could debug it, please.

shamafiras commented 2 years ago

@ptrblck Thanks for your support. I tried running the same process on latest stable (versions below). still got the same error. Unfortunately, it would take much time to deliver a minimal code that reproduce the error. but I'll try.

Collecting environment information...
PyTorch version: 1.12.0
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-15)
Clang version: Could not collect
CMake version: version 3.22.3
Libc version: glibc-2.26

Python version: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21)  [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-4.14.285-215.501.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: True
CUDA runtime version: 11.6.124
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 510.73.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] torch==1.12.0
[pip3] torch-model-archiver==0.5.3b20220226
[pip3] torch-workflow-archiver==0.2.4b20220513
[pip3] torchaudio==0.12.0
[pip3] torchserve==0.6.0b20220513
[pip3] torchtext==0.13.0
[pip3] torchvision==0.13.0
[conda] blas                      2.115                       mkl    conda-forge
[conda] blas-devel                3.9.0            15_linux64_mkl    conda-forge
[conda] captum                    0.5.0                         0    pytorch
[conda] cudatoolkit               11.6.0              hecad31d_10    conda-forge
[conda] libblas                   3.9.0            15_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            15_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            15_linux64_mkl    conda-forge
[conda] liblapacke                3.9.0            15_linux64_mkl    conda-forge
[conda] magma-cuda116             2.6.1                         0    pytorch
[conda] mkl                       2022.1.0           h84fe81f_915    conda-forge
[conda] mkl-devel                 2022.1.0           ha770c72_916    conda-forge
[conda] mkl-include               2022.1.0           h84fe81f_915    conda-forge
[conda] numpy                     1.22.4                   pypi_0    pypi
[conda] pytorch                   1.12.0          py3.9_cuda11.6_cudnn8.3.2_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torch-model-archiver      0.5.3                    py39_0    pytorch
[conda] torch-workflow-archiver   0.2.4                    py39_0    pytorch
[conda] torchaudio                0.12.0               py39_cu116    pytorch
[conda] torchserve                0.6.0                    py39_0    pytorch
[conda] torchtext                 0.13.0                     py39    pytorch
[conda] torchvision               0.13.0               py39_cu116    pytorch

shamafiras commented 1 year ago

Hey, It took me a while, but I was able to reproduce the issue on available Colab notebook after I applied mixed precision. To reproduce the bug:

Use the following Colab https://colab.research.google.com/drive/1hXVvYdAwLA0EFg2zrafJUE0bFgB_F7PU#scrollTo=TFbN4mrJCp8o&sandboxMode=true
in the Configuration section, set epochs as 100 ( to give it some time to crash )
run the notebook cells ( setups mainly ) until the training part.
open the training file under Files>>nex-code\train.py and add the auto mixed precision hints, as I did it on lines 632 changed.txt
add the required import , and scaler definition scaler = GradScaler() somewhere above
run the training cell & wait. for me it failed after 38% of the epochs: