pytorch / pytorch

res[i].defined() INTERNAL ASSERT FAILED #107928

Closed DarkMnDragon closed 1 year ago

DarkMnDragon commented 1 year ago

🐛 Describe the bug

Epoch 103 / 250:  30%|▎| 86/284 [01:29<03:17,  1.00it/s, gen_ab=0.319, gen_ba=0.284, cycle_a=0.166, cycle_b=0.159, disc_
Traceback (most recent call last):
  File "/home/lrl/unext/scripts/train/selfie2anime/cyclegan_selfie2anime-256_repunet.py", line 184, in <module>
    train(args_dict)
  File "/home/lrl/unext/uvcgan/train/train.py", line 71, in train
    metrics = training_epoch(
              ^^^^^^^^^^^^^^^
  File "/home/lrl/unext/uvcgan/train/train.py", line 24, in training_epoch
    model.optimization_step()
  File "/home/lrl/unext/uvcgan/cgan/cyclegan.py", line 215, in optimization_step
    self.backward_discriminators()
  File "/home/lrl/unext/uvcgan/cgan/cyclegan.py", line 161, in backward_discriminators
    self.losses.disc_a = self.backward_discriminator_base(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lrl/unext/uvcgan/cgan/cyclegan.py", line 146, in backward_discriminator_base
    loss += cal_gradient_penalty(
            ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lrl/unext/uvcgan/base/losses.py", line 132, in cal_gradient_penalty
    gradients = torch.autograd.grad(
                ^^^^^^^^^^^^^^^^^^^^
  File "/home/lrl/anaconda3/envs/unext/lib/python3.11/site-packages/torch/autograd/__init__.py", line 303, in grad
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: res[i].defined() INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1682343995622/work/torch/csrc/autograd/functions/tensor.cpp":142, please report a bug to PyTorch.

The code (from `backward_discriminator_base` in uvcgan/cgan/cyclegan.py) is:

        #
        # NOTE:
        #   This is a workaround for a PyTorch 1.9.0 bug that manifests when
        #   cudnn is enabled. When the bug is fixed, remove the no_grad block
        #   and replace `model(fake)` with `model(fake.detach())`.
        #
        #   bug: https://github.com/pytorch/pytorch/issues/48439
        #
        with torch.no_grad():
            fake = fake.contiguous()

        pred_fake = model(fake.detach())

        loss_fake = self.criterion_gan(pred_fake, False)

        loss = (loss_real + loss_fake) * 0.5

        if self.gradient_penalty is not None:
            loss += cal_gradient_penalty(
                model, real, fake, real.device, **self.gradient_penalty
            )[0]

        loss.backward()
        return loss
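
For context, `cal_gradient_penalty` (the frame where the assert fires) appears to follow the standard WGAN-GP gradient-penalty recipe. Below is a minimal sketch of that recipe, with illustrative names and defaults rather than the actual uvcgan implementation:

import torch

def gradient_penalty_sketch(model, real, fake, device, lambda_gp=10.0):
    # Random per-sample interpolation between real and fake batches.
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=device)
    interp = (alpha * real + (1 - alpha) * fake.detach()).requires_grad_(True)
    pred = model(interp)
    # The torch.autograd.grad call below is the one shown in the traceback.
    grads = torch.autograd.grad(
        outputs=pred,
        inputs=interp,
        grad_outputs=torch.ones_like(pred),
        create_graph=True,
        retain_graph=True,
        only_inputs=True,
    )[0]
    grads = grads.flatten(start_dim=1)
    return lambda_gp * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

The repro further down in this thread triggers the same assert in isolation.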

Versions

PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Home China
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.9.16 (main, May 17 2023, 17:49:16) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22621-SP0
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU
Nvidia driver version: 536.99
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=2000
DeviceID=CPU0
Family=198
L2CacheSize=14336
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2000
Name=12th Gen Intel(R) Core(TM) i7-12800HX
ProcessorType=3
Revision=

Versions of relevant libraries:
[pip3] gpytorch==1.11
[pip3] numpy==1.25.0
[pip3] torch==2.0.1
[pip3] torch-fidelity==0.3.0
[pip3] torchaudio==2.0.2
[pip3] torchvision==0.15.2
[conda] blas 1.0 mkl
[conda] gpytorch 1.11 pypi_0 pypi
[conda] mkl 2023.1.0 h8bd8f75_46356
[conda] mkl-service 2.4.0 py39h2bbff1b_1
[conda] mkl_fft 1.3.6 py39hf11a4ad_1
[conda] mkl_random 1.2.2 py39hf11a4ad_1
[conda] numpy 1.25.0 py39h055cbcc_0
[conda] numpy-base 1.25.0 py39h65a83cf_0
[conda] pytorch 2.0.1 py3.9_cuda11.8_cudnn8_0 pytorch
[conda] pytorch-cuda 11.8 h24eeafa_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch-fidelity 0.3.0 pypi_0 pypi
[conda] torchaudio 2.0.2 pypi_0 pypi
[conda] torchvision 0.15.2 pypi_0 pypi

cc @ezyang @gchanan @zou3519 @albanD @gqchen @pearu @nikitaved @soulitzer @Lezcano @Varal7

soulitzer commented 1 year ago

Thanks for the report; here's a possible repro of that error:

import torch

# `v` is a view into `c`, which is a differentiable clone of the leaf `a`.
a = torch.tensor([1.], requires_grad=True)
c = a.clone()
v = c[:]
b = torch.tensor(1., requires_grad=True)

# A custom Function that modifies its first input in place and returns an
# undefined gradient (None) for its second input.
class InplaceFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, other):
        ctx.mark_dirty(x)
        return x.mul_(2)

    @staticmethod
    def backward(ctx, grad):
        return grad, None

# The in-place op on the view `v` is recorded through a CopySlices node;
# the None returned by backward() reaches it as an undefined tensor and
# trips the internal assert.
out = InplaceFunc.apply(v, b)

torch.autograd.grad(out, inputs=(a, b))
soulitzer commented 1 year ago

What's happening here is that applying the custom Function in-place to a view records its backward inside a CopySlices node, and when that backward returns an undefined gradient (the None for `b` above), CopySlices hits the `res[i].defined()` internal assert.

One thing we could do is just remove the error, since we should be treating undefined tensors as zeros anyway. If we don't wish to allow this, we should at least turn the internal assert into a normal TORCH_CHECK.
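
In the meantime, a custom Function can sidestep the assert on the user side by materializing the zero gradient itself instead of returning None. A sketch of that workaround applied to the repro above (not an official fix):

import torch

class InplaceFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, other):
        ctx.mark_dirty(x)
        ctx.save_for_backward(other)
        return x.mul_(2)

    @staticmethod
    def backward(ctx, grad):
        other, = ctx.saved_tensors
        # Return explicit zeros instead of None so that CopySlices never
        # receives an undefined gradient for `other`.
        return grad, torch.zeros_like(other)

a = torch.tensor([1.], requires_grad=True)
c = a.clone()
v = c[:]
b = torch.tensor(1., requires_grad=True)

out = InplaceFunc.apply(v, b)
torch.autograd.grad(out, inputs=(a, b))  # runs without hitting the assert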

albanD commented 1 year ago

Properly treating undefined as zero gradients sounds fair to me. I guess it just happens that our in-place ops (the ones that get put inside CopySlices) are well behaved and never do this, but a custom Function can, so we should support it.
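
For contrast, the same view setup with a built-in in-place op goes through CopySlices without trouble, since ATen's in-place backwards always return defined gradients (a small illustrative sketch, not from the thread):

import torch

a = torch.tensor([1.], requires_grad=True)
c = a.clone()
v = c[:]
b = torch.tensor(1., requires_grad=True)

# A built-in in-place multiply on the view is also recorded via CopySlices,
# but mul_'s backward produces defined gradients for both operands.
v.mul_(b)
torch.autograd.grad(v.sum(), inputs=(a, b))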