RuntimeError When Enabling Accuracy Checks in DALLE2_pytorch Training on GPU.

cjxjxjx commented 4 months ago

Issue Description

I encounter a RuntimeError related to gradient computation when enabling accuracy checks during the training of DALLE2_pytorch in a GPU Docker environment. The training runs without issues when the --accuracy flag is not used.

Steps to Reproduce

python install.py DALLE2_pytorch
python run.py DALLE2_pytorch -d cuda -t train --accuracy

Expected Behavior

Training should complete and the accuracy check should run without raising a runtime error.

Actual Behavior

The script executes successfully without the --accuracy flag. However, when the accuracy check is enabled, it fails with the following error message:

fp64 golden ref were not generated for DALLE2_pytorch. Setting accuracy check to cosine
element 0 of tensors does not require grad and does not have a grad_fn
Traceback (most recent call last):
  File "/benchmark/torchbenchmark/util/env_check.py", line 635, in check_accuracy
    correct_result = run_n_iterations(
  File "/benchmark/torchbenchmark/util/env_check.py", line 504, in run_n_iterations
    _model_iter_fn(mod, inputs, contexts, optimizer, collect_outputs=False)
  File "/benchmark/torchbenchmark/util/env_check.py", line 497, in _model_iter_fn
    return forward_and_backward_pass(
  File "/benchmark/torchbenchmark/util/env_check.py", line 480, in forward_and_backward_pass
    DummyGradScaler().scale(loss).backward(retain_graph=True)
  File "/venv_cuda/pytorch/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/venv_cuda/pytorch/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
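
For readers unfamiliar with this message: PyTorch raises it whenever backward() is called on a tensor that is not attached to an autograd graph. The sketch below reproduces the same message with a toy model. It is only an illustration of the error class; the Linear layer and the use of torch.no_grad() are assumptions made for the example and are not claimed to be what happens inside DALLE2_pytorch or TorchBench.

```python
import torch

# A toy model standing in for the real network (hypothetical, not DALLE2_pytorch).
model = torch.nn.Linear(4, 1)
inputs = torch.randn(2, 4)

# Case 1: the forward pass runs with autograd recording disabled,
# so the resulting loss has no grad_fn and does not require grad.
with torch.no_grad():
    detached_loss = model(inputs).sum()

print(detached_loss.requires_grad, detached_loss.grad_fn)  # False None

try:
    detached_loss.backward()
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
print(error_message)

# Case 2: the same forward pass with autograd recording enabled.
# backward() succeeds and populates the parameter gradients.
tracked_loss = model(inputs).sum()
tracked_loss.backward()
print(model.weight.grad is not None)
```

Any loss built only from detached tensors, non-differentiable values, or tensors produced under no_grad will fail in the same way, which is why the traceback ends in the backward() call rather than in the model code itself.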

Additional Context

PyTorch version: 2.2.2
CUDA version: 12.4.0.041

xuzhao9 commented 4 months ago

I can confirm that this is reproducible in the Docker environment. @FindHao, can you help take a look at this issue?

FindHao commented 4 months ago

@xuzhao9 The problem also occurs on a previous version of TorchBench (ghcr.io/pytorch/torchbench:dev20230619). It looks like it has been present since DALLE2 was first included in TorchBench. I'm not sure whether we can fix it on our side or whether it has to be fixed in the upstream repo, since we have limited control over the model's __init__.py. I'll give it a try.
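
One way to narrow down whether the model's forward pass is what detaches the loss is to inspect the outputs before they are reduced to a scalar. The helper below is a minimal sketch under that assumption: tensors_without_grad is a hypothetical name, it is not part of TorchBench or DALLE2_pytorch, and the Linear layer only stands in for a real model.

```python
import torch


def tensors_without_grad(output, path="output"):
    """Return the paths of floating-point tensors in a nested output
    (tensor, dict, list, or tuple) that are not attached to autograd."""
    missing = []
    if isinstance(output, torch.Tensor):
        if output.is_floating_point() and not output.requires_grad:
            missing.append(path)
    elif isinstance(output, dict):
        for key, value in output.items():
            missing += tensors_without_grad(value, f"{path}[{key!r}]")
    elif isinstance(output, (list, tuple)):
        for index, value in enumerate(output):
            missing += tensors_without_grad(value, f"{path}[{index}]")
    return missing


# Hypothetical stand-in for a model output with one tracked and one detached tensor.
layer = torch.nn.Linear(3, 2)
x = torch.randn(5, 3)
example = {"tracked": layer(x), "detached": layer(x).detach()}

report = tensors_without_grad(example)
print(report)  # ["output['detached']"]
```

If every tensor returned by the model shows up in such a report, the graph is being dropped inside the model code; if none do, the detachment happens later, in whatever reduces the outputs to a loss.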

xuzhao9 commented 2 months ago

We are dropping DALLE2_pytorch because it does not support numpy 2.0: https://github.com/pytorch/benchmark/pull/2311

pytorch/benchmark, issue #2241