k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

RuntimeError: CUDA error: invalid configuration argument #702

Closed huangruizhe closed 1 year ago

huangruizhe commented 1 year ago

When I was trying a zipformer (pruned_transducer_stateless7) on spgispeech, I did the following:

python pruned_transducer_stateless7/train.py --world-size 2 --max-duration 250

I got the following error after the training had run for a while:

2022-11-22 23:32:27,044 INFO [train.py:876] (1/2) Epoch 1, batch 14850, loss[loss=0.4185, simple_loss=0.4207, pruned_loss=0.2081, over 4933.00 frames. ], tot_loss[loss=0.3828, simple_loss=0.3939, pruned_loss=0.1859, over 1194952.18 frames. ], batch size: 40, lr: 2.82e-02,
2022-11-22 23:32:27,046 INFO [train.py:876] (0/2) Epoch 1, batch 14850, loss[loss=0.3625, simple_loss=0.366, pruned_loss=0.1795, over 6045.00 frames. ], tot_loss[loss=0.3822, simple_loss=0.3933, pruned_loss=0.1856, over 1200748.57 frames. ], batch size: 17, lr: 2.82e-02,
2022-11-22 23:32:33,171 INFO [zipformer.py:1414] (0/2) attn_weights_entropy = tensor([2.0664, 2.2180, 2.0900, 2.1552, 1.5869, 1.5967, 1.0766, 1.7589],
       device='cuda:0'), covar=tensor([0.0361, 0.0796, 0.3119, 0.0494, 0.0348, 0.0510, 0.0635, 0.0371],
       device='cuda:0'), in_proj_covar=tensor([0.0022, 0.0021, 0.0021, 0.0026, 0.0021, 0.0027, 0.0025, 0.0024],
       device='cuda:0'), out_proj_covar=tensor([3.2804e-05, 3.6589e-05, 3.2693e-05, 4.0759e-05, 3.4965e-05, 4.2606e-05,
        4.0940e-05, 3.8753e-05], device='cuda:0')
2022-11-22 23:32:38,075 INFO [train.py:1134] (0/2) Saving batch to pruned_transducer_stateless7/exp/batch-bdd640fb-0667-1ad1-1c80-317fa3b1799d.pt
2022-11-22 23:32:38,115 INFO [train.py:1140] (0/2) features shape: torch.Size([40, 621, 80])
2022-11-22 23:32:38,117 INFO [train.py:1144] (0/2) num tokens: 1462
Traceback (most recent call last):
  File "pruned_transducer_stateless7/train.py", line 1207, in <module>
    main()
  File "pruned_transducer_stateless7/train.py", line 1198, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/hltcoe/rhuang/icefall/egs/spgispeech/ASR/pruned_transducer_stateless7/train.py", line 1078, in run
    train_one_epoch(
  File "/home/hltcoe/rhuang/icefall/egs/spgispeech/ASR/pruned_transducer_stateless7/train.py", line 809, in train_one_epoch
    scaler.scale(loss).backward()
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

It does not seem to be an OOM error. With --max-duration 300, this error can occur as early as batch 50. On the other hand, with the default --max-duration 100, training runs fine for many batches, but GPU memory usage is very low. Do you know what the issue may be?
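As the error message suggests, a first debugging step is to set CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so the traceback points at the kernel that actually failed instead of a later API call. A minimal sketch (the variable must be in the environment before PyTorch initializes CUDA, so set it at the very top of the script or export it on the command line when launching train.py):

```python
# Hedged sketch: CUDA_LAUNCH_BLOCKING must be set before PyTorch
# initializes CUDA, so do it before importing torch (or export it on
# the shell command line when launching train.py).
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # make kernel launches synchronous

# Import torch only after the variable is set; with blocking launches,
# the traceback from the failing backward pass names the real kernel.
try:
    import torch  # noqa: F401
except ImportError:
    pass  # torch is not installed in this environment
```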

csukuangfj commented 1 year ago

Which version of CUDA and PyTorch are you using?

huangruizhe commented 1 year ago

CUDA 11.1 and PyTorch 1.10.0
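These versions can be confirmed from inside the environment; `torch.version.cuda` reports the CUDA toolkit the PyTorch wheel was built against, which is the version relevant here (a small sketch, assuming only a standard PyTorch install; the `version_tuple` helper is added for illustration):

```python
# Hedged sketch: report the installed PyTorch version and the CUDA
# toolkit it was built against. torch.version.cuda is the build-time
# CUDA version, which is the one relevant to this issue.

def version_tuple(v):
    """Turn a version string like '11.1' into a comparable tuple (11, 1)."""
    return tuple(int(part) for part in v.split("."))

try:
    import torch

    print("PyTorch:", torch.__version__)        # e.g. "1.10.0"
    print("CUDA (build):", torch.version.cuda)  # e.g. "11.1"; None for CPU-only builds
    if torch.version.cuda and version_tuple(torch.version.cuda) == (11, 1):
        print("This build uses CUDA 11.1, the version associated with this issue")
except ImportError:
    print("PyTorch is not installed in this environment")
```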

csukuangfj commented 1 year ago

Could you switch to another CUDA version, e.g., CUDA 10.2?

RuntimeError: CUDA error: invalid configuration argument

Most people who hit this issue are using CUDA 11.1.

huangruizhe commented 1 year ago

Sure, I will try. Thanks for the suggestion!

csukuangfj commented 1 year ago

For future reference, the following related issues also involve CUDA 11.1:

danpovey commented 1 year ago

Looks like this is most likely a PyTorch bug that we just happen to be triggering, so it would probably be easiest to try different versions of PyTorch and/or CUDA, because we would not be able to fix this ourselves.

huangruizhe commented 1 year ago

After switching to CUDA 10.2, the issue is resolved. Thanks a lot!
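Before restarting the full run on a new CUDA build, a tiny forward/backward pass on the GPU is a quick way to confirm that autograd kernels launch cleanly; a minimal sanity-check sketch, not part of icefall (the original failure happened inside `loss.backward()`, so this exercises the same autograd path):

```python
# Hedged sanity check, not part of icefall: run a tiny forward/backward
# pass on the GPU. The original failure occurred inside loss.backward(),
# so this exercises the same autograd path on the new CUDA build.
def gpu_backward_ok():
    try:
        import torch
    except ImportError:
        return None  # torch is not installed in this environment
    if not torch.cuda.is_available():
        return None  # no GPU visible; nothing to check
    x = torch.randn(8, 16, device="cuda", requires_grad=True)
    (x * 2.0).sum().backward()  # would raise the CUDA error if still broken
    return x.grad is not None

print(gpu_backward_ok())
```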

desh2608 commented 1 year ago

(We can use --max-duration 600 and GPU memory utilization is very good.)