microsoft / muzic

Muzic: Music Understanding and Generation with Artificial Intelligence

[Museformer] RuntimeError: CUDA error on RTX 3090 #108

Open ZZDoog opened 1 year ago

ZZDoog commented 1 year ago

Hello! When I run ttrain/mf-lmd6remi-1.sh in the Docker image provided by the authors on an RTX 3090, the following error occurs.

Traceback (most recent call last):
  File "/opt/miniconda/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/opt/miniconda/lib/python3.8/site-packages/fairseq_cli/train.py", line 352, in cli_main
    distributed_utils.call_main(args, main)
  File "/opt/miniconda/lib/python3.8/site-packages/fairseq/distributed_utils.py", line 301, in call_main
    main(args, **kwargs)
  File "/opt/miniconda/lib/python3.8/site-packages/fairseq_cli/train.py", line 125, in main
    valid_losses, should_stop = train(args, trainer, task, epoch_itr)
  File "/opt/miniconda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/miniconda/lib/python3.8/site-packages/fairseq_cli/train.py", line 208, in train
    log_output = trainer.train_step(samples)
  File "/opt/miniconda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/miniconda/lib/python3.8/site-packages/fairseq/trainer.py", line 512, in train_step
    raise e
  File "/opt/miniconda/lib/python3.8/site-packages/fairseq/trainer.py", line 480, in train_step
    loss, sample_size_i, logging_output = self.task.train_step(
  File "/opt/miniconda/lib/python3.8/site-packages/fairseq/tasks/fairseq_task.py", line 416, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "/opt/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/miniconda/lib/python3.8/site-packages/fairseq/criterions/cross_entropy.py", line 35, in forward
    net_output = model(**sample["net_input"])
  File "/opt/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/miniconda/lib/python3.8/site-packages/fairseq/models/fairseq_model.py", line 481, in forward
    return self.decoder(src_tokens, **kwargs)
  File "/opt/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/museformer/museformer/museformer_decoder.py", line 413, in forward
    x, extra = self.extract_features(
  File "/root/museformer/museformer/museformer_decoder.py", line 645, in extract_features
    (sum_x, reg_x), inner_states = self.run_layers(
  File "/root/museformer/museformer/museformer_decoder.py", line 731, in run_layers
    x, _ = layer(
  File "/opt/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/museformer/museformer/museformer_decoder_layer.py", line 339, in forward
    x, attn = self.run_self_attn(
  File "/root/museformer/museformer/museformer_decoder_layer.py", line 404, in run_self_attn
    r, weight = self.self_attn(
  File "/opt/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/museformer/museformer/attention/self_attention_v2s1/rpe_self_attention_v2s1.py", line 265, in forward
    attn_scores_inc_rx = self.do_qk_scores_for_rx(
  File "/root/museformer/museformer/attention/self_attention_v2s1/blocksparse_rpe_self_attention_v2s1.py", line 49, in do_qk_scores_for_rx
    return do_qk_scores_for_part(self, reg_q, k, bsz, reg_len, sum_len + reg_len, attn_mask, 'rx')
  File "/root/museformer/museformer/attention/common/blocksparse_common_operations/qk_mul/qk_mul_1.py", line 49, in do_qk_scores_for_part
    sample_attn_scores = do_sample_qk_scores_base(
  File "/root/museformer/museformer/attention/common/blocksparse_common_operations/qk_mul/qk_mul_1.py", line 24, in do_sample_qk_scores_base
    sdd_matmul = BlocksparseMatMul(sample_layout, self.block_size, 'sdd',
  File "/root/museformer/museformer/blocksparse/optimized_matmul.py", line 515, in __init__
    self.c_lut, self.c_width = sdd_lut(layout, block, device)
  File "/root/museformer/museformer/blocksparse/optimized_matmul.py", line 119, in sdd_lut
    lut = layout.nonzero(as_tuple=False).to(device).int()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
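
The last line of the message is worth acting on first: with asynchronous kernel launches, the Python frame that raises is often not the kernel that actually faulted. A minimal sketch of forcing synchronous launches so the trace lands on the real culprit (the variable must be set before the CUDA context is created, so it has to precede the first GPU operation):

    # Sketch: make CUDA kernel launches synchronous so the stack trace points at
    # the kernel that actually faulted. CUDA_LAUNCH_BLOCKING is read when the
    # CUDA context is created, so set it before anything touches the GPU.
    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch  # imported afterwards so the setting applies to all CUDA work

Equivalently, CUDA_LAUNCH_BLOCKING=1 can be prefixed on the shell command that launches the training script.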

I thought this was due to a lack of CUDA memory, but the program runs only one sample per GPU, so I reduced

 --tokens-per-sample 100000 \
  --truncate-train 15360 \
  --truncate-valid 10240 \

by half, but the error still occurs. I also tried changing UPDATE_FREQ=4 since I am using a single GPU, but that didn't help either. I wonder whether this model can only be trained on a V100 or another GPU with more than 32 GB of memory, or whether this error has some other cause.
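
For what it's worth, a genuine out-of-memory condition normally raises "CUDA out of memory" rather than "an illegal memory access was encountered", so memory size alone may not explain this. A minimal sketch for confirming headroom from inside the training process (the helper name is illustrative, not part of Museformer):

    import torch

    def report_cuda_memory(tag=""):
        # Compare what the allocator is holding against the card's capacity.
        gib = 2 ** 30
        allocated = torch.cuda.memory_allocated() / gib
        reserved = torch.cuda.memory_reserved() / gib
        total = torch.cuda.get_device_properties(0).total_memory / gib
        print(f"[{tag}] allocated {allocated:.2f} GiB, "
              f"reserved {reserved:.2f} GiB, total {total:.2f} GiB")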

btyu commented 1 year ago

Did the error happen at the beginning of training, or did it go well for a while?

ZZDoog commented 1 year ago

> Did the error happen at the beginning of training, or did it go well for a while?

It went well for a while. It always happens during the first epoch.
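
Since the crash is data-dependent (training survives some batches and then dies partway through the first epoch), one hedged guess is that a particular batch length produces a block-sparse layout whose LUT construction indexes out of range. A sketch for logging what each batch looks like, assuming samples has the shape fairseq passes to trainer.train_step (the helper itself is hypothetical):

    def log_batch_shapes(samples):
        # Each element carries net_input.src_tokens, as seen in the traceback.
        for i, sample in enumerate(samples):
            tokens = sample["net_input"]["src_tokens"]
            print(f"batch {i}: tokens shape {tuple(tokens.shape)}, "
                  f"max id {int(tokens.max())}")

If the failing batch is always the longest seen so far, that would point at the layout size rather than at memory.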

btyu commented 1 year ago

Noted. I am looking into it. Thanks for the information.