sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing data mixture weights for language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License

ModuleNotFoundError: No module named 'flash_attn.models.falcon' #22

Open Sniper970119 opened 9 months ago

Sniper970119 commented 9 months ago

I ran bash scripts/setup_flash.sh without error (though it only took a few minutes).

[screenshot of the setup_flash.sh output]

But I got an error when I ran bash scripts/run_pile.sh:

Traceback (most recent call last):
  File "doremi/train.py", line 56, in <module>
    import doremi.models as doremi_models
  File "/storage/home/lanzhenzhongLab/zhaoyu/doremi/doremi/models.py", line 8, in <module>
    from flash_attn.models.gpt import GPTLMHeadModel as GPTLMHeadModelFlash
  File "/storage/home/lanzhenzhongLab/zhaoyu/.conda/envs/zy_doremi/lib/python3.8/site-packages/flash_attn-2.0.4-py3.8-linux-x86_64.egg/flash_attn/models/gpt.py", line 31, in <module>
    from flash_attn.models.falcon import remap_state_dict_hf_falcon
ModuleNotFoundError: No module named 'flash_attn.models.falcon'

What is wrong here?

Also, bash scripts/run_preprocess_pile.sh kept failing until I updated the datasets package to 2.15.0, but the version pinned in setup.py is 2.10.1. Am I doing something wrong?

sangmichaelxie commented 9 months ago

Just pushed an update that should fix the import issue. Could you add more details on the preprocess issue?

Sniper970119 commented 9 months ago

Thanks for your help! I will try it later.

The preprocessing issue is as follows (it runs successfully after updating datasets to 2.15.0):

Traceback (most recent call last):
  File "scripts/preprocess_pile.py", line 135, in <module>
    main()
  File "scripts/preprocess_pile.py", line 87, in main
    ds = load_dataset('json',
  File "/xxxx/zy_doremi/lib/python3.8/site-packages/datasets/load.py", line 1775, in load_dataset
    return builder_instance.as_streaming_dataset(split=split)
  File "/sxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/builder.py", line 1234, in as_streaming_dataset
    raise NotImplementedError(
NotImplementedError: Loading a streaming dataset cached in a LocalFileSystem is not supported yet.
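
For reference, the failing pattern can be reduced to something like the sketch below (the file path is made up; the real arguments are in scripts/preprocess_pile.py). It raises the NotImplementedError above with the pinned datasets version in my environment and works after upgrading to 2.15.0:

```python
# Minimal sketch of the failing call pattern (hypothetical file path):
# streaming local JSON files is what triggers as_streaming_dataset() and,
# with my original datasets version, the NotImplementedError above.
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files={"train": "/path/to/pile/train/00.jsonl"},  # made-up path
    streaming=True,
)

for example in ds["train"]:
    print(list(example.keys()))
    break
```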

Sniper970119 commented 9 months ago

Hello, I hit another error when running the latest code.

File "doremi/train.py", line 405, in <module>
    main()
  File "doremi/train.py", line 342, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/xxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/sxxxxu/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1872, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/xxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/xxxx.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/xxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/sxxxxxx.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/xxxxu/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "/xxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 807, in __iter__
    for element in self.dataset:
  File "/xxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1358, in __iter__
    yield from self._iter_pytorch()
  File "/sxxxxu/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1293, in _iter_pytorch
    for key, example in ex_iterable:
  File "/xxxxu/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 233, in __iter__
    yield from self.generate_examples_fn(**self.kwargs)
  File "/xxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/packaged_modules/generator/generator.py", line 30, in _generate_examples
    for idx, ex in enumerate(self.config.generator(**gen_kwargs)):
  File "/sxxxx/doremi/doremi/dataloader.py", line 339, in take_data_generator
    for ex in ds:
  File "/xxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1358, in __iter__
    yield from self._iter_pytorch()
  File "/sxxxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1274, in _iter_pytorch
    ex_iterable = ex_iterable.shard_data_sources(worker_id=worker_info.id, num_workers=worker_info.num_workers)
TypeError: shard_data_sources() got an unexpected keyword argument 'worker_id'

Wandb has been initialized, and then I get this error. I run the code on an 8×A100 machine. Is something wrong with my config, or with the versions I am using (maybe the datasets version)?

Also, the training log reports a huge Num Epochs, even though num_train_epochs in training_args is still 3.

[ERROR|tokenization_utils_base.py:1042] 2023-12-26 19:04:00,614 >> Using pad_token, but it is not set yet.
[INFO|trainer.py:543] 2023-12-26 19:04:07,514 >> max_steps is given, it will override any value given in num_train_epochs
[INFO|trainer.py:597] 2023-12-26 19:04:07,515 >> Using cuda_amp half precision backend
[INFO|trainer.py:1740] 2023-12-26 19:04:12,755 >> ***** Running training *****
[INFO|trainer.py:1741] 2023-12-26 19:04:12,755 >>   Num examples = 102400000
[INFO|trainer.py:1742] 2023-12-26 19:04:12,755 >>   Num Epochs = 9223372036854775807
[INFO|trainer.py:1743] 2023-12-26 19:04:12,755 >>   Instantaneous batch size per device = 64
[INFO|trainer.py:1744] 2023-12-26 19:04:12,755 >>   Total train batch size (w. parallel, distributed & accumulation) = 512
[INFO|trainer.py:1745] 2023-12-26 19:04:12,755 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1746] 2023-12-26 19:04:12,755 >>   Total optimization steps = 200000
[INFO|trainer.py:1747] 2023-12-26 19:04:12,755 >>   Number of trainable parameters = 123671040

sangmichaelxie commented 9 months ago

Yes, it's because of the different datasets version. You could try pip install fsspec==2023.9.2 with the original datasets version.

The num_epochs value can be ignored; training terminates based on max_steps.
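
As a quick sanity check, something like this prints the installed versions to compare against the pins (the expected values in the comments are just the ones mentioned in this thread):

```python
# Print the installed versions so they can be compared with the pinned ones.
import datasets
import fsspec
import torch
import transformers

print("datasets:    ", datasets.__version__)      # setup.py pins 2.10.1
print("fsspec:      ", fsspec.__version__)        # 2023.9.2 suggested above
print("transformers:", transformers.__version__)
print("torch:       ", torch.__version__)
```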

Sniper970119 commented 9 months ago

Thanks for your help!! I can run run_pile_baseline120M.sh now, but something goes wrong when I run run_pile_doremi120M.sh; it looks like a failure in the backward pass.

File "doremi/train.py", line 405, in <module>
    main()
  File "doremi/train.py", line 342, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "xxx/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/xxx/doremi/doremi/trainer.py", line 386, in training_step
    loss.backward()
  File "/xxxu/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/xxxx.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [34,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

Sorry for asking so many run-related questions. I really can't resolve these issues.😭

Sniper970119 commented 9 months ago

All package versions match the ones in setup.py.

[screenshot of the installed package versions]

sangmichaelxie commented 9 months ago

I'm not totally sure, but sometimes these errors can be mitigated by uninstalling and doing a fresh install of torch / flash-attn. You could also try running on CPU or with CUDA_LAUNCH_BLOCKING=1 to see if it will give a better trace. It could also be worth checking your CUDA version vs. which CUDA version your pytorch is compiled with.
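
For example, a rough sketch of those checks (the run script name is just whichever script you are launching):

```python
import os

# CUDA_LAUNCH_BLOCKING must be set before CUDA is initialized; the most
# reliable way is to export it in the shell before launching, e.g.
#   CUDA_LAUNCH_BLOCKING=1 bash scripts/run_pile_doremi120M.sh
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

print("torch version:        ", torch.__version__)
print("compiled against CUDA:", torch.version.cuda)   # compare with the system CUDA toolkit/driver
print("cuda available:       ", torch.cuda.is_available())
```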

Sniper970119 commented 8 months ago

The code can't run in CPU mode, and I have rebuilt the conda environment several times. The error log contains many lines like:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [7,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [26,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [27,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [28,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [32,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [36,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

Is this expected? My CUDA version is 11.7 and torch.cuda.is_available() is True.

I can run the baseline successfully, but doremi120M fails.

sangmichaelxie commented 8 months ago

Do you have a more detailed trace? What part of the code raises this error?

Sniper970119 commented 8 months ago

Sure, here is the detailed error message, from a run on an 8×A100 server:

  File "doremi/train.py", line 405, in <module>
    main()
  File "doremi/train.py", line 342, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/xxx/code/opensource/doremi/doremi/trainer.py", line 351, in training_step
    curr_domain_weights = train_domain_weights[inputs['domain_ids']].unsqueeze(-1).expand_as(pertoken_loss).detach()
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [34,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [37,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [52,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [55,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [6,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [8,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [15,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [16,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [31,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
[The same traceback (doremi/train.py, then transformers trainer.py, ending at doremi/trainer.py line 351 in training_step with "RuntimeError: CUDA error: device-side assert triggered"), together with further IndexKernel.cu "index out of bounds" assertions and c10::Error abort dumps (including from the NCCL workCleanupLoop threads), is repeated by the remaining worker processes.]

wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: - 0.017 MB of 0.017 MB uploaded (0.000 MB deduped)
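
From the assertion, the failure is in the indexing train_domain_weights[inputs['domain_ids']], so presumably some domain id falls outside [0, num_domains). A small check like this (just a sketch; the names are taken from the traceback, not from the actual code) would make that explicit:

```python
import torch

def check_domain_ids(domain_ids: torch.Tensor, num_domains: int) -> None:
    """Raise a readable error if any domain id would index out of bounds."""
    ids = domain_ids.detach().to("cpu")
    if ids.numel() == 0:
        return
    lo, hi = int(ids.min()), int(ids.max())
    if lo < 0 or hi >= num_domains:
        raise ValueError(
            f"domain_ids out of range: min={lo}, max={hi}, "
            f"but there are only {num_domains} domain weights"
        )

# e.g. check_domain_ids(inputs['domain_ids'], len(train_domain_weights))
```
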
Richard-Wth commented 3 months ago

I faced the same issue, but it worked when I lowered the batch_size.