Open Sniper970119 opened 9 months ago
Just pushed an update that should fix the import issue. Could you add more details on the preprocess issue?
Thanks for your help! I will try it later.
The preprocessing issue is as follows (it runs successfully after updating datasets to 2.15.0):
Traceback (most recent call last):
File "scripts/preprocess_pile.py", line 135, in <module>
main()
File "scripts/preprocess_pile.py", line 87, in main
ds = load_dataset('json',
File "/xxxx/zy_doremi/lib/python3.8/site-packages/datasets/load.py", line 1775, in load_dataset
return builder_instance.as_streaming_dataset(split=split)
File "/sxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/builder.py", line 1234, in as_streaming_dataset
raise NotImplementedError(
NotImplementedError: Loading a streaming dataset cached in a LocalFileSystem is not supported yet.
Hello, I get another error when I run the latest code.
File "doremi/train.py", line 405, in <module>
main()
File "doremi/train.py", line 342, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/xxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/sxxxxu/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1872, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/xxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
data = self._next_data()
File "/xxxx.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
return self._process_data(data)
File "/xxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
data.reraise()
File "/sxxxxxx.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/_utils.py", line 644, in reraise
raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/xxxxu/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
data.append(next(self.dataset_iter))
File "/xxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 807, in __iter__
for element in self.dataset:
File "/xxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1358, in __iter__
yield from self._iter_pytorch()
File "/sxxxxu/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1293, in _iter_pytorch
for key, example in ex_iterable:
File "/xxxxu/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 233, in __iter__
yield from self.generate_examples_fn(**self.kwargs)
File "/xxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/packaged_modules/generator/generator.py", line 30, in _generate_examples
for idx, ex in enumerate(self.config.generator(**gen_kwargs)):
File "/sxxxx/doremi/doremi/dataloader.py", line 339, in take_data_generator
for ex in ds:
File "/xxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1358, in __iter__
yield from self._iter_pytorch()
File "/sxxxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1274, in _iter_pytorch
ex_iterable = ex_iterable.shard_data_sources(worker_id=worker_info.id, num_workers=worker_info.num_workers)
TypeError: shard_data_sources() got an unexpected keyword argument 'worker_id'
Wandb initializes, and then I get this error. I am running the code on a machine with 8 A100 GPUs. Is something wrong with my config, or am I using a wrong package version (maybe the version of datasets)?
Also, the training log shows a huge number of epochs, but the number of epochs in training_args is still 3.
[ERROR|tokenization_utils_base.py:1042] 2023-12-26 19:04:00,614 >> Using pad_token, but it is not set yet.
[INFO|trainer.py:543] 2023-12-26 19:04:07,514 >> max_steps is given, it will override any value given in num_train_epochs
[INFO|trainer.py:597] 2023-12-26 19:04:07,515 >> Using cuda_amp half precision backend
[INFO|trainer.py:1740] 2023-12-26 19:04:12,755 >> ***** Running training *****
[INFO|trainer.py:1741] 2023-12-26 19:04:12,755 >> Num examples = 102400000
[INFO|trainer.py:1742] 2023-12-26 19:04:12,755 >> Num Epochs = 9223372036854775807
[INFO|trainer.py:1743] 2023-12-26 19:04:12,755 >> Instantaneous batch size per device = 64
[INFO|trainer.py:1744] 2023-12-26 19:04:12,755 >> Total train batch size (w. parallel, distributed & accumulation) = 512
[INFO|trainer.py:1745] 2023-12-26 19:04:12,755 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1746] 2023-12-26 19:04:12,755 >> Total optimization steps = 200000
[INFO|trainer.py:1747] 2023-12-26 19:04:12,755 >> Number of trainable parameters = 123671040
Yes, it's because of the different version of datasets. You could try doing pip install fsspec==2023.9.2 with the original datasets version.
The num_epochs should be ignored - the training terminates on steps.
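For context, the numbers in the log above are consistent with step-based termination. The logged `Num Epochs` equals the largest signed 64-bit integer, and `Num examples` equals the total optimization steps times the total train batch size. A plausible reading, stated here as an assumption and not confirmed in this thread, is that the epoch count is only an "unbounded" placeholder when `max_steps` decides the stopping point. A quick check of the arithmetic, using only values from the log:

```python
import sys

# Values copied from the training log above.
logged_num_epochs = 9223372036854775807
logged_num_examples = 102400000
total_optimization_steps = 200000
total_train_batch_size = 512

# Largest signed 64-bit integer.
max_int64 = 2**63 - 1

# The logged epoch count is exactly that sentinel-like value.
print(logged_num_epochs == max_int64)

# The example count is just steps * batch size, so training is bounded by steps.
print(total_optimization_steps * total_train_batch_size == logged_num_examples)

# On a typical 64-bit CPython build, sys.maxsize has the same value.
print(sys.maxsize)
```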
Thanks for your help! I can run run_pile_baseline120M.sh now.
But something goes wrong when I run run_pile_doremi120M.sh; it seems to fail in the backward pass.
File "doremi/train.py", line 405, in <module>
main()
File "doremi/train.py", line 342, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "xxx/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/xxx/doremi/doremi/trainer.py", line 386, in training_step
loss.backward()
File "/xxxu/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/xxxx.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [34,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Sorry for asking so many run-related questions. I really can't resolve these issues.😭
All package versions are the same as in setup.py.
I'm not totally sure, but sometimes these errors can be mitigated by uninstalling and doing a fresh install of torch / flash-attn. You could also try running on CPU or with CUDA_LAUNCH_BLOCKING=1 to see if it will give a better trace. It could also be worth checking your CUDA version vs. which CUDA version your pytorch is compiled with.
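A minimal sketch of the two checks suggested above, assuming the script is launched from Python: it sets `CUDA_LAUNCH_BLOCKING=1` before any CUDA work and reports which CUDA version the installed PyTorch build was compiled with. The helper name `describe_cuda_env` is hypothetical and only for illustration.

```python
import os

# Must be set before the first CUDA call so kernel launches run synchronously
# and the Python traceback points at the operation that actually failed.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def describe_cuda_env():
    """Collect the version facts worth comparing (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}
    return {
        "torch": torch.__version__,
        # CUDA version this PyTorch build was compiled against (None for CPU builds).
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }


print(describe_cuda_env())
```

The `cuda_build` value can then be compared by hand against the toolkit version reported by `nvcc --version` on the machine.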
The code can't run in CPU mode, and I have rebuilt the conda environment several times. It produces a lot of error log lines like these:
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [7,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [26,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [27,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [28,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [32,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [36,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Is this right? My CUDA version is 11.7 and torch.cuda.is_available() returns True.
I can run the baseline successfully, but doremi120M fails.
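The assertion text in these log lines states the condition being checked: every index must satisfy `-size <= index < size` for the dimension being indexed. As a rough sketch in plain Python (the names `out_of_bounds`, `table_size`, and `batch_ids` are hypothetical, and the values are made up for illustration), the same condition can be applied to a batch of ids before they reach the GPU, to see which values would trip the assert:

```python
def out_of_bounds(indices, size):
    """Return the indices that violate -size <= index < size."""
    return [i for i in indices if not (-size <= i < size)]


# Hypothetical example: a lookup table with 22 entries and a batch of ids.
table_size = 22
batch_ids = [0, 5, 21, 22, -1, 40]

# 22 and 40 are too large; -1 is a valid negative index.
print(out_of_bounds(batch_ids, table_size))  # [22, 40]
```

Running a check like this on the CPU side tends to give a readable list of offending values, whereas the device-side assert only reports that some thread saw a bad index.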
Do you have a more detailed trace? What part of the code raises this error?
Sure, here is the detailed error message, from a run on an 8×A100 server.
File "doremi/train.py", line 405, in <module>
main()
File "doremi/train.py", line 342, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/xxx/code/opensource/doremi/doremi/trainer.py", line 351, in training_step
curr_domain_weights = train_domain_weights[inputs['domain_ids']].unsqueeze(-1).expand_as(pertoken_loss).detach()
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [34,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [37,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [52,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [55,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [6,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [8,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [15,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [16,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [31,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
File "doremi/train.py", line 405, in <module>
main()
File "doremi/train.py", line 342, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/xxx/code/opensource/doremi/doremi/trainer.py", line 351, in training_step
curr_domain_weights = train_domain_weights[inputs['domain_ids']].unsqueeze(-1).expand_as(pertoken_loss).detach()
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [14,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [16,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [19,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [25,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [39,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [47,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [50,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [59,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
File "doremi/train.py", line 405, in <module>
main()
File "doremi/train.py", line 342, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/xxx/code/opensource/doremi/doremi/trainer.py", line 351, in training_step
curr_domain_weights = train_domain_weights[inputs['domain_ids']].unsqueeze(-1).expand_as(pertoken_loss).detach()
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [3,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [12,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [14,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [23,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [28,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [34,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [36,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [43,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [54,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [56,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [60,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [63,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb982d8c4d7 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fb982d5636b in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fb982e30fa8 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x137bb (0x7fb982e017bb in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x22d80 (0x7fb982e10d80 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4ccb86 (0x7fb9c22c2b86 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3ee77 (0x7fb982d71e77 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7fb982d6a69e in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fb982d6a7b9 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x752188 (0x7fb9c2548188 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7fb9c2548535 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: /xxx/.conda/envs/zy_doremi/bin/python() [0x50243d]
frame #12: /xxx/.conda/envs/zy_doremi/bin/python() [0x4e0970]
frame #13: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1828]
frame #14: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #15: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #16: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #17: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #18: /xxx/.conda/envs/zy_doremi/bin/python() [0x4c9310]
frame #19: PyDict_SetItemString + 0x52 (0x581a82 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #20: PyImport_Cleanup + 0x93 (0x5a6cb3 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #21: Py_FinalizeEx + 0x71 (0x5a5de1 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #22: Py_RunMain + 0x112 (0x5a1ab2 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #23: Py_BytesMain + 0x39 (0x579e89 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #24: __libc_start_main + 0xf3 (0x7fba0928a6a3 in /lib64/libc.so.6)
frame #25: /xxx/.conda/envs/zy_doremi/bin/python() [0x579d3d]
Traceback (most recent call last):
File "doremi/train.py", line 405, in <module>
main()
File "doremi/train.py", line 342, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/xxx/code/opensource/doremi/doremi/trainer.py", line 351, in training_step
curr_domain_weights = train_domain_weights[inputs['domain_ids']].unsqueeze(-1).expand_as(pertoken_loss).detach()
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc421b4e4d7 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc421b1836b in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc421bf2fa8 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x137bb (0x7fc421bc37bb in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x22d80 (0x7fc421bd2d80 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4ccb86 (0x7fc461084b86 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3ee77 (0x7fc421b33e77 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7fc421b2c69e in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fc421b2c7b9 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x752188 (0x7fc46130a188 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7fc46130a535 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: /xxx/.conda/envs/zy_doremi/bin/python() [0x50243d]
frame #12: /xxx/.conda/envs/zy_doremi/bin/python() [0x4e0970]
frame #13: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1828]
frame #14: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #15: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #16: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #17: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #18: /xxx/.conda/envs/zy_doremi/bin/python() [0x4c9310]
frame #19: PyDict_SetItemString + 0x52 (0x581a82 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #20: PyImport_Cleanup + 0x93 (0x5a6cb3 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #21: Py_FinalizeEx + 0x71 (0x5a5de1 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #22: Py_RunMain + 0x112 (0x5a1ab2 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #23: Py_BytesMain + 0x39 (0x579e89 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #24: __libc_start_main + 0xf3 (0x7fc4a804f6a3 in /lib64/libc.so.6)
frame #25: /xxx/.conda/envs/zy_doremi/bin/python() [0x579d3d]
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [35,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [44,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [50,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [56,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [57,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [58,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [12,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [16,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [23,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [27,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [30,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [14,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [25,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [26,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [36,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [37,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [48,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [53,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [59,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
File "doremi/train.py", line 405, in <module>
main()
File "doremi/train.py", line 342, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/xxx/code/opensource/doremi/doremi/trainer.py", line 351, in training_step
curr_domain_weights = train_domain_weights[inputs['domain_ids']].unsqueeze(-1).expand_as(pertoken_loss).detach()
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f19ad7564d7 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f19ad72036b in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f19ad7fafa8 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x137bb (0x7f19ad7cb7bb in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x22d80 (0x7f19ad7dad80 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4ccb86 (0x7f19ecc8cb86 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3ee77 (0x7f19ad73be77 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f19ad73469e in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f19ad7347b9 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x752188 (0x7f19ecf12188 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7f19ecf12535 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: /xxx/.conda/envs/zy_doremi/bin/python() [0x50243d]
frame #12: /xxx/.conda/envs/zy_doremi/bin/python() [0x4e0970]
frame #13: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1828]
frame #14: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #15: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #16: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #17: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #18: /xxx/.conda/envs/zy_doremi/bin/python() [0x4c9310]
frame #19: PyDict_SetItemString + 0x52 (0x581a82 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #20: PyImport_Cleanup + 0x93 (0x5a6cb3 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #21: Py_FinalizeEx + 0x71 (0x5a5de1 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #22: Py_RunMain + 0x112 (0x5a1ab2 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #23: Py_BytesMain + 0x39 (0x579e89 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #24: __libc_start_main + 0xf3 (0x7f1a33c516a3 in /lib64/libc.so.6)
frame #25: /xxx/.conda/envs/zy_doremi/bin/python() [0x579d3d]
Traceback (most recent call last):
File "doremi/train.py", line 405, in <module>
main()
File "doremi/train.py", line 342, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb1eefcb4d7 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fb1eef9536b in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fb1ef06ffa8 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x137bb (0x7fb1ef0407bb in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x22d80 (0x7fb1ef04fd80 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4ccb86 (0x7fb22e501b86 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3ee77 (0x7fb1eefb0e77 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7fb1eefa969e in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fb1eefa97b9 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x752188 (0x7fb22e787188 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7fb22e787535 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: /xxx/.conda/envs/zy_doremi/bin/python() [0x50243d]
frame #12: /xxx/.conda/envs/zy_doremi/bin/python() [0x4e0970]
frame #13: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1828]
frame #14: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #15: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #16: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #17: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #18: /xxx/.conda/envs/zy_doremi/bin/python() [0x4c9310]
frame #19: PyDict_SetItemString + 0x52 (0x581a82 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #20: PyImport_Cleanup + 0x93 (0x5a6cb3 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #21: Py_FinalizeEx + 0x71 (0x5a5de1 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #22: Py_RunMain + 0x112 (0x5a1ab2 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #23: Py_BytesMain + 0x39 (0x579e89 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #24: __libc_start_main + 0xf3 (0x7fb2754c66a3 in /lib64/libc.so.6)
frame #25: /xxx/.conda/envs/zy_doremi/bin/python() [0x579d3d]
tr_loss_step = self.training_step(model, inputs)
File "/xxx/code/opensource/doremi/doremi/trainer.py", line 351, in training_step
curr_domain_weights = train_domain_weights[inputs['domain_ids']].unsqueeze(-1).expand_as(pertoken_loss).detach()
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f545cb834d7 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f545cb4d36b in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f545cc27fa8 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f545db0f410 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f545db129e8 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x227 (0x7f545db13f37 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xc2b73 (0x7f54db71bb73 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x82de (0x7f54e3bb32de in /lib64/libpthread.so.0)
frame #8: clone + 0x43 (0x7f54e315ae83 in /lib64/libc.so.6)
Traceback (most recent call last):
File "doremi/train.py", line 405, in <module>
main()
File "doremi/train.py", line 342, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/xxx/code/opensource/doremi/doremi/trainer.py", line 351, in training_step
curr_domain_weights = train_domain_weights[inputs['domain_ids']].unsqueeze(-1).expand_as(pertoken_loss).detach()
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f44b65cf4d7 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f44b659936b in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f44b6673fa8 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f44b755b410 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f44b755e9e8 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x227 (0x7f44b755ff37 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xc2b73 (0x7f4535197b73 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x82de (0x7f453d62f2de in /lib64/libpthread.so.0)
frame #8: clone + 0x43 (0x7f453cbd6e83 in /lib64/libc.so.6)
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6106bf04d7 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f6106bba36b in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f6106c94fa8 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f6107b7c410 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f6107b7f9e8 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x227 (0x7f6107b80f37 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xc2b73 (0x7f618578ab73 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x82de (0x7f618dc222de in /lib64/libpthread.so.0)
frame #8: clone + 0x43 (0x7f618d1c9e83 in /lib64/libc.so.6)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2cdf7224d7 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f2cdf6ec36b in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f2cdf7c6fa8 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x137bb (0x7f2cdf7977bb in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x22d80 (0x7f2cdf7a6d80 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4ccb86 (0x7f2d1ec58b86 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3ee77 (0x7f2cdf707e77 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f2cdf70069e in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f2cdf7007b9 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x752188 (0x7f2d1eede188 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7f2d1eede535 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: /xxx/.conda/envs/zy_doremi/bin/python() [0x50243d]
frame #12: /xxx/.conda/envs/zy_doremi/bin/python() [0x4e0970]
frame #13: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1828]
frame #14: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #15: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #16: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #17: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #18: /xxx/.conda/envs/zy_doremi/bin/python() [0x4c9310]
frame #19: PyDict_SetItemString + 0x52 (0x581a82 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #20: PyImport_Cleanup + 0x93 (0x5a6cb3 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #21: Py_FinalizeEx + 0x71 (0x5a5de1 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #22: Py_RunMain + 0x112 (0x5a1ab2 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #23: Py_BytesMain + 0x39 (0x579e89 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #24: __libc_start_main + 0xf3 (0x7f2d65c1e6a3 in /lib64/libc.so.6)
frame #25: /xxx/.conda/envs/zy_doremi/bin/python() [0x579d3d]
wandb: - 0.017 MB of 0.017 MB uploaded (0.000 MB deduped)
I face the same issue, but it worked when I lowered batch_size.
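The Python traceback above ends at the indexing line `train_domain_weights[inputs['domain_ids']]` in doremi/trainer.py. A device-side assert on an indexing operation is often caused by an out-of-range index, so one thing worth checking is whether every domain id is smaller than the number of domain weights. The helper below is only a sketch under that assumption. `find_bad_domain_ids` is a hypothetical name, not part of the repo. It also flags negative ids, which PyTorch indexing would accept, because a domain id is presumably non-negative.

```python
def find_bad_domain_ids(domain_ids, num_domains):
    """Return the sorted, de-duplicated ids that fall outside [0, num_domains)."""
    return sorted({int(i) for i in domain_ids if not 0 <= int(i) < num_domains})


# Hypothetical usage just before the failing line in training_step
# (variable names are taken from the traceback):
#   bad = find_bad_domain_ids(inputs['domain_ids'].flatten().tolist(),
#                             len(train_domain_weights))
#   assert not bad, f"domain_ids out of range: {bad}"
```

Running with the environment variable `CUDA_LAUNCH_BLOCKING=1` also makes CUDA errors surface at the call that triggered them, which may help confirm whether this line is the real culprit.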
I ran
bash scripts/setup_flash.sh
without error (though it took only a few minutes), but I got an error message when I ran
bash scripts/run_pile.sh
What is wrong here?
Also, I hit an error with
bash scripts/run_preprocess_pile.sh
until I updated the datasets package
to 2.15.0, but the version in setup.py
is 2.10.1. Did I do something wrong?
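To see which datasets version is actually installed in the environment, and how it compares with the 2.10.1 pin and the 2.15.0 that worked, a small standard-library check like the one below could be used. It is a rough sketch: `parse_version` and `installed_version` are hypothetical helpers, and the comparison only looks at the first three numeric components of the version string.

```python
import re
from importlib import metadata


def parse_version(version):
    """Turn a version string such as '2.15.0' into a comparable tuple (2, 15, 0)."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("datasets")
    if found is None:
        print("datasets is not installed")
    else:
        # 2.15.0 is the version reported to work in this thread.
        ok = parse_version(found) >= parse_version("2.15.0")
        print(f"datasets {found} installed; at least 2.15.0: {ok}")
```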