princeton-nlp / CoFiPruning

[ACL 2022] Structured Pruning Learns Compact and Accurate Models https://arxiv.org/abs/2204.00408
MIT License

Error occurs when training is about to end #57

Open zll0000 opened 10 months ago

zll0000 commented 10 months ago

/opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [2,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [3,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "./run_glue_prune.py", line 394, in <module>
    main()
  File "./run_glue_prune.py", line 385, in main
    trainer.train()
  File "/bit_share//LLM/Fitune_LLM/model_pruning/CoFiPruning/trainer/trainer.py", line 285, in train
    loss_terms = self.training_step(model, inputs)
  File "/bit_share/zhangxiaolei/LLM/Fitune_LLM/model_pruning/CoFiPruning/trainer/trainer.py", line 704, in training_step
    loss.backward()
  File "/data03//anaconda3/envs/llmprune/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/data03//anaconda3/envs/llmprune/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered

xiamengzhou commented 10 months ago

Hi, this most likely looks like a data issue. Could you check that the input_ids are all within the vocabulary size?
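
A minimal sketch of that check, assuming HF-style batches with a 2-D `input_ids` tensor and the standard Hugging Face `model.config.vocab_size` attribute; the helper name `check_input_ids` is hypothetical, not part of CoFiPruning:

```python
import torch

def check_input_ids(input_ids: torch.Tensor, vocab_size: int) -> None:
    """Raise if any token id in a (batch, seq_len) tensor falls outside [0, vocab_size)."""
    bad = (input_ids < 0) | (input_ids >= vocab_size)
    if bad.any():
        rows = bad.any(dim=-1).nonzero(as_tuple=True)[0].tolist()
        raise ValueError(
            f"rows {rows} contain out-of-range token ids: "
            f"{input_ids[bad].unique().tolist()}"
        )

# Hypothetical usage: call this on each batch before the forward pass,
# e.g. at the top of training_step in trainer/trainer.py:
#   check_input_ids(inputs["input_ids"], model.config.vocab_size)
```

Since the device-side assert fires asynchronously, rerunning with `CUDA_LAUNCH_BLOCKING=1` will also make the Python stack trace point at the op that actually failed.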