huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RuntimeError: unique_by_key: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #30976

Closed HackXieHao closed 4 months ago

HackXieHao commented 5 months ago

I use run_mlm.py for continuous pre-training, with the following settings:

```shell
CUDA_VISIBLE_DEVICES=0 python run_mlm.py \
    --model_name_or_path intfloat/multilingual-e5-small \
    --train_file data/train.txt \
    --validation_file data/val.txt \
    --max_seq_length 256 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --warmup_steps 1000 \
    --do_train \
    --do_eval \
    --line_by_line \
    --num_train_epochs 20 \
    --save_total_limit 5 \
    --evaluation_strategy steps \
    --eval_steps 2000 \
    --save_steps 20000 \
    --output_dir text_emb_train/output/models/e5_pt_20240522 \
    --logging_steps 500 \
    --logging_dir text_emb_train/output/logs/e5_pt_20240522
```

Transformers version: 4.36.1

I encountered the following error (screenshot of the traceback): RuntimeError: unique_by_key: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

And the GPU memory usage changes abnormally (screenshot of GPU memory usage).

Can anyone help to solve this problem? Thank you!
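For readers hitting the same error: a cudaErrorIllegalAddress raised from a CUDA kernel is often caused by an out-of-range index, for example a token id that exceeds the model's embedding table (which can happen when the tokenizer vocabulary and the checkpoint's embedding size disagree). Running the same lookup on CPU turns the asynchronous CUDA failure into a plain Python exception, which makes the bad index easy to spot. This is a minimal sketch of that failure mode, not taken from the issue; the sizes and token ids are illustrative:

```python
import torch

# A toy embedding table standing in for a model's word embeddings.
vocab_size = 10
embedding = torch.nn.Embedding(vocab_size, 4)

# Token id 12 lies outside [0, vocab_size). On CPU this raises IndexError;
# on GPU the same lookup can surface later as cudaErrorIllegalAddress.
token_ids = torch.tensor([3, 9, 12])
try:
    embedding(token_ids)
except IndexError as err:
    print(f"out-of-range token id detected: {err}")
```

A quick sanity check along these lines is comparing `len(tokenizer)` against the model's `get_input_embeddings().weight.shape[0]` before training.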

amyeroberts commented 5 months ago

cc @ArthurZucker @younesbelkada

ArthurZucker commented 5 months ago

Hey! Could you try with a more recent version of transformers? 🤗

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.