Open arccoxx opened 3 years ago
@arccoxx I can't replicate this error. Which GPU are you using, if any?
There are some StackOverflow discussions on this, e.g. https://stackoverflow.com/questions/13654449/error-segmentation-fault-core-dumped (see the 2nd most upvoted answer), which suggest that it might be a (CPU) RAM issue. Does reducing the --per_gpu_train_batch_size argument to 1 or 2 help?
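One rough way to test the RAM hypothesis is to log the process's peak memory use before the crash. The sketch below uses only the standard library and assumes a Unix-like system. `peak_rss_mib` is a hypothetical helper name, not something from the GeDi repo, and where to call it inside train_GeDi.py is left open.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    # Hypothetical usage: call this after loading the model and after the
    # first few training steps to see how close the process gets to the
    # instance's RAM limit.
    print(f"peak RSS: {peak_rss_mib():.1f} MiB", file=sys.stderr)
```

If the reported peak approaches the instance's total RAM shortly before the crash, that would be consistent with a memory problem, although it would not prove it.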
There's also a discussion in the PyTorch issues that might be helpful: https://github.com/pytorch/pytorch/issues/926.
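To narrow down where the crash happens, Python's built-in `faulthandler` module can print the Python traceback when the interpreter receives SIGSEGV. This is a minimal sketch, assuming it is placed near the top of the training script. `enable_segfault_traceback` is a hypothetical helper name, not part of GeDi or PyTorch.

```python
import faulthandler
import sys


def enable_segfault_traceback(stream=sys.stderr):
    """Dump Python tracebacks to `stream` on fatal signals such as SIGSEGV."""
    # all_threads=True includes every running thread in the dump, which
    # helps when the crash happens in a worker thread.
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    enable_segfault_traceback()
    # ... the rest of the training script would run here ...
```

The same effect is available without editing any code by starting the interpreter with `python -X faulthandler` or by setting the `PYTHONFAULTHANDLER=1` environment variable. The traceback only shows the Python frame that was active, so a crash inside a native extension still needs a debugger such as gdb to pinpoint.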
I recently created an instance with all the suggested requirements installed. When running the default run_training script, I received a "Segmentation fault" error thrown by line 27 of run_training.sh, which specifies the '--fp16' flag.
I then modified the setup to exclude apex and received this error:
run_training_2.sh: line 25: 1251 Segmentation fault (core dumped) python ../train_GeDi.py --task_name SST-2 --overwrite_output_dir --do_eval --do_train --logit_scale --data_dir ../data/AG-news --max_seq_length 192 --overwrite_cache --per_gpu_train_batch_size 4 --per_gpu_eval_batch_size 8 --learning_rate $lr --num_train_epochs 1.0 --output_dir ../topic_GeDi_retrained --model_type gpt2 --model_name_or_path gpt2-medium --genweight $lambda --logging_steps 500 --save_steps 5000000000 --code_0 false --code_1 true
Any thoughts on how to rectify this issue? Many thanks,
Aidan