salesforce / GeDi

GeDi: Generative Discriminator Guided Sequence Generation
https://arxiv.org/abs/2009.06367
BSD 3-Clause "New" or "Revised" License

Segmentation fault error run_training.sh #7

Open arccoxx opened 3 years ago

arccoxx commented 3 years ago

I recently created an instance with all the suggested requirements installed. When running the default run_training.sh script, I received a "Segmentation fault" error thrown by line 27 of run_training.sh, which specifies the '--fp16' flag.

I then modified the setup to exclude apex and received this error:

run_training_2.sh: line 25: 1251 Segmentation fault (core dumped) python ../train_GeDi.py --task_name SST-2 --overwrite_output_dir --do_eval --do_train --logit_scale --data_dir ../data/AG-news --max_seq_length 192 --overwrite_cache --per_gpu_train_batch_size 4 --per_gpu_eval_batch_size 8 --learning_rate $lr --num_train_epochs 1.0 --output_dir ../topic_GeDi_retrained --model_type gpt2 --model_name_or_path gpt2-medium --genweight $lambda --logging_steps 500 --save_steps 5000000000 --code_0 false --code_1 true

Any thoughts on how to rectify this issue? Many thanks,

Aidan

akhileshgotmare commented 3 years ago

@arccoxx I can't replicate this error. Which GPU are you using, if any?
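To answer the GPU question in a reproducible way, a short script along these lines can collect the relevant details for pasting into the thread. It is only a sketch: the helper name `describe_environment` is made up for illustration, and it assumes PyTorch is importable in the same environment used for training (it falls back gracefully if not).

```python
import platform
import sys


def describe_environment():
    """Collect basic environment facts worth pasting into a bug report."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    try:
        # Assumption: torch is installed in the training environment.
        import torch
    except ImportError:
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        # One entry per visible CUDA device, e.g. the GPU model name.
        info["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running it inside the same instance that produces the segmentation fault shows whether a GPU is visible to PyTorch at all, which helps narrow down whether the crash is GPU-related.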

There are some Stack Overflow discussions on this, e.g. https://stackoverflow.com/questions/13654449/error-segmentation-fault-core-dumped (see the 2nd most upvoted answer), which suggest that it might be a (CPU) RAM issue. Does reducing the --per_gpu_train_batch_size argument to 1 or 2 help?

There's also a discussion in the PyTorch issues that might be helpful: https://github.com/pytorch/pytorch/issues/926.
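If the cause is still unclear, Python's standard `faulthandler` module can dump the Python-level stack at the moment of the crash, which at least shows which call triggered the segmentation fault. The sketch below is an assumption about how one might wire it in, not something from the GeDi repository: the helper name `enable_crash_traces` and the log file name are hypothetical.

```python
import faulthandler


def enable_crash_traces(log_path="segfault_trace.log"):
    """Dump Python tracebacks to log_path when a fatal signal (e.g. SIGSEGV) arrives.

    The returned file object must stay open for as long as the handler is
    enabled, because faulthandler writes to its file descriptor at crash time.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Hypothetical placement: call this before the training code starts.
    trace_file = enable_crash_traces()
    print("faulthandler enabled:", faulthandler.is_enabled())
```

The same effect is available without editing any code by starting the interpreter with `python -X faulthandler`, followed by the script and arguments from run_training.sh; in that case the traceback goes to stderr.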