salesforce / GeDi

GeDi: Generative Discriminator Guided Sequence Generation
https://arxiv.org/abs/2009.06367
BSD 3-Clause "New" or "Revised" License

run_training hangs and fails with many different pytorch and transformers versions including 1.4, 2.8 #6

Closed · arccoxx closed this issue 3 years ago

arccoxx commented 3 years ago

https://colab.research.google.com/drive/1euHyzSE8vbVlDAAo91FHvL5TO5nYk_XI?usp=sharing

Above I have linked my Colab notebook. Inside you will find a failed attempt to train my own GeDi :( Could anyone help me get it up and running? I've now spent an embarrassing amount of time tumbling down the troubleshooting rabbit hole. The notebook hangs after run_training.sh:

2020-11-12 04:02:07.570717: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
/content/GeDi/scripts/run_training.sh: line 26: 1254 Segmentation fault (core dumped) python ../train_GeDi.py --task_name SST-2 --overwrite_output_dir --do_eval --do_train --logit_scale --data_dir ../data/AG-news --max_seq_length 192 --overwrite_cache --per_gpu_train_batch_size 4 --per_gpu_eval_batch_size 8 --learning_rate $lr --num_train_epochs 1.0 --output_dir ../topic_GeDi_retrained --model_type gpt2 --model_name_or_path gpt2-medium --genweight $lambda --logging_steps 500 --save_steps 5000000000 --code_0 false --code_1 true

I have tried pytorch 1.4 and transformers 2.8 as recommended. I have also tried pytorch 1.3, 1.5.1, and 1.6 with transformers 2.9.1 and 3.4; none of them work with the script as anticipated (instead of hanging, they throw errors). I think I'm getting close to running. Please help!
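For reference, here is a minimal environment sanity check (a diagnostic sketch, not part of the GeDi repo) that can be run in a notebook cell before the script, just to confirm which torch/transformers versions are actually installed and that CUDA is visible, since the segfault happens before any training output:

```python
# Diagnostic sketch (not from the GeDi repo): print the versions the Colab
# runtime is actually using and confirm a GPU is visible, since a segfault
# at startup is often a version/CUDA mismatch rather than a training bug.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```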

Awesome work thank you very much!

yugaljain1999 commented 3 years ago

@arccoxx you didn't include the --fp16 argument at the end of run_training.sh (line no. 26); just add --fp16 at the end of that line.

arccoxx commented 3 years ago

@yugaljain1999 Thank you for looking into this! I updated the Colab to use --fp16 just to be sure (to clarify the logic: I had also tried running the run_training.sh script as-is, which already included --fp16). After updating the Colab to use --fp16 I received this error:

Epoch: 0% 0/1 [00:00<?, ?it/s]
/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py:53: UserWarning: Output 0 of SplitBackward is a view and is being modified inplace. This view is an output of a function that returns multiple views. Inplace operators on such views are being deprecated and will be forbidden starting from version 1.8. Consider using the unsafe_ version of the function that produced this view or don't modify this view inplace. (Triggered internally at /pytorch/torch/csrc/autograd/variable.cpp:491.)
  return orig_fn(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py:53: UserWarning: Output 1 of SplitBackward is a view and is being modified inplace. This view is an output of a function that returns multiple views. Inplace operators on such views are being deprecated and will be forbidden starting from version 1.8. Consider using the unsafe_ version of the function that produced this view or don't modify this view inplace. (Triggered internally at /pytorch/torch/csrc/autograd/variable.cpp:491.)
  return orig_fn(*args, **kwargs)
Traceback (most recent call last):
  File "train_GeDi.py", line 1104, in <module>
    main()
  File "train_GeDi.py", line 1053, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "train_GeDi.py", line 357, in train
    loss_b *= loss_mask
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py", line 53, in wrapper
    return orig_fn(*args, **kwargs)
RuntimeError: diff_view_meta->output_nr_ == 0 INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/autograd/variable.cpp":363, please report a bug to PyTorch.
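For context, here is a minimal sketch (not GeDi code, just an illustration of the same error class) of how an in-place multiply on a view returned by torch.split can trip this warning, and the out-of-place form that avoids it:

```python
# Illustration only (not from the GeDi repo): the SplitBackward warnings above
# suggest loss_b is a view produced by a split-like op, and several torch
# versions object to in-place ops on such views during autograd.
import torch

x = torch.randn(6, 4, requires_grad=True)
a, b = torch.split(x, 3, dim=0)   # a and b are views of x
mask = torch.ones_like(b)

# Problematic pattern (what the traceback points at):
#   b *= mask   # in-place on a view from a multi-view op; warns or fails
#               # depending on the torch/apex combination

# Out-of-place equivalent that sidesteps the view restriction:
b = b * mask

loss = a.sum() + b.sum()
loss.backward()
print(x.grad.shape)  # gradients flow as expected
```

Changing loss_b *= loss_mask to an out-of-place multiply is the usual workaround for this class of error, though I can't say whether that is the right fix inside train_GeDi.py.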

So this is the current state of affairs regarding my attempt to get this awesome project to train. I really can't wait to dive in. When I ran this on a VM built from the suggested pytorch 1.4 devel image, I also received an error (with and without apex), so Python 3.7 compatibility issues don't explain these errors. If someone could help me get this running on Colab I'd be over the moon; I'm a student and I have some interesting creative application ideas for the zero-shot classification functionality of the model. Itchin to get scriptin. Any and all help will be greatly... greatly appreciated.

Thanks!

Aidan

arccoxx commented 3 years ago

@yugaljain1999 One more thing: when running pytorch 1.4 and torchvision 0.5 I receive a different error:

2020-11-21 21:26:28.970378: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1

It then terminates. Again, running this with the Docker image doesn't help either. Please let me know if you can think of anything, and I'm happy to provide more info. Thanks so much!

yugaljain1999 commented 3 years ago

@arccoxx I have successfully trained this model on custom data in Colab. Here is the link - https://colab.research.google.com/drive/1UIo62HIa3gRsn-IqVUbXb-W2B2eZc2ZV?usp=sharing

arccoxx commented 3 years ago

@yugaljain1999 Thank you very much for the reference. I still cannot get this to work. Could you link the CSV you used with PM Modi's speeches? I tried using my own data, but it threw an error because the reformatted files were not in .tsv format. How did you get around this issue? Also, when trying to run the default run_training.sh with the get_data.sh data, it fails with the internal assert error:

RuntimeError: diff_view_meta->output_nr_ == 0 INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/autograd/variable.cpp":363, please report a bug to PyTorch.

I'll Venmo $10 for this to work. This is very important to the work I'm doing, and I can't get someone to help me with it in person due to covid.

https://colab.research.google.com/drive/1EHcYtNCNcWlUodp6equg33MwbnsJUncC?usp=sharing
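For reference, here is a rough CSV-to-TSV conversion sketch (not from the GeDi repo; the two-column sentence/label layout is an assumption based on the SST-2-style data the training command points at, and "text" / "label" are placeholder column names for the source CSV):

```python
# Hypothetical CSV -> TSV conversion sketch. Assumes the training data should
# end up as "sentence<TAB>label" rows with a header, similar to SST-2-style
# files; "my_data.csv", "text", and "label" are placeholders, not repo names.
import csv

with open("my_data.csv", newline="", encoding="utf-8") as f_in, \
     open("train.tsv", "w", newline="", encoding="utf-8") as f_out:
    reader = csv.DictReader(f_in)
    writer = csv.writer(f_out, delimiter="\t")
    writer.writerow(["sentence", "label"])
    for row in reader:
        # Collapse tabs/newlines inside the text so they don't break the TSV.
        sentence = " ".join(row["text"].split())
        writer.writerow([sentence, row["label"]])
```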

yugaljain1999 commented 3 years ago

@arccoxx I will look into your code. Until then, send me your contact details at my mail id - yugal.jain1999@gmail.com so we can discuss this further. Don't panic.

arccoxx commented 3 years ago

@yugaljain1999 Thank you! Breathing a sigh of relief

akhileshgotmare commented 3 years ago

@arccoxx @yugaljain1999 were you able to resolve this?

arccoxx commented 3 years ago

yes!

Thank you Yugal! Moving and grooving on the training. Any advice on super cheap GPU access (like if you know of anything special or industrial pro-rated)? I now need to train the GeDi, and it's bumping into Colab runtime timeout issues. If you have a few minutes I'd contract you to train it (the data is set up; there have just been a lot of annoying mishaps during training), only for a few bucks tbh, because I can just continue to run it in the background. I'm wrapping up academic projects; when those are done I'll be jumping into more private research. I like to keep my team flexible, with the wishful thinking that we're like a think tank. The goal is to focus on inventive projects that might produce novel solutions instead of projects that provide incremental innovation, the philosophy being that these problems are fairer bets when competing against the data "big guns."

We had discussed working together going forward. If you're still interested would you mind sending me a resume/some work of yours?

The help you provided made a real difference, and thank you for reaching out again.

Best,

Aidan Collins

yugaljain1999 commented 3 years ago

@arccoxx Yeah, sure, it would be great to be a part of your private research projects. I am looking forward to that, and I have sent you a mail about this as well. Send me your contact number at my mail - yugal.jain1999@gmail.com so that I can help you train GeDi on your custom data on my GPU, and we can surely discuss all these things there. Regards!

ShivamSharma1997 commented 3 years ago

@arccoxx @akhileshgotmare @yugaljain1999 Can you help me with the INTERNAL ASSERT FAILED issue? I am facing it as well; how did you solve it?

yugaljain1999 commented 3 years ago

@ShivamSharma1997 Using PyTorch 1.6 should solve your issue.

ShivamSharma1997 commented 3 years ago

@yugaljain1999 @akhileshgotmare Thanks, it did solve the problem, but now I am getting the error

RuntimeError: CUDA out of memory. Tried to allocate 3.07 GiB (GPU 0; 22.38 GiB total capacity; 20.89 GiB already allocated; 321.56 MiB free; 21.46 GiB reserved in total by PyTorch)

I have multiple GPUs, each of which meets the GPU requirements mentioned in the README.

Can you help with this, please?

yugaljain1999 commented 3 years ago

@ShivamSharma1997 Try a batch size of 2 (i.e. --per_gpu_train_batch_size 2 in run_training.sh).
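If it still runs out of memory after lowering the batch size, a quick per-GPU memory snapshot (a diagnostic sketch, not part of GeDi) can show whether the run is actually spreading across your GPUs or piling everything onto GPU 0, which is the only device the OOM message above mentions:

```python
# Diagnostic sketch (not from the GeDi repo): print allocated/reserved/total
# memory for every visible GPU. Assumes torch >= 1.6, per the suggestion above.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    total = props.total_memory / 1024**3
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"GPU {i} ({props.name}): {allocated:.2f} GiB allocated, "
          f"{reserved:.2f} GiB reserved, {total:.2f} GiB total")
```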