salesforce / GeDi

GeDi: Generative Discriminator Guided Sequence Generation
https://arxiv.org/abs/2009.06367
BSD 3-Clause "New" or "Revised" License

Apex issue #14

Open InaamHassan opened 2 years ago

InaamHassan commented 2 years ago

When I run "!bash run_training.sh" after "%cd scripts", I get the following error.

`09/03/2021 08:57:05 - INFO - __main__ - Saving features into cached file ../data/AG-news/cached_train_gpt2-medium_192_sst-2
Traceback (most recent call last):
  File "../train_GeDi.py", line 193, in train
    from apex import amp
ModuleNotFoundError: No module named 'apex'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../train_GeDi.py", line 1103, in <module>
    main()
  File "../train_GeDi.py", line 1052, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "../train_GeDi.py", line 195, in train
    raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
ImportError: Please install apex from https://www.github.com/nvidia/apex to use fp16 training.`

This happens even though apex is installed. How can I resolve this issue?
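One thing that may be worth checking here is whether the interpreter that runs the training script can actually see the package, since the first traceback reports `No module named 'apex'`. The following is a minimal sketch using only the standard library; the helper name `module_available` is hypothetical and is not part of GeDi or apex.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the current interpreter can locate a top-level module."""
    # find_spec returns None when the module cannot be found on sys.path.
    return importlib.util.find_spec(name) is not None


# Show which Python is running, then whether it can locate apex.
print("Python executable:", sys.executable)
print("apex importable:", module_available("apex"))
```

If this prints `False` in the same environment that runs `run_training.sh`, the package was probably installed for a different interpreter, or its build did not complete.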

akhileshgotmare commented 2 years ago

Hi! Did you follow the commands in this script for setting up apex? https://github.com/salesforce/GeDi/blob/master/scripts/setup.sh Reference: https://github.com/NVIDIA/apex#linux

InaamHassan commented 2 years ago

Yes, I did follow those commands. They did not help. I identified the issue and resolved it by just commenting out the exception that was raised. It installs without any errors after we do that. But the next thing I face is that, at training time, it raises another exception:

Epoch: 0% 0/1 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "../train_GeDi.py", line 1103, in <module>
    main()
  File "../train_GeDi.py", line 1052, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "../train_GeDi.py", line 355, in train
    loss_a*=loss_mask
  File "/usr/local/lib/python3.7/dist-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/amp/wrap.py", line 53, in wrapper
    return orig_fn(*args, **kwargs)
RuntimeError: Output 0 of SplitBackward is a view and is being modified inplace. This view is an output of a function that returns multiple views. Inplace operators on such views is forbidden. You should replace the inplace operation by an out-of-place one.

I cannot find any clue on how to solve this. I found no resources online, and I have tried altering as much of the code as I can, but to no avail.

InaamHassan commented 2 years ago

I was able to resolve this error. You just have to change

`loss_a*=loss_mask`
`loss_b*=loss_mask`

to

`loss_a = loss_a * loss_mask`
`loss_b = loss_b * loss_mask`

in train_GeDi.py at line 355. The error occurs because the first form (`*=`) performs an in-place operation internally.
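The difference between the in-place and out-of-place forms can be reproduced outside GeDi. The sketch below is only an illustration: it assumes the two loss tensors come from `torch.split` (suggested by `SplitBackward` in the error message, not confirmed against the training script), and the function name and tensor shapes are made up for the example.

```python
import torch


def apply_mask_out_of_place(losses, loss_mask):
    """Split the losses in two and mask each half without in-place ops."""
    # torch.split returns several views of the same underlying tensor.
    loss_a, loss_b = torch.split(losses, losses.shape[0] // 2, dim=0)
    # Out-of-place multiplication creates new tensors instead of
    # modifying the views, which autograd accepts.
    loss_a = loss_a * loss_mask
    loss_b = loss_b * loss_mask
    return loss_a, loss_b


x = torch.ones(4, 3, requires_grad=True)
losses = x * 2.0  # tracked by autograd
loss_mask = torch.tensor([1.0, 1.0, 0.0])

try:
    view_a, view_b = torch.split(losses, 2, dim=0)
    view_a *= loss_mask  # in-place update of one of several views
    print("in-place update was accepted by this PyTorch version")
except RuntimeError as err:
    print("in-place update rejected:", str(err).splitlines()[0])

loss_a, loss_b = apply_mask_out_of_place(losses, loss_mask)
(loss_a.sum() + loss_b.sum()).backward()
print(x.grad)
```

The out-of-place version costs one extra temporary tensor per half, which is usually negligible next to the model's activations.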

InaamHassan commented 2 years ago

I am running my code on Google Colab with 12 GB of RAM and on CUDA, but it gives me this error:

RuntimeError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 11.17 GiB total capacity; 10.42 GiB already allocated; 1.81 MiB free; 10.66 GiB reserved in total by PyTorch)

The CUDA memory overflows on an allocation of just 12 MiB. How can I free up memory, given that PyTorch has reserved most of it? What I have tried on my end is

  1. Cache Cleaning
  2. Runtime Restart
  3. Reduced dataset

But to no avail.
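The numbers in the error message can be read more closely. The sketch below parses the message quoted above and computes how much of the reserved memory is cached but not in use; the parser `parse_cuda_oom` is a hypothetical helper written for this message format and is not a PyTorch API.

```python
import re

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_cuda_oom(message):
    """Return the sizes quoted in a CUDA OOM message, converted to MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


message = (
    "RuntimeError: CUDA out of memory. Tried to allocate 12.00 MiB "
    "(GPU 0; 11.17 GiB total capacity; 10.42 GiB already allocated; "
    "1.81 MiB free; 10.66 GiB reserved in total by PyTorch)"
)

sizes = parse_cuda_oom(message)
# Reserved memory includes what is allocated to live tensors; the
# difference is the cached part that is not currently in use.
cached_unused = sizes["reserved"] - sizes["allocated"]
print(f"allocated to live tensors: {sizes['allocated']:.2f} MiB")
print(f"reserved but unused cache: {cached_unused:.2f} MiB")
```

Read this way, almost all of the reserved memory is held by live tensors rather than by the cache, so the small 12 MiB request is only the allocation that happened to hit the limit.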