Closed: abhisha1991 closed this issue 1 year ago
Hi @abhisha1991, unfortunately I've not encountered the error before, so I'm not 100% sure the following will work, but it is still worth a try: remove

`-m torch.distributed.launch --nproc_per_node=1`

from the bash script. That way there will only be a single PyTorch process running the code, and `args.local_rank` will be automatically set to -1. If this gives you any error, let me know. Solution 2 is probably much less work, so I suggest trying that first.
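The `local_rank` behavior mentioned above can be sketched with a minimal argparse example (a hypothetical training script, not the repo's actual code): `torch.distributed.launch` is what passes `--local_rank` to each worker process, so removing the launcher leaves the argument at its default of -1, which such scripts conventionally interpret as single-process, non-distributed training.

```python
import argparse

# Hypothetical sketch of a PyTorch-style training script's argument
# handling; -1 conventionally means "not running under the launcher".
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1,
                    help="set automatically by torch.distributed.launch")

# Plain `python train.py`: no --local_rank flag, so the default applies.
args = parser.parse_args([])
print(args.local_rank)  # -1 -> single-process (non-distributed) path

# Under the launcher, each worker would instead receive e.g. --local_rank=0.
args = parser.parse_args(["--local_rank", "0"])
print(args.local_rank)  # 0 -> distributed path
```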
Hello @martiansideofthemoon, I hope you are doing great. I am facing another problem, related to run_finetune_paraphrase.sh. I am trying to run it in Google Colab; execution takes a few seconds and then reports that the CUDA GPU is out of memory. I have also experimented with changing the batch size in the .sh file, but it didn't work. I would appreciate your help with that.
Hi @TufailAhmadSiddiq, what's the smallest batch size you tried? Reducing the batch size is fine, since you can use gradient accumulation to keep a larger effective batch size.
Thanks for the reply. The minimum batch size I have used is 2, but I am still facing the same problem.
Is this with GPT2-large? As long as batch size 1 fits, it should be OK. You can also change GPT2-large to GPT2-medium; it doesn't drop performance much. Another option is gradient checkpointing.
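The gradient accumulation idea can be illustrated with a small framework-free sketch (toy numbers, not the repo's training loop): summing the averaged gradients of several micro-batches equals the gradient of the combined batch, so batch size 2 with several accumulation steps matches one larger batch. In PyTorch this corresponds to dividing the loss by the number of accumulation steps, calling `loss.backward()` for each micro-batch, and calling `optimizer.step()` only once per accumulated batch.

```python
# Toy example: gradient of mean squared error for the model y = w * x.
def grad(w, batch):
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
micro_batches = [data[:2], data[2:]]          # micro-batch size 2

full_grad = grad(w, data)                      # one batch of 4
accumulated = sum(grad(w, mb) / len(micro_batches) for mb in micro_batches)

print(full_grad, accumulated)  # -30.0 -30.0: identical parameter updates
```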
Can you please point out where I should make these changes?
Change this line to gpt2-medium: https://github.com/martiansideofthemoon/style-transfer-paraphrase/blob/master/style_paraphrase/examples/run_finetune_paraphrase.sh#L23, and reduce the batch size here: https://github.com/martiansideofthemoon/style-transfer-paraphrase/blob/master/style_paraphrase/examples/run_finetune_paraphrase.sh#L32
Thanks for the guidance. I will try it and check whether it works.
Hello! Hope you are doing well. I am trying to fine-tune your model on my custom dataset. When I run `!style_paraphrase/examples/run_finetune_paraphrase.sh`, I get the following error:

I followed the first two steps of "Custom Datasets" in this repository. At the third step, while converting the BPE codes to fairseq binaries, a "Permission denied" error occurs.
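As an aside on "Permission denied" errors in general (an assumption here, not confirmed from these logs): one common cause is a script file lacking the execute bit, which `chmod +x script.sh`, or invoking the script via `bash script.sh`, resolves. A small Python reproduction with a throwaway script:

```python
import os
import stat
import subprocess
import tempfile

# Create a throwaway shell script without the execute bit set.
script = os.path.join(tempfile.mkdtemp(), "demo.sh")
with open(script, "w") as f:
    f.write("#!/bin/bash\necho script ran\n")

# Direct invocation fails with PermissionError until the bit is set.
try:
    subprocess.run([script], check=True)
except PermissionError:
    print("Permission denied, as in the error above")

os.chmod(script, os.stat(script).st_mode | stat.S_IXUSR)  # chmod u+x
out = subprocess.run([script], capture_output=True, text=True)
print(out.stdout.strip())  # script ran
```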
@HassanBinAli I think you are missing the dataset files in the repo. Please download the `train.pickle` file from here and place it at `datasets/paranmt_filtered/train.pickle`.
Thank you. That resolved the error.
Hey Kalpesh and team,
Thanks very much for releasing your work; it is great to see a simple architecture like this applied to something novel. We are trying to get up and running with the base setup: we have downloaded all the data and the corresponding models to the right folders. However, upon running the fine-tuning training, we get the attached error.
Our setup is a cloud VM with 1 GPU (NVIDIA Tesla T4), Ubuntu 18.04, 7.5 GB RAM, PyTorch 1.10, CUDA 11.5. We have confirmed that PyTorch and CUDA are installed and available on the machine (see attachments).
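For reference, a sanity check along those lines can be scripted; this is a generic snippet (not from the repo) that reports PyTorch/CUDA status without crashing even when PyTorch is absent:

```python
# Generic environment probe; safe to run even without PyTorch installed.
try:
    import torch
    print("torch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))
except ImportError:
    print("PyTorch is not installed in this environment")
```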
We would be incredibly grateful if you could release a Docker image with pre-installed dependencies, or tell us the exact failure mode we are hitting below. We are unable to proceed past this error. We are also unable to locate the error logs (in ~/style-transfer-paraphrase/style_paraphrase/logs) and thus cannot work out what is wrong with our setup.