huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Pegasus finetuning: OOM #6711

Closed laibamehnaz closed 3 years ago

laibamehnaz commented 3 years ago

Epoch 0: 91% 5747/6331 [39:52<04:03, 2.40it/s, loss=75.765, v_num=2]
/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:200: UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler.
  warnings.warn(SAVE_STATE_WARNING, UserWarning)
tcmalloc: large alloc 1083260928 bytes == 0x1aece0000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
tcmalloc: large alloc 1354080256 bytes == 0x21e5c000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
tcmalloc: large alloc 1692606464 bytes == 0x7f10651ce000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
tcmalloc: large alloc 2115764224 bytes == 0x7f0fe700e000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
tcmalloc: large alloc 2644705280 bytes == 0x7f0f495de000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
tcmalloc: large alloc 3305881600 bytes == 0x7f0fe700e000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
tcmalloc: large alloc 4132356096 bytes == 0x7f0e530f2000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
tcmalloc: large alloc 5165449216 bytes == 0x7f0f495de000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
./finetune_pegasus_xsum.sh: line 15: 876 Killed

I appreciate any help. Thank you.

patil-suraj commented 3 years ago

Hi @laibamehnaz, can you also post your env info?

patil-suraj commented 3 years ago

I have seen this issue with Colab: when RAM usage suddenly increases, Colab just crashes the process. Were you using Colab?

Just to confirm, can you try using fewer examples? You can control the number of training examples with --n_train.

OR

My guess is that at the point where it crashed, it may have received a longer sentence, which made the whole batch large. If you are running on a single GPU, you can use the sortish sampler, which puts the longest batches first so these errors surface early; it can be enabled with --sortish_sampler.
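
Roughly, the idea looks like the hypothetical sketch below (the actual SortishSampler in the examples also keeps some randomness; this only shows the longest-first ordering, and the class name is made up):

# Hypothetical illustration only: order examples longest-first so that an
# OOM caused by long sequences shows up in the very first batches.
from torch.utils.data import Sampler

class LongestFirstSampler(Sampler):
    def __init__(self, src_lengths):
        # src_lengths: number of source tokens for each example
        self.src_lengths = src_lengths

    def __iter__(self):
        order = sorted(range(len(self.src_lengths)),
                       key=lambda i: self.src_lengths[i],
                       reverse=True)
        return iter(order)

    def __len__(self):
        return len(self.src_lengths)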

laibamehnaz commented 3 years ago

Yes, I am using Colab. Sure, I will check with --sortish_sampler.

laibamehnaz commented 3 years ago

So I tried what you said but still get the same error. I tried with fewer training examples, still the same error. I tried with fewer validation examples as well. It seems like this error comes every time the first validation loop ends, no matter the dataset size.

patil-suraj commented 3 years ago

I see. Could you post your env info (GPU, RAM, etc.) and the specific command you ran that resulted in this error? I will try to reproduce it.

laibamehnaz commented 3 years ago

GPU: Tesla K80, RAM: 12 GB

./finetune_pegasus_xsum.sh \
  --data_dir ./data/ \
  --output_dir ./output/ \
  --train_batch_size=2 \
  --eval_batch_size=2 \
  --val_check_interval 0.5 \
  --num_train_epochs 3 \
  --gradient_accumulation_steps 128 \
  --model_name_or_path google/pegasus-xsum

sshleifer commented 3 years ago

I'd add --freeze_embeds --sortish_sampler. LMK how it goes, happy to help!

laibamehnaz commented 3 years ago

I have tried with them as well. Same issue :(

laibamehnaz commented 3 years ago

Hi @sshleifer, looks like I hadn't tried --freeze_embeds before. It works well now. Thanks a lot.

Even though training now runs to completion, I still get something like this:

Epoch 3: 100% 300/300 [09:26<00:00, 1.89s/it, loss=97.408, v_num=6]
./finetune_pegasus_xsum.sh: line 16: 551 Killed

Also, it doesn't generate summaries for the test set, even with --do_predict.

patil-suraj commented 3 years ago

@sshleifer similar issue here #6665

sshleifer commented 3 years ago

--do_predict doesn't work (there is a PL bug), you have to use run_eval.py to evaluate.

Here is the command I ran to evaluate pegasus-xsum on the xsum/test data:

mkdir gens
export DATA_DIR=xsum
python run_eval.py google/pegasus-xsum \
    $DATA_DIR/test.source gens/peg_xsum_test_generation.txt \
    --reference_path $DATA_DIR/test.target \
    --score_path gens/peg_xsum_rouge.txt --task summarization \
    --device cuda \
    --bs 8
sshleifer commented 3 years ago

I have seen "killed" at/towards the end of training a few times and just ignore it.

laibamehnaz commented 3 years ago

Alright. Thank you so much :))

patil-suraj commented 3 years ago

@sshleifer I got a core dump with this, even with --freeze_embeds, on a 16GB P100

!bash finetune_pegasus_xsum.sh \
  --train_batch_size 2 \
  --eval_batch_size 4 \
  --model_name_or_path google/pegasus-large \
  --n_train 256 \
  --n_val 256 \
  --output_dir xsum_pegasus_test_4 \
  --data_dir xsum \
  --gpus 1 \
  --sortish_sampler \
  --val_check_interval 0.02
 File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_io.py", line 273, in save_checkpoint
    self._atomic_save(checkpoint, filepath)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_io.py", line 264, in _atomic_save
    torch.save(checkpoint, tmp_path)
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 365, in save
    return
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 258, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:262] . unexpected pos 869515520 vs 869515408
terminate called after throwing an instance of 'c10::Error'
  what():  [enforce fail at inline_container.cc:262] . unexpected pos 869515520 vs 869515408
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x7f71b2235fd7 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x228ff30 (0x7f71eb51af30 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x228c163 (0x7f71eb517163 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamWriter::writeRecord(std::string const&, void const*, unsigned long, bool) + 0x17b (0x7f71eb51c10b in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamWriter::writeEndOfFile() + 0xe1 (0x7f71eb51cca1 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: caffe2::serialize::PyTorchStreamWriter::~PyTorchStreamWriter() + 0x115 (0x7f71eb51d495 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x5a35e3 (0x7f71f9e0b5e3 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x273c00 (0x7f71f9adbc00 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x274e4e (0x7f71f9adce4e in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #9: python3() [0x588a98]
frame #10: python3() [0x5ad558]
frame #11: python3() [0x5ad56e]
frame #12: python3() [0x5ad56e]
frame #13: python3() [0x5ad56e]
frame #14: python3() [0x5ad56e]
frame #15: python3() [0x5ad56e]
frame #16: python3() [0x5ad56e]
frame #17: python3() [0x5ad56e]
frame #18: python3() [0x5ad56e]
frame #19: python3() [0x5ad56e]
frame #20: python3() [0x5ad56e]
frame #21: python3() [0x5ad56e]
frame #22: python3() [0x5ad56e]
frame #23: python3() [0x5ad56e]
frame #24: python3() [0x5ad56e]
frame #25: python3() [0x5ad56e]
frame #26: python3() [0x5ad56e]
frame #27: python3() [0x56b636]
<omitting python frames>
frame #33: __libc_start_main + 0xe7 (0x7f72058bdb97 in /lib/x86_64-linux-gnu/libc.so.6)

finetune_pegasus_xsum.sh: line 14:  2967 Aborted                 (core dumped) python finetune.py --learning_rate=1e-4 --do_train --do_predict --n_val 1000 --val_check_interval 0.25 --max_source_length 512 --max_target_length 56 --freeze_embeds --max_target_length 56 --label_smoothing 0.1 "$@"

GPU: P100 (16 GB), RAM: 12 GB

colab

sshleifer commented 3 years ago

Crazy traceback. Is that torch 1.6 @patil-suraj ?

patil-suraj commented 3 years ago

Yes, 1.6.0+cu101

patil-suraj commented 3 years ago

Also, --freeze_embeds is already present in finetune_pegasus_xsum.sh here, so I'm a bit confused about how adding an extra --freeze_embeds solved @laibamehnaz's issue.

laibamehnaz commented 3 years ago

Yes, I was very confused about that too.

laibamehnaz commented 3 years ago

I tried with just 100 training examples and an extra --freeze_embeds, and it worked. Now I am trying it on the entire dataset and checking.

patil-suraj commented 3 years ago

LMK how it goes. I'm thinking this is not GPU OOM; this issue was previously observed when RAM usage suddenly increased on Colab.

laibamehnaz commented 3 years ago

Same issue again :/

Epoch 0: 50% 3415/6831 [49:00<49:00, 1.16it/s, loss=88.546, v_num=7]
/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:200: UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler.
  warnings.warn(SAVE_STATE_WARNING, UserWarning)
tcmalloc: large alloc 1202077696 bytes == 0x181364000 @ 0x7fa01e612615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7fa0128cb950 0x7fa0128cfbf7 0x7fa012c007e8 0x7fa012bb61b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
tcmalloc: large alloc 1502601216 bytes == 0x212e2000 @ 0x7fa01e612615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7fa0128cb950 0x7fa0128cfbf7 0x7fa012c007e8 0x7fa012bb61b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
tcmalloc: large alloc 1878253568 bytes == 0x7f9ef00c2000 @ 0x7fa01e612615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7fa0128cb950 0x7fa0128cfbf7 0x7fa012c007e8 0x7fa012bb61b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
tcmalloc: large alloc 2347819008 bytes == 0x7f9e641b4000 @ 0x7fa01e612615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7fa0128cb950 0x7fa0128cfbf7 0x7fa012c007e8 0x7fa012bb61b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
tcmalloc: large alloc 2934775808 bytes == 0x7f9db52e2000 @ 0x7fa01e612615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7fa0128cb950 0x7fa0128cfbf7 0x7fa012c007e8 0x7fa012bb61b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
tcmalloc: large alloc 3668475904 bytes == 0x7f9e641b4000 @ 0x7fa01e612615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7fa0128cb950 0x7fa0128cfbf7 0x7fa012c007e8 0x7fa012bb61b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
tcmalloc: large alloc 4585594880 bytes == 0x7f9ca3db8000 @ 0x7fa01e612615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7fa0128cb950 0x7fa0128cfbf7 0x7fa012c007e8 0x7fa012bb61b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
tcmalloc: large alloc 5731999744 bytes == 0x7f9db52e2000 @ 0x7fa01e612615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7fa0128cb950 0x7fa0128cfbf7 0x7fa012c007e8 0x7fa012bb61b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4
./finetune_pegasus_xsum.sh: line 16: 453 Killed

patil-suraj commented 3 years ago

Again after the validation loop?

laibamehnaz commented 3 years ago

Yes, exactly after the first validation loop.

patil-suraj commented 3 years ago

Did you notice the RAM usage? It seems related to serialisation or RAM.

laibamehnaz commented 3 years ago

No, I didn't. I don't think I can check now, right?

patil-suraj commented 3 years ago

Yes, you need to check while it's executing. You can see RAM usage in the top-right corner of Colab while it runs.
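
If the Colab indicator is too coarse, something like the sketch below can log RAM from inside the process (assumes psutil, which ships with Colab; where you call it is up to you, e.g. right before and after the validation loop):

# Illustrative only: print host RAM usage to spot where the spike happens.
import psutil

def log_ram(tag=""):
    vm = psutil.virtual_memory()
    proc_gb = psutil.Process().memory_info().rss / 1e9
    print(f"[{tag}] process RSS: {proc_gb:.2f} GB | "
          f"system used: {vm.used / 1e9:.2f} / {vm.total / 1e9:.2f} GB")

log_ram("after validation loop")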

laibamehnaz commented 3 years ago

Sure, lemme run it again and check. Will let you know.

laibamehnaz commented 3 years ago

Same thing again: it fails right after the first validation loop. RAM usage at the exact end of the validation loop was 7.86GB/12.72GB, and then the same error as before.

patil-suraj commented 3 years ago

@laibamehnaz tcmalloc is Google's fancy malloc alternative, and it prints this message when it thinks the requested memory might exceed the available memory.

tcmalloc: large alloc 5731999744 bytes means it's trying to allocate ~5.73 GB, so I think memory usage peaks when saving the checkpoint (which is large for Pegasus, ~5 GB), and that is what is causing this error.
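
As a rough back-of-the-envelope check (the parameter count is approximate and the optimizer overhead depends on which optimizer is used, so treat this as a sketch, not a measurement):

# Rough estimate only, not measured from this run.
n_params = 568_000_000                   # google/pegasus-large, approximately
weights_gb = n_params * 4 / 1e9          # fp32 state_dict: ~2.3 GB
# A full Lightning checkpoint also stores optimizer state (roughly 2x the
# weights for Adam-style optimizers), and serializing it can need extra
# temporary buffers in RAM, which is consistent with a multi-GB allocation.
print(f"weights alone: ~{weights_gb:.1f} GB")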

Strangely, I ran this multiple times with 100 XSUM examples on the same K80 / 12 GB RAM instance and didn't see this error. This is the command I used:

!bash finetune_pegasus_xsum.sh \
  --model_name_or_path google/pegasus-xsum \
  --data_dir xsum \
  --output_dir xsum_pegasus_test_4 \
  --train_batch_size 2 \
  --eval_batch_size 2 \
  --num_train_epochs 1 \
  --n_train 100 \
  --n_val 100 \
  --sortish_sampler \
  --gpus 1 \
  --val_check_interval 0.25 \
  --gradient_accumulation_steps 4

colab

Maybe switching to a higher-RAM instance would solve this issue, but let's wait for @sshleifer's answer.

laibamehnaz commented 3 years ago

Right right, I understand. Actually, I am running this on my own dataset, and not on XSUM.

mc2259 commented 3 years ago

%%bash
source venv/bin/activate
cd transformers
cd examples
cd seq2seq

./finetune.sh \
  --data_dir /content/xsum \
  --train_batch_size=1 \
  --eval_batch_size=1 \
  --output_dir=xsum_results \
  --num_train_epochs 1 \
  --model_name_or_path facebook/bart-large

I am getting an error where the fine-tuning gets killed. This is the command I am running to fine-tune BART; I am using a Python 3 virtual environment on a Google Cloud Platform instance. I think the issue is that torch.cuda.is_available() currently returns False. When I tried running it in the terminal, it said I need to install an NVIDIA driver, but when I run sudo /opt/deeplearning/install-driver.sh, I get an error.

sshleifer commented 3 years ago

@mc2259 I wish I could help. When I get NVIDIA driver errors on GCP I just get a new instance 🤷. Feel free to start a GCP troubleshooting thread on the forums! I am a big GCP user and would love to exchange tips and tricks.

https://discuss.huggingface.co/

sshleifer commented 3 years ago

@patil-suraj your intuition sounds reasonable, I am on a higher ram instance and can't reproduce.

Smaller Checkpoint strategies:

At some point I was passing save_weights_only to the checkpointer here, but I had subsequent --resume_from_checkpoint failure. This was on pl 0.8.1 so I would welcome a PR that added it back and tested that --resume_from_checkpoint works on the skinnier checkpoint.
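
For reference, a minimal sketch of what passing save_weights_only to a Lightning ModelCheckpoint could look like (pl 0.8.x-era arguments; the exact place to wire this into finetune.py may differ, and as noted it breaks --resume_from_checkpoint because optimizer state is dropped):

# Sketch only: a weights-only checkpoint callback in PyTorch Lightning.
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    filepath="output_dir/{epoch}-{val_loss:.2f}",  # pl 0.8.x-style argument
    monitor="val_loss",
    save_top_k=1,
    save_weights_only=True,   # smaller file, but no optimizer/scheduler state
)
# trainer = pl.Trainer(checkpoint_callback=checkpoint_callback, ...)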

patil-suraj commented 3 years ago

@sshleifer, looking at the PL docs, save_weights_only won't save the optimizer state, so resume_from_checkpoint will fail. I'll see if there's an alternative.

I think Seq2SeqTrainer will take care of such issues, since it saves each state_dict separately.

laibamehnaz commented 3 years ago

@patil-suraj your intuition sounds reasonable, I am on a higher ram instance and can't reproduce.

Smaller Checkpoint strategies:

At some point I was passing save_weights_only to the checkpointer here, but I had subsequent --resume_from_checkpoint failure. This was on pl 0.8.1 so I would welcome a PR that added it back and tested that --resume_from_checkpoint works on the skinnier checkpoint.

Alright, I can try this. For model inference, saving only the weights should be fine. Thank you.

laibamehnaz commented 3 years ago

I trained the model on my entire dataset, but when I generate summaries, the first word is always missing.

A few examples of the summaries generated on the test set:

doesn't want to go to Chloe's family reunion. Ivy wants Carter to stay with his family. got a Christmas tree from Sammie's work. Bart and Ingrid want to buy it for Noah's nativity. 's guitar broke down. The sound engineer told him to put it on the blacklist. Trent and Hugh will see each other in Kingston next show. 's birthday is on Saturday at 8. bought a black top from H&M. Marie will give it back to Sophie. and Maxx are going to the birthday party.

What could be the mistake here? And how can I fix this? Thank you.

patil-suraj commented 3 years ago

Hey @laibamehnaz ,

set force_bos_token_to_be_generated in the config file to False, before you call .generate

you can do it using model.config.force_bos_token_to_be_generated = False

Also, did switching to a high-RAM instance solve your issue?
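
For completeness, a sketch of that change at generation time (the checkpoint path and input text are placeholders; use the output directory from your own fine-tuning run):

# Sketch only: disable forcing the BOS token before generating.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_dir = "path/to/your/finetuned/checkpoint"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

model.config.force_bos_token_to_be_generated = False  # the suggested change

batch = tokenizer(["Dialogue to summarize ..."], return_tensors="pt",
                  truncation=True, padding=True)
summary_ids = model.generate(**batch)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))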

laibamehnaz commented 3 years ago

Sure, lemme try this and check. Actually, I don't have the resources to switch to a higher-RAM instance. I tried with the 12 GB RAM itself, but only saved the weights, as suggested by @sshleifer.

laibamehnaz commented 3 years ago

Hey @laibamehnaz ,

set force_bos_token_to_be_generated in the config file to False, before you call .generate

you can do it using model.config.force_bos_token_to_be_generated = False

Also, did switching to a high-RAM instance solve your issue?

Doesn't seem to help with the summaries. They are still being generated without the first word.

sshleifer commented 3 years ago

I'm investigating and will update in either direction tonight.

laibamehnaz commented 3 years ago

Thank you so much :))

sshleifer commented 3 years ago

My fix works, master will have it when #6654 is merged, which will probably be next week.

mc2259 commented 3 years ago

I am running into a similar issue with this command:

%%bash
source venv/bin/activate
cd transformers
cd examples
cd seq2seq

./finetune.sh \
  --data_dir ./xsum \
  --train_batch_size=1 \
  --eval_batch_size=1 \
  --output_dir=xsum_results \
  --num_train_epochs 1 \
  --model_name_or_path facebook/bart-large

I am getting:

./finetune.sh: line 14: 16531 Killed python finetune.py --learning_rate=3e-5 --fp16 --gpus 1 --do_train --do_predict --n_val 1000 --val_check_interval 0.1 "$@"

mc2259 commented 3 years ago

@patil-suraj I tried running the Pegasus fine-tuning commands on the Colab you linked after connecting to my local runtime, and I still get 'Killed'. I am not sure why that is happening. I am connected to my local virtual machine runtime and am using a virtual environment.

https://colab.research.google.com/drive/1pdHcn2E5CjSGVEEOQo2qCi5Bby5-pOOZ

patil-suraj commented 3 years ago

Hey @mc2259, when did it get killed? If it got killed at the end of training then it's fine; as @sshleifer said, there's a bug with --do_predict.

mc2259 commented 3 years ago

@patil-suraj It is getting killed in the beginning. The same thing is happening with my BART fine-tuning. I am uploading screenshots if that is helpful.

sshleifer commented 3 years ago

I have it working pretty well on master on an NVIDIA RTX with the following command. (It might still get killed, it's still running, but it's now at ROUGE2 >= 23.6 for XSUM, where the goal is ~24.5.) Sorry it's gross. Feel free to pretty it up and repost.

Have a great weekend!

Command

python finetune.py   --task summarization   --learning_rate=3e-4   --do_train   --do_predict \
       --val_check_interval 0.25 --n_val 1000   --data_dir xsum   --max_source_length 512 --max_target_length=56 \
       --freeze_embeds   --model_name_or_path google/pegasus-large   --tokenizer_name google/pegasus-xsum  \
       --warmup_steps 500   --dropout 0.1 --attention_dropout 0.1 --label_smoothing 0.1   \
       --train_batch_size=$BS --eval_batch_size=8 --gradient_accumulation_steps=$GAS   \
       --logger_name wandb   --sortish_sampler --gpus 1   --output_dir xsum_ft_ls_mask_fix \
       --num_train_epochs 6 --adafactor