huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Cannot use multiple GPUs to finetune RAG using sample code with customized knowledge #11592

Closed Caplimbo closed 3 years ago

Caplimbo commented 3 years ago

Environment info

@patrickvonplaten, @lhoestq

Information

Model I am using: RAG-sequence-base

The problem arises when using: the official example scripts (examples/research_projects/rag/finetune_rag.sh)

The task I am working on is: my own task/dataset. A simple test input CSV, as described in the README of the RAG finetuning example, is used to create my own knowledge dataset: test.csv

To reproduce

Steps to reproduce the behavior:

  1. Create a knowledge dataset using use_own_knowledge_dataset.py with any sample CSV.
  2. Run finetune_rag.sh against this dataset with multiple GPUs (a command sketch is included after the config block below).
  3. Training gets stuck while loading the knowledge dataset; I'm not sure whether it is the index loading or something else.

configs I use

    --data_dir $DATA_DIR \
    --output_dir $OUTPUT_DIR \
    --model_name_or_path $MODEL_NAME_OR_PATH \
    --model_type rag_sequence \
    --fp16 \
    --gpus 4 \
    --profile \
    --do_train \
    --do_predict \
    --n_val -1 \
    --train_batch_size 8 \
    --eval_batch_size 1 \
    --max_source_length 128 \
    --max_target_length 25 \
    --val_max_target_length 25 \
    --test_max_target_length 25 \
    --label_smoothing 0.1 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --weight_decay 0.001 \
    --adam_epsilon 1e-08 \
    --max_grad_norm 0.1 \
    --lr_scheduler polynomial \
    --learning_rate 3e-05 \
    --num_train_epochs 100 \
    --warmup_steps 500 \
    --gradient_accumulation_steps 1 \
    --index_name custom \
    --passages_path ../try/my_knowledge_dataset \
    --index_path ../try/my_knowledge_dataset_hnsw_index.faiss 
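
For reference, a minimal sketch of the two commands I run (paths are placeholders, and the use_own_knowledge_dataset.py flag names are from memory, so please double-check them against the script's --help):

    # 1. Build the custom knowledge dataset + HNSW FAISS index from the sample CSV.
    #    This writes my_knowledge_dataset/ and my_knowledge_dataset_hnsw_index.faiss into --output_dir.
    python examples/research_projects/rag/use_own_knowledge_dataset.py \
        --csv_path ../try/test.csv \
        --output_dir ../try

    # 2. Launch finetuning with the flags listed above (finetune_rag.sh wraps finetune_rag.py).
    sh examples/research_projects/rag/finetune_rag.sh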

I'm using four Tesla T4s as my GPUs, with faiss-cpu==1.6.3, datasets==1.0.1, and pyarrow==0.17.1; switching to Ray does not solve the problem either.

Expected behavior

Finish loading the index and proceed to training.

lhoestq commented 3 years ago

Hi ! Is there an error message ? Is there any CPU activity when it gets stuck (maybe the index is just being loaded) ?

Caplimbo commented 3 years ago

I'm not getting any error message, and yes, there is CPU activity. But given that my dataset is quite small (see test.csv), would it really take that much time (>5 minutes) to load it? And by the way, where would the index be loaded? I'm confused by the instruction to use faiss-cpu. Would using the GPU be better?

I will be able to provide some screenshots of this issue later.

Caplimbo commented 3 years ago

I notice that for single-GPU training there is no such step as initializing a retriever. If I use Ray, I get stuck here: [screenshot]. At that point nvidia-smi shows: [screenshot], while top shows: [screenshot]. I'm not sure how to illustrate this issue further; is it because my CPU gets overloaded?

lhoestq commented 3 years ago

For such tasks CPU usage often means that FAISS (the indexing library) is doing something. Did you try interrupting the program to see if the stacktrace could help us locate at which line the code is stuck ?
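
If Ctrl+C doesn't give a useful trace, a tool like py-spy can dump the Python stack of the hung process without killing it. A minimal sketch (replace <PID> with the PID of the stuck process shown by top / nvidia-smi):

    pip install py-spy
    py-spy dump --pid <PID>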

Caplimbo commented 3 years ago

At the beginning of the hang I can use Ctrl+C to interrupt, which gives the following lines:

^CTraceback (most recent call last):
  File "examples/research_projects/rag/finetune_rag.py", line 625, in <module>
    main(args)
  File "examples/research_projects/rag/finetune_rag.py", line 597, in main
    profiler=pl.profiler.AdvancedProfiler() if args.profile else None,
  File "/root/transformers-master/examples/research_projects/rag/lightning_base.py", line 389, in generic_train
    trainer.fit(model)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 138, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 249, in ddp_train
    self.model_to_device(model, process_idx)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 178, in model_to_device
    model.cuda(self.trainer.root_gpu)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/utilities/device_dtype_mixin.py", line 124, in cuda
    return super().cuda(device=device)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 491, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 491, in <lambda>
    return self._apply(lambda t: t.cuda(device))
KeyboardInterrupt
Traceback (most recent call last):
  File "/root/transformers-master/examples/research_projects/rag/finetune_rag.py", line 625, in <module>
    main(args)
  File "/root/transformers-master/examples/research_projects/rag/finetune_rag.py", line 597, in main
    profiler=pl.profiler.AdvancedProfiler() if args.profile else None,
  File "/root/transformers-master/examples/research_projects/rag/lightning_base.py", line 389, in generic_train
    trainer.fit(model)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 138, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 249, in ddp_train
    self.model_to_device(model, process_idx)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 178, in model_to_device
    model.cuda(self.trainer.root_gpu)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/utilities/device_dtype_mixin.py", line 124, in cuda
    return super().cuda(device=device)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 491, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 491, in <lambda>
    return self._apply(lambda t: t.cuda(device))

And when checking top, I see that among the red-marked processes (2635, 2636), 2636 stays there the whole time while 2635 comes and goes: [screenshot]

After a while I can no longer use Ctrl+C to interrupt it, and when I try to kill the process, this is the result:

2021-05-06 00:28:50,962 WARNING worker.py:1115 -- The autoscaler failed with the following error:
Terminated with signal 15
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 376, in <module>
    monitor.run()
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 284, in run
    self._run()
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 202, in _run
    time.sleep(AUTOSCALER_UPDATE_INTERVAL_S)

2021-05-06 00:28:50,962 WARNING worker.py:1115 -- The autoscaler failed with the following error:
Terminated with signal 15
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 376, in <module>
    monitor.run()
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 284, in run
    self._run()
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 202, in _run
    time.sleep(AUTOSCALER_UPDATE_INTERVAL_S)

2021-05-06 00:28:50,967 WARNING worker.py:1115 -- The autoscaler failed with the following error:
Terminated with signal 15
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 376, in <module>
    monitor.run()
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 284, in run
    self._run()
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 202, in _run
    time.sleep(AUTOSCALER_UPDATE_INTERVAL_S)

lhoestq commented 3 years ago

This line self.model_to_device(model, process_idx) says that the model is being loaded on GPU. I'm not sure why it would make the script hang though. Pinging @shamanez who has been playing with this script for #10410

If it really is because of the model loading to GPU then you may need to wait a bit more (though I'm surprised it could take that much time).

Caplimbo commented 3 years ago

As far as I can tell, these lines don't mean much; they might just be a result of my interrupting too early. The later ones in monitor.py seem more related to why it gets stuck.

Caplimbo commented 3 years ago

And by the way, when I use Ray I run finetune_rag_ray.sh from the same directory, with gpus=4 and num_retrievers=2. Using the PyTorch distributed retriever leads to similar problems (PyTorch always has only one retriever, and when I set num_retrievers=1 with Ray I still get stuck), while gpus=1 finetunes with no issues.

shamanez commented 3 years ago

@lhoestq @Caplimbo

So what you are saying is that the script can't even start the training loop, right?

Can you please let me know the size of your passage set and FAISS index? Also how much RAM you have.

Caplimbo commented 3 years ago

I'm using a passage set of about 30M, and so is the index; both are built from the test.csv I attached in the issue (see the first comment). RAM is 128G, so I guess it's not a RAM issue... Have you ever tested multiple GPUs with a customized index?

shamanez commented 3 years ago

Yeah, it worked for me. 30 million passages, right?

Caplimbo commented 3 years ago

Nope, just 30 MB in size. I attached the file test.csv in my original comment, and I use use_own_knowledge_dataset.py to process it. By the way, I had to remove the git-related code since I have a poor connection to GitHub; could this cause potential problems?

Caplimbo commented 3 years ago

Or could you please share your environment settings? I am using datasets==1.0.1 and pyarrow==0.17.1, since higher versions report errors when using Ray, which seems to be the same problem as mentioned here: https://discuss.ray.io/t/cant-pickle-pyarrow-dataset-expression/1685/7. Anyway, training on a single GPU works smoothly for me, so I don't know what the problem might be.
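
For reference, a sketch of how I pin the environment for the Ray setup (the package is datasets on PyPI; the exact install method may differ on your side):

    pip install faiss-cpu==1.6.3 datasets==1.0.1 pyarrow==0.17.1 ray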

shamanez commented 3 years ago

Wow, super weird! I will check it out.

Caplimbo commented 3 years ago

I don't know if it helps, but when using the native PyTorch distributed retriever I get the following before it gets stuck: [screenshot], with these nvidia-smi results: [screenshot]. These PIDs cannot be killed, and I don't know why.

Caplimbo commented 3 years ago

By the way, I also run into similar problems (training cannot proceed) when using BART with a Trainer. With torch 1.8.1 and CUDA 10.2 I get no output and it hangs, but if I switch to CUDA 11.1, one extra line appears before it hangs:

PyTorch version 1.8.1+cu111 available.
begin loading model ...
end loading model!
begin trainning..
  0%|                                                                      | 0/1000 [00:00<?, ?it/s]/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '

And in that case, with CUDA 11.1, finetuning RAG with the PyTorch distributed retriever prints a few extra log lines:

INFO:lightning:initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 3
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 0
in ddp connection init port -1
INFO:distributed_pytorch_retriever:initializing retrieval
INFO:distributed_pytorch_retriever:dist initialized
in ddp connection init port -1
in ddp connection init port -1
INFO:distributed_pytorch_retriever:initializing retrieval
INFO:distributed_pytorch_retriever:initializing retrieval
INFO:distributed_pytorch_retriever:dist initialized
INFO:distributed_pytorch_retriever:dist initialized
in ddp connection init port -1
INFO:distributed_pytorch_retriever:initializing retrieval
INFO:distributed_pytorch_retriever:dist initialized
INFO:root:Added key: store_based_barrier_key:2 to store for rank: 0
INFO:root:Added key: store_based_barrier_key:2 to store for rank: 1
INFO:root:Added key: store_based_barrier_key:2 to store for rank: 3
INFO:root:Added key: store_based_barrier_key:2 to store for rank: 2
INFO:distributed_pytorch_retriever:dist not initialized / main
Loading index from ../covid_QA/try/my_knowledge_dataset_hnsw_index.faiss
Loaded FaissIndex embeddings from ../covid_QA/try/my_knowledge_dataset_hnsw_index.faiss
in ddp end init port -1
in ddp end init port -1
in ddp end init port -1
in ddp end init port -1

But it still gets stuck afterwards. I added a few extra print lines here to see what happens, but so far I haven't learned anything from them.

lhoestq commented 3 years ago

Could you try changing the --distributed-port just in case ?

Caplimbo commented 3 years ago

Will do later, but what values should I test?

lhoestq commented 3 years ago

Any value between 4000 and 40000, just to make sure it's not an issue with the default value -1.
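
For example, adding something like this to your finetune_rag flags (assuming the flag is spelled --distributed_port in finetune_rag.py, which would match the "port -1" default in your logs):

    --distributed_port 8888 \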

Caplimbo commented 3 years ago

No difference, it still gets stuck at the same step.

Caplimbo commented 3 years ago

@lhoestq I switched the CUDA version and can now train on multiple GPUs; it seems there is an issue with the Tesla T4 on CUDA 10.2.

However, I made a small modification to RAG, and now I can only train it with one GPU; with more GPUs I get an OOM error. Any possible reasons for this?

shamanez commented 3 years ago

What kind of change did you make?

Caplimbo commented 3 years ago

I added one additional input to the model and pass it to the retriever to modify the retrieval process. It is just one scalar tensor per sample, so I don't think it would cause a problem by itself...

shamanez commented 3 years ago

So other than the question, you add another tensor, right? Did you also change what you get as output from the retriever?

And did you also modify the input to the answer generator?

Caplimbo commented 3 years ago

The change to the retriever output is minor. For example, suppose one retrieved document has the text "This is example text" before encoding; what I do is prepend a prefix so it becomes something like " This is example text", and then encode that to get context_input_ids. After this operation, nothing more is done to the generator input.
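
Roughly, the kind of thing I do looks like the sketch below (a hypothetical standalone helper, not my actual patch; build_context_input_ids and the "//" separator are just for illustration):

    # Hypothetical sketch: prepend a per-sample prefix to each retrieved passage
    # before tokenizing the combined string into context_input_ids.
    from transformers import BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

    def build_context_input_ids(passages, question, prefix, max_length=128):
        # e.g. "This is example text" -> "<prefix> This is example text // <question>"
        texts = [f"{prefix} {p} // {question}" for p in passages]
        enc = tokenizer(
            texts, truncation=True, max_length=max_length,
            padding="max_length", return_tensors="pt",
        )
        return enc.input_ids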

Caplimbo commented 3 years ago

By the way, the OOM issue doesn't happen at the beginning of an epoch; it happens after 10-20 steps or so. When using the original RAG, ~~I didn't see such behavior of increasing memory demand; does this mean something is wrong with my own implementation?~~ It's the same even if I use the original version of RAG. As far as I can see, using Ray for example, batch size 1 on each GPU occupies almost the same amount of GPU memory as training with batch size 2 on a single GPU. I suspect the parallel implementation also loads part of the retrieval process onto the GPU?

Caplimbo commented 3 years ago

By the way, since I have to do some pretraining on the generator part, I separately trained (or tuned) a BART starting from its pretrained weights on the Hugging Face Hub, and then plugged it into RAG via RagSequenceForGeneration.from_pretrained_question_encoder_generator. I'm a little worried that the RAG model weights provided on the Hub use a different setup than the original BART weights, and maybe that's why I cannot get the loss to go down the way the original RAG does. If that's the case, should I wait until it converges (in, say, 100 epochs?), or should I first separate the generator part from RAG and do my pretraining (tuning) on that? And how can I save the generator part separately?
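
For context, a minimal sketch of how I wire it together (paths are placeholders, and saving the generator via model.rag.generator is my reading of the current RAG code, so treat that part as an assumption):

    # Sketch: build RAG from a DPR question encoder + a separately tuned BART,
    # then save the generator submodule on its own afterwards.
    from transformers import RagSequenceForGeneration

    model = RagSequenceForGeneration.from_pretrained_question_encoder_generator(
        "facebook/dpr-question_encoder-single-nq-base",  # question encoder
        "path/to/my_finetuned_bart",                     # generator (placeholder path to my tuned BART)
    )

    # ... RAG finetuning happens here ...

    # Assumption: the generator lives at model.rag.generator in the current implementation.
    model.rag.generator.save_pretrained("path/to/rag_generator_only")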

shamanez commented 3 years ago

The change to the retriever output is minor. For example, suppose one retrieved document has the text "This is example text" before encoding; what I do is prepend a prefix so it becomes something like " This is example text", and then encode that to get context_input_ids. After this operation, nothing more is done to the generator input.

This might be because your input is now a bit longer, and GPU allocation changes with the length of your input during training, so sometimes you only get an OOM error after a few steps. This is kind of the answer to your second issue too.

shamanez commented 3 years ago

By the way, the OOM issue doesn't happen at the beginning of an epoch; it happens after 10-20 steps or so. When using the original RAG, ~~I didn't see such behavior of increasing memory demand; does this mean something is wrong with my own implementation?~~ It's the same even if I use the original version of RAG. As far as I can see, using Ray for example, batch size 1 on each GPU occupies almost the same amount of GPU memory as training with batch size 2 on a single GPU. I suspect the parallel implementation also loads part of the retrieval process onto the GPU?

No, when using Ray nothing retrieval-related gets loaded onto the GPU. You can see this with the top command: if your index is around 20 GB, you will find the retriever workers occupying that amount of (CPU) memory.

Caplimbo commented 3 years ago

But it's weird that with only 1 GPU I can use a batch size of 2 with no OOM, while using 2 GPUs leads to OOM even with a per-GPU batch size of 1. The batch-size-1 OOM only occurs with my modified RAG, but with the original RAG I still get OOM at batch size 2 per device when using multiple GPUs.

shamanez commented 3 years ago

I think this is due to low GPU memory. Try to use two 32 GB GPUs. I assume the initialization of the DDP process consumes a bit more memory.

@Caplimbo when you run with a single GPU can you send me the memory usage?

Caplimbo commented 3 years ago

@shamanez Sadly I don't have such powerful GPUs. When I use a single GPU, the memory usage with a batch size of 2 is around 15026 MB (as far as I can recall; I cannot check right now since I'm using all GPUs for training). With 4 GPUs I managed to train with batch size 1 per GPU at a usage of 15072 MB per GPU, and I had to reduce max_target_length from 25 to 24 to avoid OOM.

shamanez commented 3 years ago

Yeah, that makes sense.

Caplimbo commented 3 years ago

Really? I don't see why batch size 1 per GPU in a multi-GPU setting would require more memory than batch size 2 on a single GPU...

shamanez commented 3 years ago

Can you send me a screenshot of the memory use when you are using a single GPU with batch size one?

Caplimbo commented 3 years ago

Fine, I will do that after this round of training is over. Maybe in a day or so.

shamanez commented 3 years ago

For the moment just send me one screenshot, with nvidia-smi.

Caplimbo commented 3 years ago

You mean now, with multi-GPU training and batch size 1 on each GPU? Then it looks like this: [screenshot]

shamanez commented 3 years ago

See, the GPU memory is almost at the limit. So I assume that during DDP the master GPU requires a bit more memory, which causes the OOM error. In my lab I have two 11 GB GPUs and sometimes I observe the same thing.

Caplimbo commented 3 years ago

Sure, with PyTorch distributed training such behavior is quite common, but with Ray... I don't know for sure. I will provide more information once I can run single-GPU training again.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.