Closed · Caplimbo closed this issue 3 years ago
Hi! Is there an error message? Is there any CPU activity when it gets stuck (maybe the index is just being loaded)?
I'm not getting any error message, and yes, there is CPU activity. But given that my dataset is quite small (see test.csv), would it really take that much time (>5 min) to load? Also, where would the index be loaded? I'm confused by the instruction to use faiss-cpu. Would the GPU version be better?
I'll provide some screenshots of this issue later.
I notice that single-GPU training has no such step as initializing a retriever, and if I use Ray, I get stuck here:
At this point, nvidia-smi shows:
while top shows:
I'm not sure how to illustrate this issue further. Is it because my CPU gets overloaded?
For such tasks, CPU usage often means that FAISS (the indexing library) is doing something. Did you try interrupting the program to see if the stack trace could help us locate the line where the code is stuck?
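By the way, if Ctrl+C doesn't give a useful trace, Python's built-in faulthandler module can dump the stack of a running process on a signal without killing it. A minimal sketch (you would have to add this to the finetuning script yourself; it is not part of finetune_rag.py):

```python
import faulthandler
import signal

# Dump the Python stack of every thread when the process receives SIGUSR1,
# without terminating it. Trigger it from another shell with: kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Or write the dump to a file instead of stderr (keep the file object alive):
# dump_file = open("stack_dump.log", "w")
# faulthandler.register(signal.SIGUSR1, file=dump_file, all_threads=True)
```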
At the beginning of this hang I can still use Ctrl+C to interrupt, which gives the following:
^CTraceback (most recent call last):
File "examples/research_projects/rag/finetune_rag.py", line 625, in <module>
main(args)
File "examples/research_projects/rag/finetune_rag.py", line 597, in main
profiler=pl.profiler.AdvancedProfiler() if args.profile else None,
File "/root/transformers-master/examples/research_projects/rag/lightning_base.py", line 389, in generic_train
trainer.fit(model)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
results = self.accelerator_backend.train()
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 138, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 249, in ddp_train
self.model_to_device(model, process_idx)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 178, in model_to_device
model.cuda(self.trainer.root_gpu)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/utilities/device_dtype_mixin.py", line 124, in cuda
return super().cuda(device=device)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 491, in cuda
return self._apply(lambda t: t.cuda(device))
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
module._apply(fn)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
module._apply(fn)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 409, in _apply
param_applied = fn(param)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 491, in <lambda>
return self._apply(lambda t: t.cuda(device))
KeyboardInterrupt
Traceback (most recent call last):
File "/root/transformers-master/examples/research_projects/rag/finetune_rag.py", line 625, in <module>
main(args)
File "/root/transformers-master/examples/research_projects/rag/finetune_rag.py", line 597, in main
profiler=pl.profiler.AdvancedProfiler() if args.profile else None,
File "/root/transformers-master/examples/research_projects/rag/lightning_base.py", line 389, in generic_train
trainer.fit(model)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
results = self.accelerator_backend.train()
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 138, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 249, in ddp_train
self.model_to_device(model, process_idx)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 178, in model_to_device
model.cuda(self.trainer.root_gpu)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/pytorch_lightning/utilities/device_dtype_mixin.py", line 124, in cuda
return super().cuda(device=device)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 491, in cuda
return self._apply(lambda t: t.cuda(device))
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
module._apply(fn)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
module._apply(fn)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 409, in _apply
param_applied = fn(param)
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/module.py", line 491, in <lambda>
return self._apply(lambda t: t.cuda(device))
When checking top, I see that among the red-marked processes (2635, 2636), 2636 stays there all the time while 2635 comes and goes.
After a while I can no longer interrupt it with Ctrl+C, and when I try to kill the process I get:
2021-05-06 00:28:50,962 WARNING worker.py:1115 -- The autoscaler failed with the following error:
Terminated with signal 15
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 376, in <module>
monitor.run()
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 284, in run
self._run()
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 202, in _run
time.sleep(AUTOSCALER_UPDATE_INTERVAL_S)
2021-05-06 00:28:50,962 WARNING worker.py:1115 -- The autoscaler failed with the following error:
Terminated with signal 15
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 376, in <module>
monitor.run()
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 284, in run
self._run()
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 202, in _run
time.sleep(AUTOSCALER_UPDATE_INTERVAL_S)
2021-05-06 00:28:50,967 WARNING worker.py:1115 -- The autoscaler failed with the following error:
Terminated with signal 15
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 376, in <module>
monitor.run()
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 284, in run
self._run()
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/ray/_private/monitor.py", line 202, in _run
time.sleep(AUTOSCALER_UPDATE_INTERVAL_S)
This line, self.model_to_device(model, process_idx), says that the model is being loaded onto the GPU. I'm not sure why it would make the script hang, though.
Pinging @shamanez who has been playing with this script for #10410
If it really is because of the model loading to GPU then you may need to wait a bit more (though I'm surprised it could take that much time).
As far as I'm concerned, these lines don't mean anything; they might just result from my interrupting it too early. The later ones in monitor.py seem more likely to be the reason it gets stuck.
By the way, when I use Ray I run finetune_rag_ray.sh from the same directory, with gpus=4 and num_retrievers=2. Using the PyTorch distributed retriever leads to similar problems (torch always has only one retriever, and I still get stuck with Ray even when I set num_retrievers=1), while gpus=1 finetunes with no issues.
@lhoestq @Caplimbo
So what you are saying is that it can't even start the training loop, right...
Can you please let me know the size of your passage set and FAISS index? Also the RAM.
I'm using a passage set of about 30M, and so is the index, which is built from the test.csv I attached in the issue (see the first comment). RAM is 128G, so I guess it's not a RAM issue... Have you ever tested multi-GPU training with a customized index?
Yeah, it worked for me. 30 million passages, right?
Nope, just a size of 30MB. I attached the file test.csv in my original comment, and I use use_own_knowledge_dataset.py to process it. By the way, I had to remove the git-related code since I have a poor connection to GitHub; could this cause potential problems?
Or could you please share your environment settings? I am using datasets==1.0.1 and pyarrow==0.17.1, since higher versions report errors when using Ray, which seems to be the same issue mentioned here: https://discuss.ray.io/t/cant-pickle-pyarrow-dataset-expression/1685/7. Anyway, training on a single GPU works smoothly for me, so I don't know what the problem might be.
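For reference, the processing I apply to test.csv is roughly the following (a simplified sketch of what use_own_knowledge_dataset.py does as I understand it; the encoder checkpoint, delimiter, and output paths here are just illustrative and may differ from the script's defaults):

```python
import faiss
import torch
from datasets import load_dataset
from transformers import DPRContextEncoder, DPRContextEncoderTokenizerFast

# Load the (title, text) passages from the csv
dataset = load_dataset(
    "csv", data_files=["test.csv"], split="train",
    delimiter="\t", column_names=["title", "text"],
)

# Embed every passage with a DPR context encoder
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
ctx_tokenizer = DPRContextEncoderTokenizerFast.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")

def embed(batch):
    inputs = ctx_tokenizer(
        batch["title"], batch["text"], truncation=True, padding="longest", return_tensors="pt"
    )
    with torch.no_grad():
        embeddings = ctx_encoder(**inputs).pooler_output
    return {"embeddings": embeddings.numpy()}

dataset = dataset.map(embed, batched=True, batch_size=16)
dataset.save_to_disk("my_knowledge_dataset")

# Build an HNSW index over the embeddings column and save it next to the dataset
index = faiss.IndexHNSWFlat(768, 128, faiss.METRIC_INNER_PRODUCT)
dataset.add_faiss_index("embeddings", custom_index=index)
dataset.get_index("embeddings").save("my_knowledge_dataset_hnsw_index.faiss")
```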
Wow, super weird! I will check it out.
Don't know if it might help, but when using native PyTorch distributed, this is what I get before it hangs, checking with nvidia-smi. These PIDs cannot be killed, and I don't know why.
By the way, I also run into similar problems (training cannot proceed) when using BART with a Trainer. With torch 1.8.1 and CUDA 10.2 I get no output and it hangs, but if I switch to CUDA 11.1, one extra line appears before it hangs:
PyTorch version 1.8.1+cu111 available.
begin loading model ...
end loading model!
begin trainning..
0%| | 0/1000 [00:00<?, ?it/s]/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
And in that case, with CUDA 11.1, finetuning RAG with the PyTorch distributed retriever produces some extra lines of output:
INFO:lightning:initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 3
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 0
in ddp connection init port -1
INFO:distributed_pytorch_retriever:initializing retrieval
INFO:distributed_pytorch_retriever:dist initialized
in ddp connection init port -1
in ddp connection init port -1
INFO:distributed_pytorch_retriever:initializing retrieval
INFO:distributed_pytorch_retriever:initializing retrieval
INFO:distributed_pytorch_retriever:dist initialized
INFO:distributed_pytorch_retriever:dist initialized
in ddp connection init port -1
INFO:distributed_pytorch_retriever:initializing retrieval
INFO:distributed_pytorch_retriever:dist initialized
INFO:root:Added key: store_based_barrier_key:2 to store for rank: 0
INFO:root:Added key: store_based_barrier_key:2 to store for rank: 1
INFO:root:Added key: store_based_barrier_key:2 to store for rank: 3
INFO:root:Added key: store_based_barrier_key:2 to store for rank: 2
INFO:distributed_pytorch_retriever:dist not initialized / main
Loading index from ../covid_QA/try/my_knowledge_dataset_hnsw_index.faiss
Loaded FaissIndex embeddings from ../covid_QA/try/my_knowledge_dataset_hnsw_index.faiss
in ddp end init port -1
in ddp end init port -1
in ddp end init port -1
in ddp end init port -1
But it still hangs afterwards. I added a few extra print lines to see what happens, but so far I haven't gotten any results from them.
Could you try changing the --distributed-port, just in case?
Will do later, but what values should I test?
Any value between 4000 and 40000, just to make sure it's not an issue with the default value -1.
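If it helps, here is a generic way to grab a free port from Python (just a stdlib sketch; you would then pass the value via --distributed-port):

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

print(find_free_port())
```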
No difference, it still hangs at the same step.
@lhoestq I switched the CUDA version and can now train on multiple GPUs; there seems to be an issue with Tesla T4 and CUDA 10.2.
However, I made a small modification to RAG, and now I can only train it with one GPU; with more GPUs I get an OOM error. Any possible reasons for this?
What kind of change did you make?
I added one additional input to the model and pass it to the retriever to modify the retrieval process. It is just one scalar tensor per sample, so I don't think it would cause a problem by itself...
So other than the question, you add another tensor, right? Did you also change what you get out of the retriever? And did you also modify the input to the answer generator?
The change to the retriever output is minor. For example, suppose one retrieved document has the text "This is example text" before encoding; what I did is prepend something to it, turning it into something like " This is example text", and then encode that to get context_input_ids. After this operation, nothing more was done to the generator input.
By the way, the OOM issue didn't happen at the beginning of an epoch; it appears after 10-20 steps. ~When using the original RAG I didn't see such increasing memory demands, so does this mean something is wrong with my own implementation?~ It's the same even if I use the original version of RAG.
As far as I can see, using Ray with batch size 1 on each GPU occupies almost the same amount of GPU memory as training with batch size 2 on a single GPU. I suspect the parallel implementation also loads part of the retrieval process onto the GPU?
And by the way, since I have to do some pretraining of the generator part, I separately trained (or tuned) a BART from its pretrained weights on Hugging Face and then plugged it into RAG via RagSequenceForGeneration.from_pretrained_question_encoder_generator. I'm a little worried that the RAG model weights provided on Hugging Face were trained with a different setting than the original BART weights, and maybe that's why I cannot get the loss to go down the way the original RAG does. If that's the case, should I wait until it converges (like, in 100 epochs?), or should I first separate the generator part from RAG and do my pretraining (tuning) on it? And how can I save the generator part separately?
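For concreteness, what I currently do is roughly this (just a sketch; the paths are placeholders, and model.rag.generator is my assumption about where the generator submodule lives):

```python
from transformers import RagSequenceForGeneration

# Plug my separately tuned BART into RAG (paths/checkpoints are placeholders)
model = RagSequenceForGeneration.from_pretrained_question_encoder_generator(
    "facebook/dpr-question_encoder-single-nq-base",  # question encoder
    "./my_tuned_bart",                               # my pretrained/tuned generator
)

# ... finetune RAG ...

# Afterwards, save only the generator part again
model.rag.generator.save_pretrained("./rag_generator_only")
```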
This might be because your inputs are a bit longer, and GPU memory allocation changes with the length of your inputs during training. So sometimes, after a few steps, you can get an OOM error. That is also essentially the answer to your second issue.
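If you want to rule that out, one thing to try is padding/truncating every batch to a fixed shape so memory use can't jump when a longer example shows up a few steps in. A rough sketch with the RAG tokenizer (I'm going from memory on the exact call; this is not the actual finetuning code):

```python
from transformers import RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-base")

# Fixed-shape batches: every sample is padded/truncated to the same lengths,
# so GPU memory use stays roughly constant across steps.
batch = tokenizer.prepare_seq2seq_batch(
    src_texts=["what does a retrieval augmented model do?"],
    tgt_texts=["it conditions generation on retrieved passages"],
    max_length=128,         # source length
    max_target_length=25,   # target length
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
print({k: tuple(v.shape) for k, v in batch.items()})
```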
No, when using Ray nothing gets loaded onto the GPU. You can see it with the top command: if your index is around 20 GB, you will find the retriever workers occupying that amount of CPU memory.
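For example, something like this lists the resident memory of the Ray worker processes (a generic psutil sketch, assuming psutil is installed; it just filters processes whose command line mentions "ray"):

```python
import psutil

# Print RSS of processes whose command line mentions "ray",
# e.g. the Ray retriever actors holding the FAISS index in CPU RAM.
for proc in psutil.process_iter(["pid", "name", "cmdline", "memory_info"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "ray" in cmdline and proc.info["memory_info"]:
        rss_gb = proc.info["memory_info"].rss / 1024 ** 3
        print(proc.info["pid"], proc.info["name"], f"{rss_gb:.1f} GB")
```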
But it's weird that with only 1 GPU I can use a batch size of 2 with no OOM, while using 2 GPUs leads to OOM even at batch size 1 per GPU. The batch-size-1 OOM only occurs with my modified RAG, but with the original RAG I still get OOM at batch size 2 per device when using multiple GPUs.
I think this is due to low GPU memory. Try using two 32GB GPUs. I assume the initialization of the DDP process consumes a bit more memory.
@Caplimbo when you run with a single GPU can you send me the memory usage?
@shamanez Sadly I don't have such powerful GPUs. With a single GPU, the memory usage at batch size 2 was around 15026 MB (as far as I can recall; I can't check right now since I'm using all GPUs for training), while with 4 GPUs I managed to train at batch size 1 per GPU with about 15072 MB per GPU. I had to reduce max_target_length from 25 to 24, otherwise OOM.
Yeah, makes sense.
Really? I don't see why batch size 1 per GPU in a multi-GPU setting would require more memory than batch size 2 on a single GPU...
Can you send me a screenshot of the memory use when you are using a single GPU with batch size one?
Fine, will do after this round of training is over. Maybe in a day orz.
For the moment, just send me one screenshot with nvidia-smi.
You mean now, with multi-GPU training and batch size 1 on each GPU? Then it's like this:
See, the GPU memory is almost up to the limit, so I assume that during DDP the master GPU requires a bit more memory, which causes an OOM error. In my lab I have two 11GB GPUs and I sometimes observe the same thing.
Sure, when using PyTorch for distributed training such behavior is quite usual, but when using Ray... I don't know for sure. I will provide you with more information once I can run single-GPU training again.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Environment info
transformers version: 4.5.1
Who can help: @patrickvonplaten, @lhoestq
Information
Model I am using: RAG-sequence-base
The problem arises when using: the official example scripts (examples/research_projects/rag/finetune_rag.py)
The task I am working on is: my own task or dataset. A simple test input CSV (test.csv), as mentioned in the README of the RAG finetuning example, is used to create my own knowledge dataset.
To reproduce
Steps to reproduce the behavior:
Configs I use: 4 Tesla T4s as my GPUs, with faiss-cpu==1.6.3, datasets==1.0.1, and pyarrow==0.17.1. Switching to Ray doesn't solve the problem either.
Expected behavior
Finish loading the index and proceed with training.