Closed ioannist closed 3 years ago
I was able to get past this error by using my own RagRetriever instead of RagPyTorchDistributedRetriever inside "transformers/examples/rag/finetune.py".
I am also using my own custom dataset and index (#7763).
The following changes got me past the missing index error. However, I have no idea if this is efficient or if I am doing something that I shouldn't be doing...
if self.is_rag_model:
    if args.prefix is not None:
        config.generator.prefix = args.prefix
    config.label_smoothing = hparams.label_smoothing
    hparams, config.generator = set_extra_model_params(extra_model_params, hparams, config.generator)
    # commented out this line
    # retriever = RagPyTorchDistributedRetriever.from_pretrained(hparams.model_name_or_path, config=config)
    ############### new stuff ###############
    dataset = load_from_disk(args.passages_path)  # to reload the dataset
    dataset.load_faiss_index("embeddings", args.index_path)  # to reload the index
    retriever = RagRetriever.from_pretrained(
        hparams.model_name_or_path, index_name="custom", indexed_dataset=dataset
    )
    ######################################
    model = self.model_class.from_pretrained(hparams.model_name_or_path, config=config, retriever=retriever)
    prefix = config.question_encoder.prefix
Sadly, I won't have time in the next one or two weeks to take a closer look. @lhoestq, maybe this is interesting to you?
Could you paste the full stack trace?
Thank you @lhoestq .
GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
TPU available: False, using: 0 TPU cores
INFO:lightning:TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:lightning:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.
INFO:lightning:Using native 16bit precision.
Validation sanity check: 0it [00:00, ?it/s]Traceback (most recent call last):
File "examples/rag/finetune.py", line 499, in <module>
main(args)
File "examples/rag/finetune.py", line 471, in main
logger=logger,
File "/home/ioannis/Desktop/transformers-lhoestq-2/transformers/examples/lightning_base.py", line 384, in generic_train
trainer.fit(model)
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
results = self.accelerator_backend.train()
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 54, in train
results = self.train_or_test()
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
results = self.trainer.train()
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 462, in train
self.run_sanity_check(self.get_model())
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in run_sanity_check
_, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 568, in run_evaluation
output = self.evaluation_loop.evaluation_step(test_mode, batch, batch_idx, dataloader_idx)
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 171, in evaluation_step
output = self.trainer.accelerator_backend.validation_step(args)
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 76, in validation_step
output = self.__validation_step(args)
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 86, in __validation_step
output = self.trainer.model.validation_step(*args)
File "examples/rag/finetune.py", line 240, in validation_step
return self._generative_step(batch)
File "examples/rag/finetune.py", line 280, in _generative_step
max_length=self.target_lens["val"],
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
return func(*args, **kwargs)
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/transformers/modeling_rag.py", line 873, in generate
return_tensors="pt",
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/transformers/retrieval_rag.py", line 600, in __call__
retrieved_doc_embeds, doc_ids, docs = self.retrieve(question_hidden_states, n_docs)
File "/home/ioannis/Desktop/transformers-lhoestq-2/transformers/examples/rag/distributed_retriever.py", line 115, in retrieve
doc_ids, retrieved_doc_embeds = self._main_retrieve(question_hidden_states, n_docs)
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/transformers/retrieval_rag.py", line 521, in _main_retrieve
ids, vectors = self.index.get_top_docs(question_hidden_states, n_docs)
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/transformers/retrieval_rag.py", line 226, in get_top_docs
_, ids = self.dataset.search_batch("embeddings", question_hidden_states, n_docs)
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/datasets/search.py", line 607, in search_batch
self._check_index_is_initialized(index_name)
File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/datasets/search.py", line 358, in _check_index_is_initialized
f"Index with index_name '{index_name}' not initialized yet. Please make sure that you call `add_faiss_index` or `add_elasticsearch_index` first."
datasets.search.MissingIndex: Index with index_name 'embeddings' not initialized yet. Please make sure that you call `add_faiss_index` or `add_elasticsearch_index` first.
The above code seems to work (it runs out of GPU memory on my local machine, so I am in the process of testing it on a server; will keep you posted).
I noticed that retrieval step 4 in examples/rag/use_own_knowledge_dataset.py takes a few minutes for every question, so I tried passing device=0 to faiss to move it from CPU to GPU. I got this:
Faiss assertion 'blasStatus == CUBLAS_STATUS_SUCCESS' failed in virtual void faiss::gpu::StandardGpuResources::initializeForDevice(int) at gpu/StandardGpuResources.cpp:248
The idea was to speed it up because I don't see how the finetuning can take place with such a slow index, but I might have misunderstood.
Seems like my attempt to replace RagPyTorchDistributedRetriever with RagRetriever (in an 8 GPU machine) fails too. Too good to be true :)
loading file https://s3.amazonaws.com/models.huggingface.co/bert/facebook/rag-sequence-base/question_encoder_tokenizer/tokenizer_config.json from cache at /home/ubuntu/.cache/torch/transformers/8ade9cf561f8c0a47d1c3785e850c57414d776b3795e21bd01e58483399d2de4.11f57497ee659e26f830788489816dbcb678d91ae48c06c50c9dc0e4438ec05b
Model name 'facebook/rag-sequence-base/generator_tokenizer' not found in model shortcut name list (facebook/bart-base, facebook/bart-large, facebook/bart-large-mnli, facebook/bart-large-cnn, facebook/bart-large-xsum, yjernite/bart_eli5). Assuming 'facebook/rag-sequence-base/generator_tokenizer' is a path, a model identifier, or url to a directory containing tokenizer files.
loading file https://s3.amazonaws.com/models.huggingface.co/bert/facebook/rag-sequence-base/generator_tokenizer/vocab.json from cache at /home/ubuntu/.cache/torch/transformers/3b9637b6eab4a48cf2bc596e5992aebb74de6e32c9ee660a27366a63a8020557.6a4061e8fc00057d21d80413635a86fdcf55b6e7594ad9e25257d2f99a02f4be
loading file https://s3.amazonaws.com/models.huggingface.co/bert/facebook/rag-sequence-base/generator_tokenizer/merges.txt from cache at /home/ubuntu/.cache/torch/transformers/b2a6adcb3b8a4c39e056d80a133951b99a56010158602cf85dee775936690c6a.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
loading file https://s3.amazonaws.com/models.huggingface.co/bert/facebook/rag-sequence-base/generator_tokenizer/added_tokens.json from cache at None
loading file https://s3.amazonaws.com/models.huggingface.co/bert/facebook/rag-sequence-base/generator_tokenizer/special_tokens_map.json from cache at /home/ubuntu/.cache/torch/transformers/342599872fb2f45f954699d3c67790c33b574cc552a4b433fedddc97e6a3c58e.6e217123a3ada61145de1f20b1443a1ec9aac93492a4bd1ce6a695935f0fd97a
loading file https://s3.amazonaws.com/models.huggingface.co/bert/facebook/rag-sequence-base/generator_tokenizer/tokenizer_config.json from cache at /home/ubuntu/.cache/torch/transformers/e5f72dc4c0b1ba585d7afb7fa5e3e52ff0e1f101e49572e2caaf38fab070d4d6.d596a549211eb890d3bb341f3a03307b199bc2d5ed81b3451618cbcb04d1f1bc
loading file https://s3.amazonaws.com/models.huggingface.co/bert/facebook/rag-sequence-base/generator_tokenizer/tokenizer.json from cache at None
Using native 16bit precision.
INFO:lightning:Using native 16bit precision.
INFO:__main__:Custom init_ddp_connection.
initializing ddp: GLOBAL_RANK: 7, MEMBER: 8/8
INFO:lightning:initializing ddp: GLOBAL_RANK: 7, MEMBER: 8/8
Traceback (most recent call last):
File "/home/ubuntu/transformers/examples/rag/finetune.py", line 519, in <module>
main(args)
File "/home/ubuntu/transformers/examples/rag/finetune.py", line 491, in main
logger=logger,
File "/home/ubuntu/transformers/examples/lightning_base.py", line 384, in generic_train
trainer.fit(model)
File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1046, in fit
self.accelerator_backend.train(model)
File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 57, in train
self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model)
File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 164, in ddp_train
self.trainer.is_slurm_managing_tasks
File "/home/ubuntu/transformers/examples/rag/finetune.py", line 180, in init_ddp_connection
self.model.retriever.init_retrieval(self.distributed_port)
TypeError: init_retrieval() takes 1 positional argument but 2 were given
Traceback (most recent call last):
File "examples/rag/finetune.py", line 519, in <module>
main(args)
File "examples/rag/finetune.py", line 491, in main
logger=logger,
File "/home/ubuntu/transformers/examples/lightning_base.py", line 384, in generic_train
trainer.fit(model)
File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in fit
results = self.accelerator_backend.spawn_ddp_children(model)
File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 123, in spawn_ddp_children
results = self.ddp_train(local_rank, mp_queue=None, model=model, is_master=True)
File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 164, in ddp_train
self.trainer.is_slurm_managing_tasks
File "examples/rag/finetune.py", line 180, in init_ddp_connection
self.model.retriever.init_retrieval(self.distributed_port)
TypeError: init_retrieval() takes 1 positional argument but 2 were given
There are differences between the regular retriever and the distributed retriever:
Let me know if you figure out a way to make it work in your case
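The TypeError in the tracebacks above is consistent with the two retrievers exposing different init_retrieval signatures: finetune.py passes a port, which only the distributed retriever accepts. Below is a minimal sketch with toy stand-ins. The classes ToyRegularRetriever and ToyDistributedRetriever and the helper init_retrieval_compat are hypothetical names for illustration, not part of transformers.

```python
import inspect


class ToyRegularRetriever:
    """Stand-in for a retriever whose init_retrieval takes no port."""

    def __init__(self):
        self.initialized = False

    def init_retrieval(self):
        self.initialized = True


class ToyDistributedRetriever:
    """Stand-in for a retriever that needs a port to set up its process group."""

    def __init__(self):
        self.initialized = False
        self.port = None

    def init_retrieval(self, distributed_port):
        self.port = distributed_port
        self.initialized = True


def init_retrieval_compat(retriever, distributed_port):
    """Hypothetical helper: pass the port only if the method accepts one."""
    params = inspect.signature(retriever.init_retrieval).parameters
    if "distributed_port" in params:
        retriever.init_retrieval(distributed_port)
    else:
        retriever.init_retrieval()
    return retriever


def reproduces_type_error():
    """Return True if passing a port to the regular stand-in raises TypeError."""
    try:
        ToyRegularRetriever().init_retrieval(12345)
    except TypeError:
        return True
    return False
```

Dropping the port like this only avoids the TypeError. It does not give a regular retriever any of the distributed retriever's behaviour, so it is a way to see the mismatch rather than a fix.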
Will do, though I guess it's easier to go back to trying to make it work with RagPyTorchDistributedRetriever.
Tried adding dataset.load_faiss_index inside get_dataset in finetune.py, but... 'Seq2SeqDataset' object has no attribute 'load_faiss_index'
The Seq2SeqDataset is the one the model is trained on. The knowledge dataset is stored inside the retriever.
The MissingIndex error must come from init_retrieval not being called on the retriever in process 0, or from the index not being properly loaded.
Hi @lhoestq @patrickvonplaten any update on this? I'm also running into this issue when running finetune.sh. Though I am able to get the legacy index to work.
@amogkam I also get the same error when trying to run fine-tuning. I also got an error saying self.opt is not there, but I did solve it.
What do you mean by legacy index?
I'll investigate this error this week. I'll let you know how it goes
@lhoestq I actually did change the initialization in this line (retrieval_rag.py):
self.dataset_name, with_index=True,index_name=exact, split=self.dataset_split, dummy=self.use_dummy_dataset
That's good to know, thanks!
However, for the RagPyTorchDistributedRetriever we need to load the index only on process 0 and keep with_index=False for the other processes. Ideally we have with_index=False in the __init__ and with_index=True in init_index.
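The pattern described here (build the object everywhere without the index, then load the index only on process 0) can be sketched with a toy class. ToyDistributedIndex and simulate are hypothetical names; the "index" is just a dict, and the processes are simulated in a loop rather than spawned.

```python
class ToyDistributedIndex:
    """Toy index holder: every process builds it, only rank 0 loads the index."""

    def __init__(self, passages):
        # Same spirit as with_index=False: keep the passages, load no index yet.
        self.passages = list(passages)
        self.index = None

    def init_index(self, rank):
        # Same spirit as with_index=True, but only on process 0.
        if rank == 0 and self.index is None:
            self.index = {i: p for i, p in enumerate(self.passages)}

    def is_initialized(self):
        return self.index is not None


def simulate(world_size):
    """Build one holder per simulated process and report which ones hold an index."""
    holders = [ToyDistributedIndex(["doc a", "doc b"]) for _ in range(world_size)]
    for rank, holder in enumerate(holders):
        holder.init_index(rank)
    return [holder.is_initialized() for holder in holders]
```

With four simulated processes, only the first holder ends up with an index. The other processes would have to obtain their retrieval results from process 0.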
Oh, got it!
Sorry for spamming. I find it hard to understand index_name and index_path when loading the datasets with faiss.
You can specify index_name if you want to use one of the indexes that come with the dataset (exact/compressed), OR you can use index_path to use your own local index file.
So the index name is like a column, right? It controls whether that column gets loaded into memory or not?
In the RAG configuration you can specify index_name="exact" or index_name="compressed" for the "wiki_dpr" dataset, which indeed comes with those two types of index. For more info you can check the docs of the RagConfig: https://huggingface.co/transformers/model_doc/rag.html#ragconfig
On the other hand, in the datasets library, and in particular in Dataset.add_faiss_index, you can also see an "index_name" parameter. However, this one is different from the one used in the RAG configuration on the transformers side. In the datasets library, each dataset can have several indexes that are identified by their names, and by default their names correspond to the column that was used to build the index. See the docs of the add_faiss_index method: https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.add_faiss_index
This is unfortunately the same variable name but not for the same purpose... Does that make sense to you ?
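To make the datasets-side meaning concrete, here is a toy model of "indexes identified by name, defaulting to the column name". ToyIndexedDataset and ToyMissingIndex are hypothetical stand-ins written for this illustration; they only mimic the naming behaviour described above and are not the datasets implementation.

```python
class ToyMissingIndex(Exception):
    """Stand-in for the MissingIndex error seen in the traceback."""


class ToyIndexedDataset:
    """Toy dataset that keeps its indexes in a dict keyed by index name."""

    def __init__(self, columns):
        self.columns = columns
        self.indexes = {}

    def add_index(self, column, index_name=None):
        # When no name is given, the index is registered under the column name.
        name = index_name if index_name is not None else column
        self.indexes[name] = list(self.columns[column])

    def search(self, index_name):
        # A lookup by name fails if nothing was registered under that name.
        if index_name not in self.indexes:
            raise ToyMissingIndex(
                f"Index with index_name '{index_name}' not initialized yet."
            )
        return self.indexes[index_name]


ds = ToyIndexedDataset({"embeddings": [[0.1, 0.2], [0.3, 0.4]]})
ds.add_index("embeddings")                         # registered as "embeddings"
ds.add_index("embeddings", index_name="my_index")  # same column, second name
registered = sorted(ds.indexes)
```

In this toy, searching for "embeddings" or "my_index" succeeds, while searching for a RAG-style name such as "exact" raises ToyMissingIndex, because on the dataset side a name only identifies an index that was registered under it.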
Thanks a lot. I got the idea.
@lhoestq By the way, I tried running the RAG fine-tuning script with an older PyTorch Lightning version (0.9) and it worked. I think the issue comes from a version mismatch.
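A quick way to check which PyTorch Lightning release is installed before running the script is sketched below. The helper names release_tuple, in_series and installed_version are hypothetical, and importlib.metadata needs Python 3.8 or newer, which is an assumption here since the environments in this thread use Python 3.6 and 3.7.

```python
from importlib import metadata


def release_tuple(version_string):
    """Turn '0.9.0' or '1.0.4rc1' into a tuple of leading integers."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def in_series(version_string, series):
    """True if the version starts with the given (major, minor) series."""
    return release_tuple(version_string)[: len(series)] == tuple(series)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

For example, in_series(installed_version("pytorch-lightning") or "", (0, 9)) tells you whether the 0.9 series reported as working above is the one in your environment.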
I managed to reproduce the issue, I'm working on a fix
Perfect.
Update: it looks like it's because pytorch lightning removed the init_ddp_connection hook of their LightningModule. The hook was used to initialize the index on process 0. I'll use something else to initialize the index.
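The failure mode described in this update can be shown with a toy framework: an overridden hook does nothing once the framework stops calling it, so the index is never initialized. All class names below are hypothetical stand-ins and do not reproduce PyTorch Lightning's actual classes.

```python
class OldTrainerBase:
    """Toy framework version that still calls the hook before training."""

    def init_ddp_connection(self):
        pass

    def fit(self):
        self.init_ddp_connection()
        return self.train()


class NewTrainerBase:
    """Toy framework version where fit() no longer calls the hook."""

    def fit(self):
        return self.train()


class IndexMixin:
    """User code that relies on the hook to load the index."""

    index_ready = False

    def init_ddp_connection(self):
        self.index_ready = True

    def train(self):
        if not self.index_ready:
            raise RuntimeError("index not initialized")
        return "trained"


class ModuleOnOld(IndexMixin, OldTrainerBase):
    pass


class ModuleOnNew(IndexMixin, NewTrainerBase):
    pass


def outcome(module):
    """Run fit() and return either its result or the error message."""
    try:
        return module.fit()
    except RuntimeError as err:
        return str(err)
```

Here outcome(ModuleOnOld()) returns "trained", while outcome(ModuleOnNew()) returns "index not initialized": the override is still defined but is silently never called, which matches the script working with the older release and failing with a missing index on the newer one.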
Ok, that's why the code still works with PL 0.9.
So now the problem is the initialization of the index in this line ?
Thanks a lot.
@lhoestq any update with this, please?
p.s sorry for spamming :)
Yes I'm working on a fix ! I'll make a PR tomorrow
Thanks a lot. :)
Environment info
transformers version: 3.3.1
Platform: Ubuntu
Python version: 3.6.12
PyTorch version (GPU: yes): 1.6.0
Using GPU in script?: 1 gpu
Using distributed or parallel set-up in script?: no
Who can help
@patrickvonplaten @sgugger
Information
model name: facebook/rag-sequence-base
The problem arises when using the official example scripts: (give details below)
The task I am working on is my own task or dataset: (give details below)
To reproduce
1) Make a directory at examples/rag/ioannis-data and add the train, eval and test files to it.
2) Run transformers/examples/rag/finetune.sh with the following changes:
    --data_dir examples/rag/ioannis-data \
    --output_dir examples/rag/ioannis-output \
    --model_name_or_path facebook/rag-sequence-base
The script terminates with the following error:
datasets.search.MissingIndex: Index with index_name 'embeddings' not initialized yet. Please make sure that you call 'add_faiss_index' or 'add_elasticsearch_index' first.