huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RAG - MissingIndex: Index with index_name 'embeddings' not initialized yet #7816

Closed ioannist closed 3 years ago

ioannist commented 4 years ago

Environment info

- transformers version: 3.3.1
- Platform: Ubuntu
- Python version: 3.6.12
- PyTorch version (GPU: yes): 1.6.0
- Using GPU in script?: yes, 1 GPU
- Using distributed or parallel set-up in script?: no

Who can help

@patrickvonplaten @sgugger

Information

model name: facebook/rag-sequence-base

The problem arises when using the official example scripts (details below). The task I am working on uses my own dataset (details below).

To reproduce

1) Make a directory at examples/rag/ioannis-data and add the train, eval, and test files to it.
2) Run transformers/examples/rag/finetune.sh with the following changes:

       --data_dir examples/rag/ioannis-data \
       --output_dir examples/rag/ioannis-output \
       --model_name_or_path facebook/rag-sequence-base

The script terminates with the following error:

datasets.search.MissingIndex: Index with index_name 'embeddings' not initialized yet. Please make sure that you call 'add_faiss_index' or 'add_elasticsearch_index' first.

patrickvonplaten commented 4 years ago

I won't have time to take a closer look in the next 1-2 weeks, sadly. Maybe @lhoestq this is interesting to you.

lhoestq commented 4 years ago

Could you paste the full stack trace?

ioannist commented 4 years ago

Thank you @lhoestq .

GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
TPU available: False, using: 0 TPU cores
INFO:lightning:TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:lightning:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.
INFO:lightning:Using native 16bit precision.
Validation sanity check: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "examples/rag/finetune.py", line 499, in <module>
    main(args)
  File "examples/rag/finetune.py", line 471, in main
    logger=logger,
  File "/home/ioannis/Desktop/transformers-lhoestq-2/transformers/examples/lightning_base.py", line 384, in generic_train
    trainer.fit(model)
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 54, in train
    results = self.train_or_test()
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
    results = self.trainer.train()
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 462, in train
    self.run_sanity_check(self.get_model())
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in run_sanity_check
    _, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 568, in run_evaluation
    output = self.evaluation_loop.evaluation_step(test_mode, batch, batch_idx, dataloader_idx)
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 171, in evaluation_step
    output = self.trainer.accelerator_backend.validation_step(args)
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 76, in validation_step
    output = self.__validation_step(args)
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 86, in __validation_step
    output = self.trainer.model.validation_step(*args)
  File "examples/rag/finetune.py", line 240, in validation_step
    return self._generative_step(batch)
  File "examples/rag/finetune.py", line 280, in _generative_step
    max_length=self.target_lens["val"],
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/transformers/modeling_rag.py", line 873, in generate
    return_tensors="pt",
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/transformers/retrieval_rag.py", line 600, in __call__
    retrieved_doc_embeds, doc_ids, docs = self.retrieve(question_hidden_states, n_docs)
  File "/home/ioannis/Desktop/transformers-lhoestq-2/transformers/examples/rag/distributed_retriever.py", line 115, in retrieve
    doc_ids, retrieved_doc_embeds = self._main_retrieve(question_hidden_states, n_docs)
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/transformers/retrieval_rag.py", line 521, in _main_retrieve
    ids, vectors = self.index.get_top_docs(question_hidden_states, n_docs)
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/transformers/retrieval_rag.py", line 226, in get_top_docs
    _, ids = self.dataset.search_batch("embeddings", question_hidden_states, n_docs)
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/datasets/search.py", line 607, in search_batch
    self._check_index_is_initialized(index_name)
  File "/home/ioannis/anaconda3/envs/transformers-custom-dataset-in-rag-retriever/lib/python3.7/site-packages/datasets/search.py", line 358, in _check_index_is_initialized
    f"Index with index_name '{index_name}' not initialized yet. Please make sure that you call `add_faiss_index` or `add_elasticsearch_index` first."
datasets.search.MissingIndex: Index with index_name 'embeddings' not initialized yet. Please make sure that you call `add_faiss_index` or `add_elasticsearch_index` first.

ioannist commented 4 years ago

I was able to get past this error by using my own RagRetriever instead of RagPyTorchDistributedRetriever inside transformers/examples/rag/finetune.py.

I am also using my own custom dataset and index #7763

The following changes got me past the missing index error. However, I have no idea if this is efficient or if I am doing something that I shouldn't be doing...

# Needs at the top of finetune.py:
#   from datasets import load_from_disk
#   from transformers import RagRetriever
if self.is_rag_model:
    if args.prefix is not None:
        config.generator.prefix = args.prefix
    config.label_smoothing = hparams.label_smoothing
    hparams, config.generator = set_extra_model_params(extra_model_params, hparams, config.generator)

    # commented out this line
    # retriever = RagPyTorchDistributedRetriever.from_pretrained(hparams.model_name_or_path, config=config)

    ############### new stuff ###############
    dataset = load_from_disk(args.passages_path)  # to reload the dataset
    dataset.load_faiss_index("embeddings", args.index_path)  # to reload the index
    retriever = RagRetriever.from_pretrained(
        hparams.model_name_or_path, index_name="custom", indexed_dataset=dataset
    )
    ######################################

    model = self.model_class.from_pretrained(hparams.model_name_or_path, config=config, retriever=retriever)
    prefix = config.question_encoder.prefix

The above code seems to work (it runs out of GPU memory on my local machine, so I am in the process of testing it on a server - will keep you posted).

I noticed that the retrieval (step 4 in examples/rag/use_own_knowledge_dataset.py) takes a few minutes for every question, so I tried passing device=0 to faiss to move the index from CPU to GPU. I got this:

Faiss assertion 'blasStatus == CUBLAS_STATUS_SUCCESS' failed in virtual void faiss::gpu::StandardGpuResources::initializeForDevice(int) at gpu/StandardGpuResources.cpp:248

The idea was to speed it up because I don't see how the finetuning can take place with such a slow index, but I might have misunderstood.
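
For what it's worth, a minimal sketch of what moving the index to a GPU looks like (the paths are placeholders; the device=0 route in datasets and the raw faiss calls in the comments are equivalent):

    from datasets import load_from_disk

    # Placeholder paths -- reuse whatever dataset/index was saved by the
    # use_own_knowledge_dataset.py step
    dataset = load_from_disk("path/to/my_knowledge_dataset")

    # datasets can place a saved faiss index on GPU 0 for you:
    dataset.load_faiss_index("embeddings", "path/to/my_index.faiss", device=0)

    # Under the hood this boils down to raw faiss calls like:
    #   res = faiss.StandardGpuResources()  # initializes CUDA/cuBLAS -- the
    #                                       # assertion above fails right here
    #   gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)
    # Note that some index types (e.g. the HNSW index built by the example
    # script) have no GPU implementation in faiss and cannot be moved this way.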

ioannist commented 4 years ago

Seems like my attempt to replace RagPyTorchDistributedRetriever with RagRetriever (on an 8-GPU machine) fails too. Too good to be true :)

loading file https://s3.amazonaws.com/models.huggingface.co/bert/facebook/rag-sequence-base/question_encoder_tokenizer/tokenizer_config.json from cache at /home/ubuntu/.cache/torch/transformers/8ade9cf561f8c0a47d1c3785e850c57414d776b3795e21bd01e58483399d2de4.11f57497ee659e26f830788489816dbcb678d91ae48c06c50c9dc0e4438ec05b
Model name 'facebook/rag-sequence-base/generator_tokenizer' not found in model shortcut name list (facebook/bart-base, facebook/bart-large, facebook/bart-large-mnli, facebook/bart-large-cnn, facebook/bart-large-xsum, yjernite/bart_eli5). Assuming 'facebook/rag-sequence-base/generator_tokenizer' is a path, a model identifier, or url to a directory containing tokenizer files.
loading file https://s3.amazonaws.com/models.huggingface.co/bert/facebook/rag-sequence-base/generator_tokenizer/vocab.json from cache at /home/ubuntu/.cache/torch/transformers/3b9637b6eab4a48cf2bc596e5992aebb74de6e32c9ee660a27366a63a8020557.6a4061e8fc00057d21d80413635a86fdcf55b6e7594ad9e25257d2f99a02f4be
loading file https://s3.amazonaws.com/models.huggingface.co/bert/facebook/rag-sequence-base/generator_tokenizer/merges.txt from cache at /home/ubuntu/.cache/torch/transformers/b2a6adcb3b8a4c39e056d80a133951b99a56010158602cf85dee775936690c6a.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
loading file https://s3.amazonaws.com/models.huggingface.co/bert/facebook/rag-sequence-base/generator_tokenizer/added_tokens.json from cache at None
loading file https://s3.amazonaws.com/models.huggingface.co/bert/facebook/rag-sequence-base/generator_tokenizer/special_tokens_map.json from cache at /home/ubuntu/.cache/torch/transformers/342599872fb2f45f954699d3c67790c33b574cc552a4b433fedddc97e6a3c58e.6e217123a3ada61145de1f20b1443a1ec9aac93492a4bd1ce6a695935f0fd97a
loading file https://s3.amazonaws.com/models.huggingface.co/bert/facebook/rag-sequence-base/generator_tokenizer/tokenizer_config.json from cache at /home/ubuntu/.cache/torch/transformers/e5f72dc4c0b1ba585d7afb7fa5e3e52ff0e1f101e49572e2caaf38fab070d4d6.d596a549211eb890d3bb341f3a03307b199bc2d5ed81b3451618cbcb04d1f1bc
loading file https://s3.amazonaws.com/models.huggingface.co/bert/facebook/rag-sequence-base/generator_tokenizer/tokenizer.json from cache at None
Using native 16bit precision.
INFO:lightning:Using native 16bit precision.
INFO:__main__:Custom init_ddp_connection.
initializing ddp: GLOBAL_RANK: 7, MEMBER: 8/8
INFO:lightning:initializing ddp: GLOBAL_RANK: 7, MEMBER: 8/8
Traceback (most recent call last):
  File "/home/ubuntu/transformers/examples/rag/finetune.py", line 519, in <module>
    main(args)
  File "/home/ubuntu/transformers/examples/rag/finetune.py", line 491, in main
    logger=logger,
  File "/home/ubuntu/transformers/examples/lightning_base.py", line 384, in generic_train
    trainer.fit(model)
  File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1046, in fit
    self.accelerator_backend.train(model)
  File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 57, in train
    self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model)
  File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 164, in ddp_train
    self.trainer.is_slurm_managing_tasks
  File "/home/ubuntu/transformers/examples/rag/finetune.py", line 180, in init_ddp_connection
    self.model.retriever.init_retrieval(self.distributed_port)
TypeError: init_retrieval() takes 1 positional argument but 2 were given
Traceback (most recent call last):
  File "examples/rag/finetune.py", line 519, in <module>
    main(args)
  File "examples/rag/finetune.py", line 491, in main
    logger=logger,
  File "/home/ubuntu/transformers/examples/lightning_base.py", line 384, in generic_train
    trainer.fit(model)
  File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in fit
    results = self.accelerator_backend.spawn_ddp_children(model)
  File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 123, in spawn_ddp_children
    results = self.ddp_train(local_rank, mp_queue=None, model=model, is_master=True)
  File "/home/ubuntu/anaconda3/envs/tran/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 164, in ddp_train
    self.trainer.is_slurm_managing_tasks
  File "examples/rag/finetune.py", line 180, in init_ddp_connection
    self.model.retriever.init_retrieval(self.distributed_port)
TypeError: init_retrieval() takes 1 positional argument but 2 were given

lhoestq commented 4 years ago

There are differences between the regular retriever and the distributed retriever: in particular, the distributed retriever's init_retrieval takes the distributed port (which is what the TypeError above is complaining about), and it is designed to load the index only on process 0.

Let me know if you figure out a way to make it work in your case.
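
To make one such difference concrete, here is a rough sketch of the two init_retrieval signatures, inferred from the TypeError above rather than copied from the source:

    # Sketch only -- inferred from the TypeError above.
    class RagRetriever:
        def init_retrieval(self):
            # regular retriever: no arguments, loads the index in-process
            ...

    class RagPyTorchDistributedRetriever(RagRetriever):
        def init_retrieval(self, distributed_port):
            # distributed retriever: needs the port; loads the index on the
            # main process and serves the other training processes
            ...

    # finetune.py always calls retriever.init_retrieval(self.distributed_port),
    # hence "init_retrieval() takes 1 positional argument but 2 were given"
    # when the plain RagRetriever is swapped in.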

ioannist commented 4 years ago

Will do, though I guess it's easier to go back to trying to make it work with RagPyTorchDistributedRetriever.

Tried adding dataset.load_faiss_index inside get_dataset in finetune.py, but... 'Seq2SeqDataset' object has no attribute 'load_faiss_index'

lhoestq commented 4 years ago

The Seq2SeqDataset is the one the model is trained on; the knowledge dataset is stored inside the retriever. The MissingIndex error must come from init_retrieval not being called on the retriever in process 0, or from the index not being loaded properly.
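
A quick way to rule out the second possibility (a sketch with placeholder paths, using the datasets API):

    from datasets import load_from_disk

    # Placeholder paths for the knowledge dataset and its saved faiss index
    dataset = load_from_disk("path/to/my_knowledge_dataset")
    dataset.load_faiss_index("embeddings", "path/to/my_index.faiss")

    # If either of these fails, the index itself is the problem rather than
    # the process-0 initialization in the distributed retriever
    print(dataset.list_indexes())                      # should include "embeddings"
    print(dataset.is_index_initialized("embeddings"))  # should print True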

amogkam commented 4 years ago

Hi @lhoestq @patrickvonplaten, any update on this? I'm also running into this issue when running finetune.sh, though I am able to get the legacy index to work.

shamanez commented 4 years ago

@amogkam I also get the same error when trying to run fine-tuning. I also got an error saying self.opt does not exist, but I solved that one.

What do you mean by legacy index?

lhoestq commented 4 years ago

I'll investigate this error this week. I'll let you know how it goes

shamanez commented 4 years ago

@lhoestq

I actually did change the initialization in this line of retrieval_rag.py:

    self.dataset_name, with_index=True, index_name="exact", split=self.dataset_split, dummy=self.use_dummy_dataset

lhoestq commented 4 years ago

That's good to know, thanks! However, for the RagPyTorchDistributedRetriever we need to load the index only on process 0 and keep with_index=False for the other processes. Ideally we would have with_index=False in __init__ and with_index=True in init_index.
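
Roughly, the deferred loading would look like this (a sketch only, assuming the wiki_dpr builder kwargs from the line pasted above; the real index class in retrieval_rag.py is more involved):

    from datasets import load_dataset

    class DeferredHFIndexSketch:
        def __init__(self, dataset_name, dataset_split, index_name, use_dummy_dataset):
            self.dataset_name = dataset_name
            self.dataset_split = dataset_split
            self.index_name = index_name
            self.use_dummy_dataset = use_dummy_dataset
            # every process loads the passages, but no faiss index yet
            self.dataset = load_dataset(
                dataset_name, with_index=False, split=dataset_split, dummy=use_dummy_dataset
            )

        def init_index(self):
            # called on process 0 only: reload with the faiss index attached
            self.dataset = load_dataset(
                self.dataset_name,
                with_index=True,
                index_name=self.index_name,
                split=self.dataset_split,
                dummy=self.use_dummy_dataset,
            )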

shamanez commented 4 years ago

Oh, got it!

shamanez commented 4 years ago

Sorry for spamming. I find it hard to understand index_name and index_path when loading the datasets with faiss.

lhoestq commented 4 years ago

You can specify index_name if you want to use one of the indexes that come with the dataset (exact/compressed), OR you can use index_path to use your own local index file.
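
In code, the two options look roughly like this (the model name and paths are placeholders):

    from transformers import RagRetriever

    # Option 1: one of the indexes that ships with wiki_dpr
    retriever = RagRetriever.from_pretrained(
        "facebook/rag-sequence-base", index_name="exact"  # or "compressed"
    )

    # Option 2: your own passages and local faiss index file, as in
    # examples/rag/use_own_knowledge_dataset.py
    retriever = RagRetriever.from_pretrained(
        "facebook/rag-sequence-base",
        index_name="custom",
        passages_path="path/to/my_knowledge_dataset",
        index_path="path/to/my_knowledge_dataset_index.faiss",
    )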

shamanez commented 4 years ago

So the index name is like a column, right? Which controls whether that column should get loaded into memory or not?

lhoestq commented 4 years ago

In the RAG configuration you can specify index_name="exact" or index_name="compressed" for the "wiki_dpr" dataset, which indeed ships with those two types of index. For more info you can check the docs of the RagConfig.

On the other hand, the datasets library (in particular Dataset.add_faiss_index) also has an "index_name" parameter. However, this one is different from the one used in the RAG configuration on the transformers side: in the datasets library, each dataset can have several indexes that are identified by their names, and by default an index is named after the column that was used to build it. See the docs of the add_faiss_index method.

It's unfortunately the same variable name, but not for the same purpose... Does that make sense to you?
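
To make the two uses of the name concrete (a sketch; the paths and index names are illustrative):

    from datasets import load_from_disk

    ds = load_from_disk("path/to/my_knowledge_dataset")  # placeholder path

    # datasets-side index_name: defaults to the column the index is built from
    ds.add_faiss_index(column="embeddings")                    # index named "embeddings"
    ds.add_faiss_index(column="embeddings", index_name="alt")  # explicitly named "alt"

    # transformers-side index_name: selects which wiki_dpr index RAG uses,
    # e.g. RagConfig(..., index_name="exact") vs index_name="compressed"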

shamanez commented 4 years ago

Thanks a lot. I got the idea.

@lhoestq Btw, I tried to run the RAG fine-tuning script with an older PyTorch Lightning version (0.9) and it worked. I think the issue comes from a version mismatch.

lhoestq commented 4 years ago

I managed to reproduce the issue; I'm working on a fix.

shamanez commented 4 years ago

Perfect.

lhoestq commented 4 years ago

Update: it looks like this is because PyTorch Lightning removed the init_ddp_connection hook from their LightningModule. That hook was used to initialize the index on process 0. I'll use something else to initialize the index.
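
For anyone following along, one possible shape for a replacement (purely a sketch, not the actual fix) is to call init_retrieval from a hook that PL still provides, such as the module's setup:

    from pytorch_lightning import LightningModule

    class GenerativeQAModuleSketch(LightningModule):
        # ...model, retriever, and distributed_port are set up elsewhere...

        def setup(self, stage):
            # runs in every DDP process once distributed is initialized;
            # the distributed retriever joins its retrieval process group
            # here and loads the faiss index on the main process only
            self.model.retriever.init_retrieval(self.distributed_port)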

shamanez commented 4 years ago

Ok, that's why the code still works with PL 0.9.

So now the problem is the initialization of the index in this line?

Thanks a lot.

shamanez commented 4 years ago

@lhoestq any update on this, please?

P.S. Sorry for spamming :)

lhoestq commented 4 years ago

Yes, I'm working on a fix! I'll make a PR tomorrow.

shamanez commented 4 years ago

Thanks a lot. :)