huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RAG - how to precompute custom document index? #7462

Closed aced125 closed 3 years ago

aced125 commented 3 years ago

Was wondering if there was any code snippet / blog post showing how one could load their own documents and index them, so they can be used by the RAG retriever.

Cheers!

Weilin37 commented 3 years ago

Second this.

https://github.com/deepset-ai/haystack may be useful to you. They build on Hugging Face models and have a DPR implementation with an end-to-end example. I wouldn't be surprised to see RAG implemented there soon.

aced125 commented 3 years ago

@Weilin37 Thanks. I'm also looking at the Faiss docs now (https://github.com/facebookresearch/faiss/wiki/Faiss-indexes).

patrickvonplaten commented 3 years ago

@lhoestq can maybe help here as well

lhoestq commented 3 years ago

Yep I'm thinking of adding a script in examples/rag that shows how to create an indexed dataset for RAG. I'll let you know how it goes
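In the meantime, the core of what such an index does can be sketched in plain NumPy. This is a toy stand-in, not the real pipeline: actual RAG embeds passages with a DPR context encoder and searches them with a FAISS index, and the passage texts and vectors below are made up for illustration.

```python
import numpy as np

# Toy passages with made-up, near-orthogonal "embeddings". In real RAG these
# vectors come from a DPR context encoder and are stored in a FAISS index.
passages = ["Paris facts", "Berlin facts", "Tokyo facts", "Rome facts"]
doc_embeddings = np.eye(len(passages), dtype="float32")

def retrieve(question_embedding, k=2):
    # DPR-style retrieval: rank documents by inner product with the question
    # embedding (what a FAISS inner-product index computes at scale).
    scores = doc_embeddings @ question_embedding
    top_k = np.argsort(-scores)[:k]
    return top_k.tolist(), scores[top_k]

# A question embedding that points towards the "Tokyo facts" passage.
question_embedding = doc_embeddings[2] + 0.01
ids, scores = retrieve(question_embedding)
print(passages[ids[0]])  # -> Tokyo facts
```

With the `datasets` library, the same pattern is roughly: `dataset.map(...)` to compute an `embeddings` column with the context encoder, then `dataset.add_faiss_index(column="embeddings")` to make it searchable.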

NatBala commented 3 years ago

@lhoestq Can you please let me know how we can index custom datasets? I appreciate your help on this.

NatBala commented 3 years ago

@lhoestq I have a bunch of documents to perform Q&A over. Currently the config says: `dataset (str, optional, defaults to "wiki_dpr") – A dataset identifier of the indexed dataset on HuggingFace AWS bucket (list all available datasets and ids using datasets.list_datasets())`. So how can we create an indexed file and pass it to the pretrained model for evaluation?

lhoestq commented 3 years ago

> @lhoestq I have a bunch of documents to perform Q&A and currently, in the config it says, dataset (str, optional, defaults to "wiki_dpr") – A dataset identifier of the indexed dataset on HuggingFace AWS bucket (list all available datasets and ids using datasets.list_datasets()). So how can we create an indexed file and input that to the pretrained model for evaluation.

Yes right... We'll have to edit the RagRetriever and the HfIndex to accept custom ones. If you want to give it a try in the meantime, feel free to do so :)

aced125 commented 3 years ago

Any progress on this @lhoestq @patrickvonplaten ? Awesome work guys :)

aced125 commented 3 years ago

@tholor @Timoeller Do you reckon you guys could integrate this work into haystack?

tholor commented 3 years ago

@aced125 Yep, we will integrate RAG in Haystack soon (https://github.com/deepset-ai/haystack/issues/443).

lhoestq commented 3 years ago

> Any progress on this @lhoestq @patrickvonplaten ? Awesome work guys :)

You can expect a PR by tomorrow

Laksh1997 commented 3 years ago

Awesome thanks everyone @tholor @lhoestq @patrickvonplaten !!!!

NatBala commented 3 years ago

Thank you @lhoestq . Really appreciate for getting back quickly on this issue.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

krishanudb commented 3 years ago

Hello everyone, I am interested in studying how RAG behaves without the DPR retriever. For example, in the code below:

```python
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

retriever = RagRetriever.from_pretrained('./rag-token-nq', indexed_dataset=dataset)
tokenizer = RagTokenizer.from_pretrained("./rag-token-nq")
model = RagTokenForGeneration.from_pretrained("./rag-token-nq", retriever=retriever)

input_dict = tokenizer.prepare_seq2seq_batch("How many people live in Paris?", "In Paris, there are 10 million people.", return_tensors="pt")
input_ids = input_dict["input_ids"]

model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

generated_ids = model.generate(input_ids=input_ids, labels=input_dict["labels"])

generated_string = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_string)
```

In the line `input_dict = tokenizer.prepare_seq2seq_batch("How many people live in Paris?", "In Paris, there are 10 million people.", return_tensors="pt")`, I want to use "How many people live in Paris?" as the question and "In Paris, there are 10 million people." as the passage/context that should be used to generate the answer.

Kindly let me know how to do this.

Is my understanding of the code correct, and if not, how should I go about it?

Thanks, Krishanu

lhoestq commented 3 years ago

For RAG you can pass both your question as `input_ids` and your context as `context_input_ids` to `model.generate`. You can provide several contexts for one question.

You can find more information in the documentation here

krishanudb commented 3 years ago

@lhoestq Thanks for the reply. There is this `doc_scores` parameter in the `model.generate` function. Is it required or optional?

lhoestq commented 3 years ago

Indeed, if you pass the `context_input_ids` you also need to provide the `doc_scores`.
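For intuition, `doc_scores` are the retrieval scores (question-document inner products) that RAG uses to weight each document's generator probabilities. Here is a minimal NumPy sketch with made-up numbers; in the real model the scores come from the DPR question encoder output and the retrieved document embeddings:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Made-up embeddings: one question vector and two retrieved document vectors.
question_emb = np.array([0.5, 1.0, -0.3])
doc_embs = np.array([[0.4, 0.9, -0.2],
                     [0.1, -0.5, 0.8]])

# doc_scores: inner product of the question with each document -- the tensor
# you pass to model.generate() alongside context_input_ids.
doc_scores = doc_embs @ question_emb

# RAG marginalizes over documents: each document's token probabilities are
# weighted by softmax(doc_scores) and summed.
p_token_per_doc = np.array([0.7, 0.2])  # toy probabilities of one token
p_token = float((softmax(doc_scores) * p_token_per_doc).sum())
```

So when you supply your own `context_input_ids`, the model has no way to recompute these scores itself, which is why both must be passed together.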