gomyway1216 / rag


Research Notes on RAG #26

Closed: carolina-museum closed this issue 11 hours ago

carolina-museum commented 1 month ago

This is a thread for Carolina to summarize her research on RAG. The purpose is to share the information among project members.

carolina-museum commented 1 month ago

Here is a recent survey paper that thoroughly overviews Retrieval-Augmented Generation (RAG).

@article{gao2023retrieval, title={Retrieval-augmented generation for large language models: A survey}, author={Gao, Yunfan and Xiong, Yun and Gao, Xinyu and Jia, Kangxiang and Pan, Jinliu and Bi, Yuxi and Dai, Yi and Sun, Jiawei and Wang, Haofen}, journal={arXiv preprint arXiv:2312.10997}, year={2023} } Link: https://arxiv.org/abs/2312.10997

Abstract:

Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This enhances the accuracy and credibility of the generation, particularly for knowledge-intensive tasks, and allows for continuous knowledge updates and integration of domain-specific information. RAG synergistically merges LLMs' intrinsic knowledge with the vast, dynamic repositories of external databases. This comprehensive review paper offers a detailed examination of the progression of RAG paradigms, encompassing the Naive RAG, the Advanced RAG, and the Modular RAG. It meticulously scrutinizes the tripartite foundation of RAG frameworks, which includes the retrieval, the generation and the augmentation techniques. The paper highlights the state-of-the-art technologies embedded in each of these critical components, providing a profound understanding of the advancements in RAG systems. Furthermore, this paper introduces up-to-date evaluation framework and benchmark. At the end, this article delineates the challenges currently faced and points out prospective avenues for research and development.

The main points of this paper that Carolina wants to share with the other members are summarized below.

⭐️Figure 2 shows the overview of the RAG architecture. Here is how it works (a minimal code sketch follows the list):

  1. Context texts are loaded into the database. They are embedded so that the most relevant information can be searched for efficiently.
  2. A user asks a question (inputs a query).
  3. The query is embedded, and the most relevant contexts are retrieved from the database. The distance between the embedded query and each embedded context determines their similarity.
  4. The k most relevant contexts (in text) are combined with the user's query (in text) to generate a better question (query in text) to feed to an LLM. A better question is one that includes the retrieved context and instructs the model to say "I don't know" if the information is not present in the context.
  5. The query generated in the previous step is fed to an LLM, which gives the user an answer based on it. The user receives this answer.
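
To make the flow concrete, here is a minimal sketch of steps 1-5. It assumes the sentence-transformers library for embeddings and an in-memory NumPy array as the "database"; the final LLM call is left as a hypothetical stub, since any chat-completion API would fit there.

```python
# Minimal RAG sketch of steps 1-5. Assumes: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Step 1: embed the context texts (our toy "database").
contexts = [
    "RAG retrieves relevant documents before generating an answer.",
    "LLMs can hallucinate when asked about facts outside their training data.",
    "Embeddings map texts to vectors so that similar texts end up close together.",
]
context_vecs = embedder.encode(contexts, normalize_embeddings=True)

# Steps 2-3: embed the query and rank contexts by cosine similarity
# (a dot product, since the vectors are normalized).
query = "What does RAG do before generating an answer?"
query_vec = embedder.encode([query], normalize_embeddings=True)[0]
top_k = np.argsort(context_vecs @ query_vec)[::-1][:2]

# Step 4: combine the k most relevant contexts with the query into a better prompt.
prompt = (
    "Answer the question using only the context below. "
    'If the information is not in the context, say "I don\'t know".\n\n'
    "Context:\n" + "\n".join(contexts[i] for i in top_k)
    + f"\n\nQuestion: {query}"
)

# Step 5: feed the prompt to an LLM (hypothetical stub; swap in your provider's API).
# answer = llm(prompt)
print(prompt)
```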

⭐️Table 1 is a summary of RAG methods. The columns are: Method, Retrieval Source, Retrieval Data Type, Retrieval Granularity, Augmentation Stage, and Retrieval Process. From this table, I learned:

⭐️Process in RAG: Indexing, Retrieval, Generation
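
As a concrete illustration of the indexing step, here is a hypothetical fixed-size chunker that splits documents into overlapping windows before embedding; the size and overlap values are illustrative assumptions, and real pipelines often split on sentence or token boundaries instead.

```python
# Hypothetical fixed-size chunker for the indexing step; parameters are illustrative.
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows before embedding."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```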

⭐️Challenges in RAG

⭐️Improving naive RAG

  • Post-retrieval
    • rerank chunks and compress the context (see the reranking sketch after this list)
    • select the essential information, emphasize critical sections, and shorten the context to be processed
  • Add modules
    • Search module: look for data efficiently
    • Memory module: store data in a structured way so that searching is easier
    • Predict module: remove redundancies such as duplicated or extraneous information
    • Task Adapter module: allow zero-shot inputs, without being task-specific
  • New patterns
    • update and improve the model by rewriting the prompt using feedback
    • complex search units (keyword, semantic, vector)
    • flexible architecture, replacing parts of the pipeline to adapt to each use case
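
As an example of the post-retrieval reranking mentioned in the first bullet, here is a hedged sketch using a cross-encoder from sentence-transformers. The model name and library choice are assumptions on my part; the survey does not prescribe a specific reranker.

```python
# Rerank retrieved chunks with a cross-encoder (post-retrieval step).
# Assumes: pip install sentence-transformers; the model choice is illustrative.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    # Score each (query, chunk) pair jointly; this is more accurate than raw
    # embedding distance but too slow to run over the whole corpus.
    scores = reranker.predict([(query, c) for c in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda p: p[0], reverse=True)
    return [c for _, c in ranked[:top_n]]
```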

I will keep reading this paper and add more information. I will also look into other sources for important concepts for RAG.

gomyway1216 commented 1 month ago

"evaluation framework and benchmark" can be an interesting point to dig in. I was first thinking that validation can be done by our intuition, but if there is any unbiased way, that would make our model stronger and more persuasive.

carolina-museum commented 1 month ago

Continuing from the previous comment on the recent survey paper, here is a summary of Section 4, Task and Evaluation.

@article{gao2023retrieval, title={Retrieval-augmented generation for large language models: A survey}, author={Gao, Yunfan and Xiong, Yun and Gao, Xinyu and Jia, Kangxiang and Pan, Jinliu and Bi, Yuxi and Dai, Yi and Sun, Jiawei and Wang, Haofen}, journal={arXiv preprint arXiv:2312.10997}, year={2023} } Link: https://arxiv.org/abs/2312.10997

⭐️Table 2 lists sub-tasks, datasets, and methods. The sub-tasks are Question Answering (QA), Dialog, Information Extraction, Reasoning, and others. In our project, QA and Information Extraction are important.

(Screenshot of Table 2 from the paper.)

⭐️Evaluation Target

RaLLe is an evaluation framework for RAG that uses the above metrics to evaluate RAG applications.
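
As a concrete example of the kind of unbiased measurement gomyway1216 asked about, here is a sketch of two standard retrieval metrics, hit rate@k and mean reciprocal rank (MRR). These are common choices in RAG evaluation, not necessarily the exact metrics the paper lists, and the function signatures are assumptions for illustration.

```python
# Standard retrieval metrics, assuming one known relevant document id per query.
def hit_rate_at_k(ranked_ids: list[list[str]], relevant: list[str], k: int = 5) -> float:
    """Fraction of queries whose relevant document appears in the top k results."""
    hits = sum(rel in ranked[:k] for ranked, rel in zip(ranked_ids, relevant))
    return hits / len(relevant)

def mrr(ranked_ids: list[list[str]], relevant: list[str]) -> float:
    """Average of 1/rank of the relevant document (0 if it was not retrieved)."""
    total = sum(1.0 / (ranked.index(rel) + 1)
                for ranked, rel in zip(ranked_ids, relevant) if rel in ranked)
    return total / len(relevant)
```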

Evaluation objectives:

⭐️Evaluation Aspects

⭐️Evaluation Benchmarks and Tools

huyfififi commented 1 month ago

How about we add these notes as a Markdown file in the repository? Maybe we can close this issue that way.

carolina-museum commented 1 month ago

That is a great idea! I will make a markdown file for this topic.