gomyway1216 / rag


What is the best chunk size? #27

Open carolina-museum opened 1 month ago

carolina-museum commented 1 month ago

Determine the best chunk size for our application.

The application should keep track of individuals, so that RAG can build an appropriate prompt to feed to the LLM and the application can give individualized advice to the user.

The input is text: actual notes on life problems and solutions from one of our project members.

Larger chunks use less memory to store, but the information is embedded more coarsely, which loses some detail; on the other hand, larger chunks preserve local context better. Smaller chunks use more memory but keep the information in granular pieces, and search cost is higher because there are more chunks. (This assumes the vector representing a chunk is the same size in both cases.)
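To make the storage side of this trade-off concrete, here is a minimal sketch. It assumes whitespace tokenization and a fixed embedding dimension of 384 (both simplifications; a real setup would use the model's own tokenizer and embedding size), and the sample text is a stand-in, not our actual notes.

```python
# Minimal sketch of the chunk-size / storage trade-off.
# Assumes whitespace tokenization and a fixed embedding dimension.

def chunk(tokens, size):
    """Split a token list into fixed-size chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

text = "I struggled to decide whether to change jobs. " * 200  # stand-in for the notes
tokens = text.split()
embedding_dim = 384  # assumed vector size per chunk, same for both settings

for size in (128, 512):
    chunks = chunk(tokens, size)
    # One embedding vector per chunk: fewer (larger) chunks means less storage,
    # at the cost of a coarser representation of each chunk's content.
    print(f"chunk size {size}: {len(chunks)} chunks, "
          f"{len(chunks) * embedding_dim} floats of embedding storage")
```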

Let's find the best chunk size by both researching and experimenting. It is important to balance the two! Do not over-research and build theories on top of theories.

carolina-museum commented 1 month ago

Here is a recent survey paper that thoroughly overviews Retrieval-Augmented Generation (RAG).

@article{gao2023retrieval, title={Retrieval-augmented generation for large language models: A survey}, author={Gao, Yunfan and Xiong, Yun and Gao, Xinyu and Jia, Kangxiang and Pan, Jinliu and Bi, Yuxi and Dai, Yi and Sun, Jiawei and Wang, Haofen}, journal={arXiv preprint arXiv:2312.10997}, year={2023} } Link: https://arxiv.org/abs/2312.10997

Table 1 of the paper summarizes RAG methods. Its columns are: Method, Retrieval Source, Retrieval Data Type, Retrieval Granularity, Augmentation Stage, and Retrieval Process. From this table, I learned:

There are many other interesting pieces of information about RAG in this paper, but those will be discussed in #26. The main focus here is to list the information related to chunk size.

carolina-museum commented 1 month ago

All our input will be in English text, at least to begin with. (The original notes include both English and Japanese.)

gomyway1216 commented 1 month ago

Thank you for reaching out. I will prepare the sample data from my past notes.

carolina-museum commented 1 month ago

Prototype project settings

- Database input: English diary-like text, a few paragraphs in length
- Query: a life-decision question
- Output: a solution to the life-decision question
- LLM: ChatGPT
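As a rough picture of how these settings could fit together, here is a hypothetical end-to-end sketch: embed the diary chunks, retrieve the closest one for a life-decision question, and ask ChatGPT for advice. It assumes the OpenAI Python SDK (v1 interface) and an `OPENAI_API_KEY` in the environment; the model names, sample chunks, and question are placeholders, not project decisions.

```python
# Hypothetical prototype sketch: embed diary chunks, retrieve by cosine
# similarity, and pass the best match to ChatGPT as context.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Database input: diary-like paragraphs (stand-in examples, not real data).
chunks = [
    "I kept postponing the decision to move to a new city because I feared regret.",
    "Talking the problem through with a friend helped me commit to a concrete plan.",
]
chunk_vecs = embed(chunks)

query = "Should I accept the job offer in another city?"  # a life-decision question
q_vec = embed([query])[0]

# Cosine-similarity retrieval, top-1 for brevity.
scores = chunk_vecs @ q_vec / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec))
context = chunks[int(scores.argmax())]

answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Give individualized advice based on the user's past notes."},
        {"role": "user", "content": f"Notes:\n{context}\n\nQuestion: {query}"},
    ],
)
print(answer.choices[0].message.content)
```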

To do in the near future

carolina-museum commented 1 month ago

How many tokens can we feed to our LLM model? Approximately 128k tokens.

@article{finardi2024chronicles, title={The Chronicles of RAG: The Retriever, the Chunk and the Generator}, author={Finardi, Paulo and Avila, Leonardo and Castaldoni, Rodrigo and Gengo, Pedro and Larcher, Celio and Piau, Marcos and Costa, Pablo and Carid{\'a}, Vinicius}, journal={arXiv preprint arXiv:2401.07883}, year={2024} } Link: https://arxiv.org/pdf/2401.07883

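To sanity-check how much of that ~128k-token budget our notes would consume, a quick count with a tokenizer is enough. This sketch assumes the tiktoken library and the cl100k_base encoding (the exact encoding depends on the model); the sample text is a placeholder.

```python
# Rough token-budget check; assumes the tiktoken library.
# The ~128k figure is the context window, which the prompt, the retrieved
# chunks, and the answer all have to share.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by recent OpenAI models

notes = "I struggled to decide whether to change jobs. " * 500  # stand-in text
n_tokens = len(enc.encode(notes))

context_window = 128_000  # approximate limit from the comment above
print(f"{n_tokens} tokens -> {n_tokens / context_window:.1%} of the context window")
```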
carolina-museum commented 1 month ago

Has anybody tried using different chunk lengths? Yes. Long chunks are not necessarily better. (This is exactly why we have RAG!)

@article{juvekar2024introducing, title={Introducing a new hyper-parameter for RAG: Context Window Utilization}, author={Juvekar, Kush and Purwar, Anupam}, journal={arXiv preprint arXiv:2407.19794}, year={2024} } Link: https://arxiv.org/pdf/2407.19794

The text sources are legal documents, research papers, and Wikipedia pages.

They tried chunk sizes of 128, 256, 512, 1024, and 2048 tokens; 512 and 1024 were generally better than the others.
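We can run the same kind of sweep over our own notes once the sample data is ready. Here is a sketch of the experiment loop; the retrieval and scoring functions are deliberately crude placeholders (word-overlap retrieval and a dummy score), to be replaced by the embedding search and a real quality judgment of the generated advice. The notes and the query are stand-ins.

```python
# Sketch of a chunk-size sweep. retrieve() and judge_answer() are placeholders:
# swap in embedding-based retrieval and an evaluation of the final LLM answer.

def split_into_chunks(tokens, size):
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def retrieve(chunks, query, k=3):
    # Placeholder: return the k chunks sharing the most words with the query.
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)[:k]

def judge_answer(retrieved, query):
    # Placeholder score; replace with a judgment of the advice generated
    # from the retrieved chunks (e.g. manual grading on held-out questions).
    return sum(len(c.split()) for c in retrieved)

notes = ("I kept a journal about choosing between staying and moving. " * 400).split()
queries = ["Should I accept the job offer in another city?"]

for size in (128, 256, 512, 1024, 2048):  # the sizes tried in the paper above
    chunks = split_into_chunks(notes, size)
    score = sum(judge_answer(retrieve(chunks, q), q) for q in queries)
    print(f"chunk size {size}: {len(chunks)} chunks, score {score}")
```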

carolina-museum commented 1 month ago

Just leaving my thoughts here: many online articles chunk the input into thousands of tokens. I suspect their total database input is much larger than what we expect in our project, so we may want smaller chunks than those articles suggest. We may also want to rely more on information that is not in the database. Either way, we should try implementing first.