mukulpatnaik / researchgpt

An LLM-based research assistant that lets you have a conversation with a research paper
https://www.dara.chat
MIT License
3.55k stars · 340 forks

improve search_embeddings() #39

Closed · goldengrape closed this issue 1 year ago

goldengrape commented 1 year ago

Since this platform mainly consists of researchers involved in academic studies, we can discuss academic issues.

In the search_embeddings() function, only similarity sorting and selection of the top n=3 results are done. Is there sufficient justification for this, or does it need to be improved?

I think the selection could be based on the distribution of similarity scores. If the distribution is unimodal, the top three results should be the chunks near the peak, so a fixed top-3 works well. However, if the distribution is not unimodal, say it has two peaks, selecting the top three directly may discard relevant chunks; in that case it may be necessary to select the chunks near each peak and its surrounding area, so as to preserve the overall shape of the similarity distribution.

The most complex (but perhaps best) approach to similarity-based selection may be to apply a Transformer attention mechanism to sentence- or paragraph-level embedding vectors and select appropriate reference chunks based on content.
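
As a rough illustration of the distribution-based idea, here is a minimal sketch that keeps every chunk within a fixed margin of the best similarity score instead of a hard-coded top-3. The `select_by_distribution` helper and its `margin` parameter are hypothetical and not part of this repo:

```python
import numpy as np

def select_by_distribution(similarities, margin=0.05, min_k=1, max_k=10):
    # Keep every chunk whose similarity is within `margin` of the best match,
    # rather than a fixed top-3. Hypothetical helper, not from this repo.
    sims = np.asarray(similarities, dtype=float)
    order = np.argsort(sims)[::-1]        # chunk indices, best match first
    threshold = sims[order[0]] - margin   # everything "near the peak"
    selected = [int(i) for i in order if sims[i] >= threshold]
    if len(selected) > min_k:
        return selected[:max_k]
    return [int(i) for i in order[:min_k]]
```

With a bimodal score list like `[0.9, 0.88, 0.5, 0.3]` this keeps only the two chunks near the first peak; a peak-detection step over the sorted scores could extend it to multi-peak selection.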

MrPeterJin commented 1 year ago

I agree with you. However, I think computational cost is another factor we need to focus on. Given that, the simplest approach I am considering is adding GPTIndex for better extraction in my fork.

Update: directly using GPTIndex has not outperformed the method of feeding embeddings to GPT-3/3.5.

goldengrape commented 1 year ago

To maximize the use of each conversation, the number of results can be selected dynamically: as long as the total length does not exceed a threshold, all of them can be included as references. ChatGPT will decide for itself whether these references are useful.

length_max = 3000
results = df.sort_values("similarity", ascending=False, ignore_index=True)
# Build a dictionary of the first n results with the page number (a column
# in the dataframe) as the key and the text as the value.
# Take the cumulative sum of df["length"]; once it exceeds length_max,
# stop accumulating and use the number of accumulated rows as n.
results["length_cumsum"] = results["length"].cumsum()
results = results[results["length_cumsum"] <= length_max]
n = len(results)
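
For context, here is a self-contained version of the snippet above. The toy DataFrame is made-up data standing in for the real embeddings table; only the sort/cumsum behavior is being demonstrated:

```python
import pandas as pd

# Toy stand-in for the embeddings table (hypothetical values).
df = pd.DataFrame({
    "page": [1, 2, 3, 4],
    "similarity": [0.91, 0.85, 0.80, 0.40],
    "length": [1200, 1500, 900, 2000],
})

length_max = 3000
results = df.sort_values("similarity", ascending=False, ignore_index=True)
results["length_cumsum"] = results["length"].cumsum()
results = results[results["length_cumsum"] <= length_max]
n = len(results)  # pages 1 and 2 fit: 1200 + 1500 = 2700 <= 3000
```

This admits as many references as the context budget allows rather than a fixed three.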

MrPeterJin commented 1 year ago

Hi, I just implemented this in my fork. I also split the main text and the references to get a better search scope (although this splitting is a bit naive).

goldengrape commented 1 year ago

@MrPeterJin If the PDF file is large, it easily causes errors; I don't know why. Also, since your fork mainly runs locally, could you save the DataFrame with df.to_pickle as a pickle file with the same name as the PDF? That way, the next time I open the same PDF, I can skip the embedding step. @mukulpatnaik used redis, but I don't know where it saves the db.
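
A minimal sketch of this caching idea. `load_or_embed` and `embed_fn` are hypothetical names, not functions from either fork:

```python
import os
import pandas as pd

def load_or_embed(pdf_path, embed_fn):
    # Cache the embeddings DataFrame next to the PDF so that re-opening the
    # same file skips the embedding step. `embed_fn` is a hypothetical
    # callable that reads the PDF and returns a DataFrame of chunks.
    cache_path = os.path.splitext(pdf_path)[0] + ".pkl"
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    df = embed_fn(pdf_path)
    df.to_pickle(cache_path)
    return df
```

Keying the cache on the file name is the simplest scheme; hashing the file contents would be more robust if PDFs get edited in place.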

MrPeterJin commented 1 year ago

Large documents do indeed have a high probability of yielding errors. I think a paper with this amount of content tends to have varied layouts (for example, pages 21-22 of this paper are nearly all images), which my fork is not yet able to handle.

goldengrape commented 1 year ago

@MrPeterJin In paper-qa (https://github.com/whitead/paper-qa/blob/main/paperqa/readers.py) the author uses pypdf. Would it be better than pdfplumber?

MrPeterJin commented 1 year ago

@goldengrape I have implemented saving the embeddings as a pickle file; please check local.py for reference.

As for pypdf, it cannot read PDFs in languages other than English. That's why I switched to pdfplumber.

goldengrape commented 1 year ago

@MrPeterJin or PyMuPDF? https://github.com/pymupdf/PyMuPDF/issues/329

MrPeterJin commented 1 year ago

Okay... it seems pdfplumber also supports reading images, so I am considering refactoring my code to overcome this issue, since changing the PDF library would cost a lot more work :p

goldengrape commented 1 year ago

Maybe that's why ChatPDF.com gave up on displaying the PDF entirely.

MrPeterJin commented 1 year ago

@goldengrape You may test my new fork. I fixed the errors in this long paper; maybe it will work on your long papers too.

Edited: it is quite robust now (at least more robust than the previous version). You may try it.

goldengrape commented 1 year ago

@MrPeterJin I understand the reason for my previous failures now: I was testing with books instead of research papers. Book formatting can be more complex, and errors can occur around book titles. Now that I've limited the PDFs to research papers, there are no issues.

Research papers carry many implicit assumptions about their format, so you can estimate whether a piece of text is a title from its position on the page. Of course, limiting the tool to a specific type of document is a good implementation strategy.

But I suggest a more "ChatGPT" way: for example, extract the content of the first page or the first N sentences and directly ask ChatGPT which sentence is the title. Similarly, a prompt could be used to determine whether a piece of text is a reference.
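
A sketch of that prompt-based title detection. `title_prompt` is a hypothetical helper; the sentence splitting is deliberately crude and the actual model call is omitted:

```python
def title_prompt(first_page_text, n_sentences=10):
    # Build a prompt asking the model which sentence is the paper's title.
    # Illustrative only: real sentence splitting and the API call are omitted.
    sentences = first_page_text.split(". ")[:n_sentences]
    numbered = "\n".join(f"{i + 1}. {s.strip()}" for i, s in enumerate(sentences))
    return (
        "Below are the first sentences extracted from a PDF's first page.\n"
        "Which one is the paper's title? Answer with the number only.\n\n"
        + numbered
    )
```

The same pattern (enumerate candidate lines, ask for a number) could classify whether a chunk belongs to the references section.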