nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License

Hybrid search #2969

Closed · manyoso closed this 1 month ago

manyoso commented 1 month ago

The first three commits are not strictly necessary for hybrid search. The first one is important, though, since we should preserve the order of results that the embedding search returns; even this alone has an impact on the BEIR test results.

The second and third commits address a problem in our current chunking strategy where the maximum chunk size is not strictly enforced. These two changes enforce a strict maximum chunk size without changing anything else about our chunking strategy.
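
For illustration, here is a minimal C++ sketch of what strictly enforcing a maximum chunk size can look like; the function name, the word-based size measure, and the splitting strategy are assumptions for the example, not the actual code in these commits.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Illustrative only: split any chunk that exceeds maxWords into smaller pieces
// so that no emitted chunk can ever exceed the configured maximum. The real
// chunker may measure size in tokens or characters rather than words.
std::vector<std::string> enforceMaxChunkSize(const std::string &chunk, size_t maxWords)
{
    std::vector<std::string> result;
    std::istringstream in(chunk);
    std::string word, current;
    size_t count = 0;
    while (in >> word) {
        if (count == maxWords) {
            result.push_back(current);
            current.clear();
            count = 0;
        }
        if (!current.empty())
            current += ' ';
        current += word;
        ++count;
    }
    if (!current.empty())
        result.push_back(current);
    return result;
}
```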

The fourth commit is the actual hybrid search. It introduces an FTS (full-text search) virtual table and implements reciprocal rank fusion (RRF) to combine BM25 keyword search with the embedding search.
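
As a rough sketch of the mechanism only (the table name, schema, and queries below are placeholders, not the ones used by localdocs), SQLite's FTS5 extension provides the virtual table and the BM25 ranking, assuming SQLite is compiled with FTS5 enabled:

```cpp
#include <cstdio>
#include <sqlite3.h>

int main()
{
    sqlite3 *db = nullptr;
    sqlite3_open(":memory:", &db);

    // An FTS5 index over chunk text, keyed by the chunk's rowid.
    sqlite3_exec(db,
        "CREATE VIRTUAL TABLE chunks_fts USING fts5(text);"
        "INSERT INTO chunks_fts(rowid, text) VALUES (1, 'the quick brown fox');"
        "INSERT INTO chunks_fts(rowid, text) VALUES (2, 'hybrid search with bm25');",
        nullptr, nullptr, nullptr);

    // BM25 keyword search: FTS5's bm25() returns lower (more negative) scores
    // for better matches, so ordering ascending yields best-first results.
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT rowid, bm25(chunks_fts) FROM chunks_fts "
        "WHERE chunks_fts MATCH ? ORDER BY bm25(chunks_fts)",
        -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, "bm25", -1, SQLITE_STATIC);
    while (sqlite3_step(stmt) == SQLITE_ROW)
        std::printf("rowid=%lld score=%f\n",
                    (long long)sqlite3_column_int64(stmt, 0),
                    sqlite3_column_double(stmt, 1));
    sqlite3_finalize(stmt);
    sqlite3_close(db);
}
```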

RRF paper: https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf
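
In RRF, each ranked list contributes 1/(k + rank) for every document it contains, and the contributions are summed; the paper suggests k = 60. A minimal sketch, with the constant, the id type, and the function name chosen for the example rather than taken from this PR:

```cpp
#include <algorithm>
#include <unordered_map>
#include <vector>

// Illustrative only: fuse any number of ranked id lists with reciprocal rank
// fusion. Each list contributes 1 / (k + rank) for every id it contains, so
// ids appearing in more lists, or at better ranks, float to the top.
std::vector<int> reciprocalRankFusion(const std::vector<std::vector<int>> &rankings,
                                      double k = 60.0)
{
    std::unordered_map<int, double> score;
    for (const auto &ranking : rankings)
        for (size_t rank = 0; rank < ranking.size(); ++rank)
            score[ranking[rank]] += 1.0 / (k + rank + 1); // ranks are 1-based in the formula

    std::vector<int> fused;
    fused.reserve(score.size());
    for (const auto &entry : score)
        fused.push_back(entry.first);
    std::sort(fused.begin(), fused.end(),
              [&](int a, int b) { return score.at(a) > score.at(b); });
    return fused;
}

// Example: fuse a BM25 keyword ranking with an embedding ranking.
// auto fused = reciprocalRankFusion({{12, 7, 3}, {7, 12, 9}});
```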

BEIR dataset paper: https://arxiv.org/pdf/2306.07471

The following changes improve our retrieval performance across four BEIR datasets. A follow-up change will integrate the test harness used to assess and make these changes. For now, here are screenshots showing some of the results:

[screenshots: BEIR benchmark results]

I also tested k=3 with a 512 chunk size, which matches our LocalDocs defaults, and the numbers again showed improvements for hybrid search.

The one dataset that doesn't show a clear improvement at 512 chunk size is FiQA, but it does improve with document-sized chunks. I'm still researching how to improve performance on this one.

Also: I'm considering adding a configuration option to turn hybrid search on/off, but I think this is good to go in now.