searchmysite / searchmysite.net

searchmysite.net is an open source search engine and search as a service
GNU Affero General Public License v3.0

Chat with your website functionality #96

Open m-i-l opened 1 year ago

m-i-l commented 1 year ago

As per comments in #85 and #84 I'd like to experiment with Solr 9's new vector search (DenseVectorField fieldType and K-Nearest-Neighbor Query Parser).
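For reference, a minimal sketch of what a K-Nearest-Neighbor query against Solr 9 could look like from Python, assuming a DenseVectorField called content_vector, a core called content and a sentence-transformers model for the query embeddings (all illustrative choices, not necessarily the actual searchmysite.net schema):

# Minimal sketch of a Solr 9 KNN query from Python. The field name "content_vector",
# the core name "content" and the embedding model are assumptions for illustration.
import requests
from sentence_transformers import SentenceTransformer

SOLR_URL = "http://localhost:8983/solr/content/select"
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

def knn_search(question, top_k=5):
    # Encode the question into the same vector space as the indexed content,
    # then use the {!knn} query parser to return the top_k nearest documents
    vector = model.encode(question).tolist()
    params = {
        "q": f"{{!knn f=content_vector topK={top_k}}}{vector}",
        "fl": "url,title,score",
    }
    return requests.get(SOLR_URL, params=params).json()["response"]["docs"]

print(knn_search("How long does it take to climb Ben Nevis?"))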

Vector search works best on longer phrases, while keyword search works best on specific search terms, and I'm not sure how best to combine the two (I know there are various proposals for hybrid search models, but I'm not sure there are any best practices yet), so the simplest option for now is a separate vector search page. Given the longer phrase input, and the familiarity many people have with things like ChatGPT, it would make sense to have a chat-like interface.

This could be accessed via a new link below the main search box, to the left of "Browse Sites", called e.g. "Chat Search". This would take you to a page with a larger chat box, where you can (in effect) ask a question about content in the searchmysite.net index, and get a link back, and maybe even a summary of the relevant part of the page.

A quick rough estimate suggests I could use a paid-for Large Language Model (LLM) API like OpenAI's for content embeddings for about US$25 a month, which would probably be doable, but the issue is that it would also need matching query embeddings and potentially summarisation API calls, which could work out at up to US$0.05 per question. Given I can have over 160,000 searches by (unblockable) SEO spam bots per day, that is a worst case of around US$8,000 a day, so I don't want the financial risk of using a paid-for API. That means I'll need to use some open source language models that are self-hostable on the relatively low spec hardware I'm currently using (2 vCPUs and 4Gb RAM).

Results therefore won't be anywhere near as good as ChatGPT's, but hopefully people will understand that I don't have unlimited cash. The main benefit is that the work might encourage more interest in the project. Plus it could form the basis for something a lot better, given there are lots of projects working on getting some of the larger models running on consumer hardware, e.g. float16 to int8 quantisation with LLaMA, LoRA, etc.

m-i-l commented 1 year ago

I've experimented with https://github.com/imartinez/privateGPT, and saw there is a possibility of feeding the text fragments returned by a vector search as context into a self-hosted GPT4All model to return answers to questions. Challenges would include:

Planning to implement the vector search first, before starting on the question answering. Hence I've created a new issue #99 for the vector search implementation, and renamed this one to "Chat with your website functionality".

m-i-l commented 10 months ago

Now that the #99 vector search implementation is working, and I've upgraded the server from 4Gb to 8Gb RAM as per #110, I've started taking a look at this again.

I have a basic Retrieval Augmented Generation pipeline working on dev, using the 7B parameter Llama 2 chat model quantised down to 3 bit, which is about as low as you can go. On my 7-year-old dev machine, context fragments were returned nearly instantly and the answers took 30-60s to generate, which is not necessarily too slow. It takes around 4.8Gb of RAM, which will be a bit of a struggle to fit on the production server, but not necessarily out of the question either. Results were at times surprisingly good, e.g.

Question: How long does it take to climb ben nevis?
Answer (context): https://michael-lewis.com/posts/climbing-the-three-peaks-snowdon-scafell-pike-and-ben-nevis/
Answer (generated): Based on the context you provided, it takes nearly 4 hours to climb Ben Nevis from the visitor center. The blog post states that it took them 4 hours to reach the summit, which is significantly longer than the time it would take to climb other mountains like Snowdon or Scafell Pike.
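For reference, a rough sketch of this retrieval augmented generation flow, using llama-cpp-python to run a quantised llama-2-7b-chat locally. This is just one way of running the model rather than necessarily how it is wired up here, and the model filename, prompt format and example fragment are illustrative:

# Rough RAG sketch: feed the context fragments returned by the vector search
# into a locally-run quantised Llama 2 chat model. Filenames are illustrative.
from llama_cpp import Llama

llm = Llama(model_path="llama-2-7b-chat.Q3_K_M.gguf", n_ctx=2048)

def answer_question(question, context_fragments, max_tokens=256):
    # Build a Llama 2 chat style prompt with the retrieved fragments as context
    context = "\n".join(context_fragments)
    prompt = ("[INST] <<SYS>>Answer the question using only the context provided.<</SYS>>\n"
              f"Context:\n{context}\n\nQuestion: {question} [/INST]")
    result = llm(prompt, max_tokens=max_tokens)
    return result["choices"][0]["text"].strip()

# context_fragments would come from the vector search, e.g. the knn_search() sketch above
fragments = ["It took us nearly 4 hours to reach the summit of Ben Nevis from the visitor centre."]
print(answer_question("How long does it take to climb ben nevis?", fragments))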

Getting a simple demo working on dev is one thing, but getting it production ready (e.g. able to work with more than one user at a time, integrating into existing non-async Flask templates, etc.) is something else entirely.

m-i-l commented 9 months ago

I've deployed a version to production for early testing, although haven't put a link to it anywhere because it isn't ready for wider testing just yet.

It is using llama-2-7b-chat quantised down to 3 bit, with TorchServe as the model server.
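With TorchServe in front of the model, the Flask side only needs to call the inference API over HTTP. A minimal sketch, where the registered model name ("llama2chat") and the request/response format are assumptions that depend on the custom handler:

# Sketch of calling the model via TorchServe's REST inference API.
# The model name and payload format depend on how the handler is written.
import requests

TORCHSERVE_URL = "http://localhost:8080/predictions/llama2chat"

def generate(prompt, timeout=120):
    # TorchServe serves registered models at /predictions/<model_name> on port 8080 by default
    response = requests.post(TORCHSERVE_URL, data=prompt.encode("utf-8"), timeout=timeout)
    response.raise_for_status()
    return response.text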

I've written a post at https://michael-lewis.com/posts/vector-search-and-retrieval-augmented-generation/ with more information on LLMs, Retrieval Augmented Generation, TorchServe etc.

Will update further after testing, and if all goes well will open up for wider use.

m-i-l commented 9 months ago

I've swapped from the 7B parameter Llama 2 chat model quantised down to 3 bit, which was too slow, to the 3B parameter Rocket model quantised down to 4 bit.

In summary: the source reference link is returned super quickly, some of the generated content is excellent, from a memory perspective it looks viable, and from a CPU and overall response time perspective it might be viable but needs further testing, especially while the indexing is running. The main issue now is that the vector search results are quite poor, so the LLM is given poor context, which means it mostly can't answer the question even when it should have been able to. The workaround for now is to restrict to the site you are interested in querying via the domains selector below the "Ask a question" box, as sketched below.
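For illustration, restricting the vector search to a single site could be done with an ordinary Solr filter query alongside the knn query, along the lines of the sketch below (reusing the requests/model/SOLR_URL setup from the earlier knn_search() sketch; the "domain" field name is an assumption rather than the actual schema):

# Sketch of limiting the knn query to one site with a standard Solr filter query.
# Assumes a "domain" field in the schema; actual field names may differ.
def knn_search_for_domain(question, domain, top_k=5):
    vector = model.encode(question).tolist()
    params = {
        "q": f"{{!knn f=content_vector topK={top_k}}}{vector}",
        "fq": f"domain:{domain}",  # only consider documents from this site
        "fl": "url,title,score",
    }
    return requests.get(SOLR_URL, params=params).json()["response"]["docs"]

print(knn_search_for_domain("How long does it take to climb ben nevis?", "michael-lewis.com"))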

To get this far, I've encountered and resolved (or partially resolved) the following issues:

Open issues are:

m-i-l commented 9 months ago

Regarding the surprisingly poor quality results: with sentence-transformers/all-MiniLM-L6-v2, "How high is Ben Nevis?" gives a similarity score of 0.3176 to text about mountains containing the words "Ben Nevis" and its height, but a higher score of 0.4072 to some text about someone called Benjamin talking about someone down a well. Similarly, "Can you summarize Immanuel Kant's biography in two sentences?" gives a similarity score of 0.5178 to text containing "Immanuel Kant" and some details of his life, but a higher score of 0.5766 to just the word "Biography". You can test this via:

# Compare cosine similarity of each question to a relevant and an irrelevant passage
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

question1 = "How high is Ben Nevis?"
answers1 = ["The three peaks in this context are the three highest peaks in Great Britain: Scafell Pike, England, 978m; Snowdon (Yr Wyddfa in Welsh), Wales, 1085m; Ben Nevis (Bheinn Nibheis in Scottish Gaelic), Scotland, 1345m", "Imagine being all that way down in the dark. Hope they thought to haul him up again at the end opined Benjamin, pleasantly."]
print(util.cos_sim(model.encode(question1), model.encode(answers1[0])))  # relevant passage: 0.3176
print(util.cos_sim(model.encode(question1), model.encode(answers1[1])))  # irrelevant passage: 0.4072

question2 = "Can you summarize Immanuel Kant's biography in two sentences?"
answers2 = ["Biography", "Immanuel Kant, born in 1724, was one of the most influential philosophers of the Enlightenment. Although Kant is best known today as a philosopher, his early work focused on physics. He correctly deduced a number of complicated physical phenomena, including the orbital mechanics of the earth and moon, the effects of the earth\u2019s rotation on weather patterns, and how the solar system was formed."]
print(util.cos_sim(model.encode(question2), model.encode(answers2[0])))  # just the word "Biography": 0.5766
print(util.cos_sim(model.encode(question2), model.encode(answers2[1])))  # detailed biography text: 0.5178

I've tested some of the alternative models on the leaderboard at https://huggingface.co/spaces/mteb/leaderboard, and switched to BAAI/bge-small-en-v1.5 because it gives better results (including the expected ones in the examples above) and doesn't take much more memory or CPU.
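As far as sentence-transformers is concerned the switch is a drop-in change, e.g. rerunning the checks from the snippet above with just the model name swapped (question1/answers1 etc. as defined there):

# Re-run the earlier similarity checks with the replacement embedding model
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
print(util.cos_sim(model.encode(question1), model.encode(answers1[0])))  # relevant Ben Nevis text should now score highest
print(util.cos_sim(model.encode(question1), model.encode(answers1[1])))
print(util.cos_sim(model.encode(question2), model.encode(answers2[0])))
print(util.cos_sim(model.encode(question2), model.encode(answers2[1])))  # detailed Kant text should now beat the bare word "Biography"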

It'll take 7 days for all the full listings to be reindexed with the new embedding model, and 28 days for all of the basic listings to be reindexed, so it should be ready for testing on production in around 7 days.