Open ndbolligerD3 opened 1 week ago
Most of our use cases require factual accuracy.
Example Use Case:
We need to fine-tune at least the following parameters: top-k, chunk size, and chunk overlap. Can chunk size correspond to each section of the research papers (our current Pinecone strategy)? Can we use the full text ("stuffing") when the context window allows for it (our current Streamlit setup)? Do we need a test set to tune these parameters? We now have a list of verified and hallucinated facts.
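For discussion, here is a minimal sketch of what section-based chunking with configurable size and overlap could look like. The heading regex and the parameter values are assumptions for illustration, not our current Pinecone settings:

```python
import re

# Hypothetical defaults -- these are exactly the knobs we would tune.
CHUNK_SIZE = 1000      # max characters per chunk
CHUNK_OVERLAP = 200    # characters shared between consecutive chunks
TOP_K = 40             # retrieval-side knob, used at query time, not here

def split_by_section(paper_text: str) -> list[str]:
    """Split a paper into sections; assumes numbered headings like '1. Introduction'."""
    parts = re.split(r"\n(?=\d+\.\s+[A-Z])", paper_text)
    return [p.strip() for p in parts if p.strip()]

def chunk_section(section: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Fall back to fixed-size overlapping windows when a section exceeds the size limit."""
    if len(section) <= size:
        return [section]
    chunks, start = [], 0
    while start < len(section):
        chunks.append(section[start:start + size])
        start += size - overlap
    return chunks

def chunk_paper(paper_text: str) -> list[str]:
    """One chunk per section where possible, overlapping windows otherwise."""
    return [c for s in split_by_section(paper_text) for c in chunk_section(s)]
```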
Aside from tuning, we can implement pre-processing: if most of our use cases focus on facts from research papers, why can't we use a deterministic step to identify and pull out the facts? The facts can be inserted into the response as-is, and a generative model can produce the editorial framing around them (e.g. social post, web article, etc.). That way we eliminate the need to constantly check whether the facts remain correct, since we pull them as-is from the papers.
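A rough sketch of that split, to make the idea concrete. The fact extractor here is only a placeholder (sentences containing numbers), the placeholder-substitution trick is one way to keep facts verbatim, and the model name is an assumption:

```python
import re
from openai import OpenAI  # assumes the OpenAI client the Streamlit app already uses

client = OpenAI()

def extract_facts(paper_text: str) -> list[str]:
    """Deterministic placeholder: keep sentences containing a number.
    A real extractor could be rule-based section parsing or a trained tagger."""
    sentences = re.split(r"(?<=[.!?])\s+", paper_text)
    return [s for s in sentences if re.search(r"\d", s)]

def frame_facts(facts: list[str], output_format: str = "social post") -> str:
    """The generative model only writes framing around numbered placeholders;
    the verbatim facts are substituted back in afterwards, so they cannot drift."""
    placeholders = "\n".join(f"{{FACT_{i}}}: {fact}" for i, fact in enumerate(facts))
    prompt = (
        f"Write a {output_format} that uses each placeholder token exactly once, "
        f"e.g. {{FACT_0}}, without rewording the facts themselves:\n{placeholders}"
    )
    draft = client.chat.completions.create(
        model="gpt-4o",  # assumption -- whichever model the app is configured with
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    for i, fact in enumerate(facts):
        draft = draft.replace(f"{{FACT_{i}}}", fact)
    return draft
```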
We can also add evaluation methods like G-Eval, where another LLM checks the output for factual consistency (https://cookbook.openai.com/examples/evaluation/how_to_eval_abstractive_summarization) (https://arxiv.org/pdf/2303.16634).
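A minimal sketch of a G-Eval-style consistency check in the spirit of the cookbook example. The prompt wording, the 1-5 scale, and the judge model name are assumptions:

```python
from openai import OpenAI

client = OpenAI()

CONSISTENCY_PROMPT = """You will be given a source document and a generated summary.
Rate the factual consistency of the summary with the source on a scale of 1 to 5,
where 5 means every claim in the summary is supported by the source.
Respond with the number only.

Source:
{document}

Summary:
{summary}
"""

def geval_consistency(document: str, summary: str) -> int:
    """Ask a judge LLM for a 1-5 factual-consistency score (G-Eval style)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption -- any capable judge model works here
        messages=[{"role": "user", "content": CONSISTENCY_PROMPT.format(document=document, summary=summary)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```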
Or we can use a non-LLM approach like Amazon's QUALS or other factual-accuracy evaluations (https://github.com/amazon-science/fact-check-summarization).
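QUALS itself is QA-based and would come from that repo, but even a much simpler non-LLM check would catch the kind of hallucinated quotes mentioned in the update below: verify that every quoted span in the output actually appears in the source paper. A rough sketch (the quote-matching heuristics are assumptions, not QUALS):

```python
import re

def verify_quotes(output_text: str, source_text: str) -> list[str]:
    """Return quoted spans in the model output that do not appear verbatim in the source.
    Whitespace is normalized; anything returned is a likely hallucinated quote."""
    normalize = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    source_norm = normalize(source_text)
    # Capture spans of 15+ characters inside straight or curly double quotes.
    quotes = re.findall(r'["“]([^"”]{15,})["”]', output_text)
    return [q for q in quotes if normalize(q) not in source_norm]

# Example: any non-empty result means the draft quotes something the paper never says.
# bad_quotes = verify_quotes(generated_post, paper_full_text)
```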
Update: we enabled OCR and changed top-k to 40. Using the "Generative AI and the Nature of Work" paper, it still hallucinated 3 quotes.
This ticket is to have a conversation between D3 and AM.
Questions:
- Are we tuning the model properly?
- How should we chunk the papers, or do we stuff the full article?
- Do we need a test set for tuning the model?
- How do we balance the configuration needed for a single file versus a group of files?
- Do we need to trigger different configurations for different needs?