Open ndbolligerD3 opened 1 week ago
Most of our use cases require factual accuracy.
Example Use Case:
We need to fine-tune at least the following parameters: top-k, chunk size, and chunk overlap. Can chunk size correspond to each section of the research papers (our current Pinecone strategy)? Can we use the full text ("stuffing") when the context window allows for it (our current Streamlit setup)? Do we need a test set to tune these parameters? We now have a list of verified and hallucinated facts.
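For discussion, here is a minimal sketch of what section-based chunking with configurable size and overlap could look like. The heading regex and the parameter values are assumptions for illustration, not our current Pinecone settings:

```python
import re

# Hypothetical defaults -- these are exactly the knobs we would tune.
CHUNK_SIZE = 1000      # max characters per chunk
CHUNK_OVERLAP = 200    # characters shared between consecutive chunks
TOP_K = 40             # retrieval-side knob, used at query time, not here

def split_by_section(paper_text: str) -> list[str]:
    """Split a paper into sections; assumes numbered headings like '1. Introduction'."""
    parts = re.split(r"\n(?=\d+\.\s+[A-Z])", paper_text)
    return [p.strip() for p in parts if p.strip()]

def chunk_section(section: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Fall back to fixed-size overlapping windows when a section exceeds the size limit."""
    if len(section) <= size:
        return [section]
    chunks, start = [], 0
    while start < len(section):
        chunks.append(section[start:start + size])
        start += size - overlap
    return chunks

def chunk_paper(paper_text: str) -> list[str]:
    """One chunk per section where possible, overlapping windows otherwise."""
    return [c for s in split_by_section(paper_text) for c in chunk_section(s)]
```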
Aside from tuning, we can implement pre-processing: if most of our use cases focus on facts from research papers, why can't we use a deterministic step to identify and pull out the facts? The facts can be inserted into the response as-is, and a generative model can produce the editorial framing around them (e.g. social post, web article, etc.). That way we eliminate the need to constantly check whether the facts remain correct, since we pull them as-is from the papers.
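A rough sketch of that split, to make the idea concrete. The fact extractor here is only a placeholder (sentences containing numbers), the placeholder-substitution trick is one way to keep facts verbatim, and the model name is an assumption:

```python
import re
from openai import OpenAI  # assumes the OpenAI client the Streamlit app already uses

client = OpenAI()

def extract_facts(paper_text: str) -> list[str]:
    """Deterministic placeholder: keep sentences containing a number.
    A real extractor could be rule-based section parsing or a trained tagger."""
    sentences = re.split(r"(?<=[.!?])\s+", paper_text)
    return [s for s in sentences if re.search(r"\d", s)]

def frame_facts(facts: list[str], output_format: str = "social post") -> str:
    """The generative model only writes framing around numbered placeholders;
    the verbatim facts are substituted back in afterwards, so they cannot drift."""
    placeholders = "\n".join(f"{{FACT_{i}}}: {fact}" for i, fact in enumerate(facts))
    prompt = (
        f"Write a {output_format} that uses each placeholder token exactly once, "
        f"e.g. {{FACT_0}}, without rewording the facts themselves:\n{placeholders}"
    )
    draft = client.chat.completions.create(
        model="gpt-4o",  # assumption -- whichever model the app is configured with
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    for i, fact in enumerate(facts):
        draft = draft.replace(f"{{FACT_{i}}}", fact)
    return draft
```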
We can also add evaluation methods like G-Eval, where another LLM checks the output for factual consistency (https://cookbook.openai.com/examples/evaluation/how_to_eval_abstractive_summarization) (https://arxiv.org/pdf/2303.16634).
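A minimal sketch of a G-Eval-style consistency check in the spirit of the cookbook example. The prompt wording, the 1-5 scale, and the judge model name are assumptions:

```python
from openai import OpenAI

client = OpenAI()

CONSISTENCY_PROMPT = """You will be given a source document and a generated summary.
Rate the factual consistency of the summary with the source on a scale of 1 to 5,
where 5 means every claim in the summary is supported by the source.
Respond with the number only.

Source:
{document}

Summary:
{summary}
"""

def geval_consistency(document: str, summary: str) -> int:
    """Ask a judge LLM for a 1-5 factual-consistency score (G-Eval style)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption -- any capable judge model works here
        messages=[{"role": "user", "content": CONSISTENCY_PROMPT.format(document=document, summary=summary)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```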
Or we can use a non-LLM approach like Amazon's QUALS or other factual-accuracy evaluations (https://github.com/amazon-science/fact-check-summarization).
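QUALS itself is QA-based and would come from that repo, but even a much simpler non-LLM check would catch the kind of hallucinated quotes mentioned in the update below: verify that every quoted span in the output actually appears in the source paper. A rough sketch (the quote-matching heuristics are assumptions, not QUALS):

```python
import re

def verify_quotes(output_text: str, source_text: str) -> list[str]:
    """Return quoted spans in the model output that do not appear verbatim in the source.
    Whitespace is normalized; anything returned is a likely hallucinated quote."""
    normalize = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    source_norm = normalize(source_text)
    # Capture spans of 15+ characters inside straight or curly double quotes.
    quotes = re.findall(r'["“]([^"”]{15,})["”]', output_text)
    return [q for q in quotes if normalize(q) not in source_norm]

# Example: any non-empty result means the draft quotes something the paper never says.
# bad_quotes = verify_quotes(generated_post, paper_full_text)
```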
Update: we enabled OCR and changed top-k to 40. Using the "Generative AI and the Nature of Work" paper, it still hallucinated 3 quotes.
This ticket is to have a conversation between D3 and AM.
Questions:
- Are we tuning the model properly?
- How should we chunk the papers, or do we stuff the full article?
- Do we need a test set for tuning the model?
- How do we balance the configuration needed for a single file versus a group of files?
- Do we need to trigger different configurations for different needs?