rmusser01 / tldw

Too Long, Didn't Watch (TL/DW): Your Personal Research Multi-Tool - Open Source NotebookLM (eventually)
Apache License 2.0

Evaluation: RAG implementation #195

Open rmusser01 opened 2 weeks ago

rmusser01 commented 2 weeks ago

This issue tracks the evaluation of RAG implementations.

Frameworks:

Papers:

One-Offs:

Test Data:

rmusser01 commented 2 weeks ago

Unsorted

Evals:
https://arxiv.org/pdf/2407.03651
https://docs.smith.langchain.com/concepts/evaluation#evaluating-a-single-step-of-an-agent
https://arxiv.org/html/2405.13622v1
https://thenewstack.io/openai-rag-vs-your-customized-rag-which-one-is-better/
https://huggingface.co/learn/cookbook/rag_evaluation
https://github.com/explodinggradients/ragas
https://codecompass00.substack.com/p/llm-evaluation-leaderboards
https://docs.ragas.io/en/latest/index.html
https://arxiv.org/abs/2309.01431
https://medium.com/@techsachin/ruler-benchmark-to-evaluate-long-context-modeling-capabilities-of-language-models-7eb13a269e36
https://medium.com/neuml/vector-search-rag-landscape-a-review-with-txtai-a7f37ad0e187
https://www.youtube.com/watch?v=Kp_AGKtql_U
https://www.trulens.org/trulens_eval/evaluation/feedback_functions/
https://docs.ragas.io/en/latest/concepts/testset_generation.html
https://github.com/Arize-ai/phoenix
https://github.com/truera/trulens
https://arxiv.org/abs/2407.01370v1
https://github.com/Marker-Inc-Korea/AutoRAG
https://huggingface.co/learn/cookbook/en/rag_evaluation
https://www.youtube.com/playlist?list=PLfaIDFEXuae0um8Fj0V4dHG37fGFU8Q5S
https://arxiv.org/pdf/2408.02666v1
https://arxiv.org/abs/2407.18416

Evals 101:
https://docs.smith.langchain.com/concepts/evaluation#evaluating-a-single-step-of-an-agent
https://thenewstack.io/openai-rag-vs-your-customized-rag-which-one-is-better/

Benchmarks:
https://github.com/TIGER-AI-Lab/LongICLBench
https://arxiv.org/html/2403.19889v1
https://arxiv.org/pdf/2407.03651
https://github.com/snorkel-ai/long-context-eval
https://arxiv.org/html/2405.13622v1 - Tuning your search is better than a bigger model.
https://arxiv.org/abs/2309.01431
https://huggingface.co/papers/2406.10149
https://github.com/booydar/babilong
https://medium.com/@techsachin/ruler-benchmark-to-evaluate-long-context-modeling-capabilities-of-language-models-7eb13a269e36

Evals:
https://github.com/Arize-ai/phoenix
https://github.com/truera/trulens
https://github.com/Marker-Inc-Korea/AutoRAG
https://huggingface.co/learn/cookbook/en/rag_evaluation
https://www.youtube.com/playlist?list=PLfaIDFEXuae0um8Fj0V4dHG37fGFU8Q5S

Ragas:
https://www.youtube.com/watch?v=fWC4VxolWAk
https://www.analyticsvidhya.com/blog/2024/05/a-beginners-guide-to-evaluating-rag-pipelines-using-ragas/
https://github.com/explodinggradients/ragas
https://docs.ragas.io/en/latest/index.html
https://docs.ragas.io/en/latest/concepts/testset_generation.html

LLM as a Judge:
https://cameronrwolfe.substack.com/p/llm-as-a-judge
https://arxiv.org/abs/2407.10817

SummRAG:
https://github.com/ncsulsj/Robust_Summarization

Test Data:
https://docs.ragas.io/en/latest/concepts/testset_generation.html
https://www.turingpost.com/p/sytheticdata

rmusser01 commented 2 days ago

https://github.com/illuin-tech/grouse

rmusser01 commented 1 day ago

RAG Datasets:
https://huggingface.co/datasets/enelpol/rag-mini-bioasq
https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia