[Tracking]: Evaluation Integrations

Feature Description

[ ] Retrieval: https://github.com/jerryjliu/llama_index/issues/6554
[ ] RAG / Knowledge-Intensive QA: https://github.com/jerryjliu/llama_index/issues/6919
[ ] MMLU?
[ ] https://github.com/openai/evals?

TODO: investigate more RAG-specific benchmarks rather than retrieval-only or generation-only.

References / Areas of Exploration

Academic work on FLARE - RAG + reranking. Can get some inspiration from their evaluation methodology
Wizard of Wikipedia - Knowledge Augmented Dialogue. Not sure if there is any evaluation.
Evaluation of Dataset Selection for Pre-Training and Fine-Tuning Transformer Language Models for Clinical Question Answering
- CLiCR
Streaming QA.
- The nice thing about this dataset is that one can potentially segment out the dataset. But I'm not sure if this is possible to do (i.e. from t=t_1 to t=t_2) v.s. an aggregate over time from t=0. Otherwise the dataset is too massive (110GB compressed). Even the index is massive (400MB+). See issue: https://github.com/deepmind/streamingqa/issues/2
- Segment: https://data.statmt.org/news-crawl/en/news.2007.en.shuffled.deduped.gz ~ 170M. Still pretty large but probably some kind of segmenting can be done in realtime using a BTree / something like Polars DB.

Target Domains

Finance: FinQA - FinTabNet Corpus.
Codebase?? e.g. codequeries, CodeQA

TODO:

Compile all of the dataset sizes (we should prefer fine-tuning datasets with out-of-domain corpora rather than large and generic corpora which are likely representative of the LLM training datasets)

Hi, @jon-chuang! I'm Dosu, and I'm here to help the LlamaIndex team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you created discusses the evaluation of integrations for the project. You outlined various tasks to investigate, such as retrieval, RAG/Knowledge-Intensive QA, and MMLU. You also provided references and areas for further research. However, there hasn't been any activity on the issue since then.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LlamaIndex repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your contribution, and we look forward to hearing from you soon!

Best regards, Dosu

run-llama / llama_index