lm-sys / RouteLLM

A framework for serving and evaluating LLM routers - save LLM costs without compromising quality!
Apache License 2.0

Trouble understanding datasets used #22

Closed aoezis closed 1 month ago

aoezis commented 1 month ago

Hello,

I have not clearly understood the format and source of the datasets used to train these routers. They are said to be published on Hugging Face, but, for example, I can't find the dataset used to train routellm/mf_gpt4_augmented. As I understand from the code in train_matrix_factorization.py, there has to be a JSON dataset with the keys idx, model_a, model_b, and winner, but there is no such dataset on Hugging Face. Could you clarify the format and creation of the dataset used for MF training?
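For reference, here is a minimal sketch of what one record with those four keys might look like as a JSON line. The field values (model names, winner label) are illustrative assumptions, not taken from the actual training data:

```python
import json

# Hedged sketch: one preference "battle" record in the shape implied by
# train_matrix_factorization.py (keys: idx, model_a, model_b, winner).
# All values below are hypothetical examples, not real dataset rows.
record = {
    "idx": 0,               # unique row index
    "model_a": "gpt-4",     # first model in the battle
    "model_b": "llama-2-70b",  # second model in the battle
    "winner": "model_a",    # which side won the comparison
}

# Serialize as one JSON line (a common layout for such datasets).
line = json.dumps(record)
print(line)
```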

aoezis commented 1 month ago

I'd like to train a similar router for a language other than English, and I'd like to know the exact format and where the data came from. Is it something like 80k samples from Chatbot Arena and 110k from nectar-gpt-4?

iojw commented 1 month ago

Hi there, you are correct! The GPT-4 augmented MF router was trained using ~60k battles from Chatbot Arena and ~110k from Nectar using a GPT-4 judge.

Chatbot Arena data is close to: https://huggingface.co/datasets/routellm/lmsys-arena-human-preference-55k-thresholds

Nectar dataset: https://huggingface.co/datasets/routellm/gpt4_judge_battles
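Putting the two sources together for MF training means normalizing each battle into the idx/model_a/model_b/winner shape. A minimal sketch of such a normalizer is below; the input column names (`winner_model_a`, etc.) are assumptions about the linked datasets' schemas, not confirmed from them:

```python
# Hedged sketch: map one battle row into the matrix-factorization training
# format (idx, model_a, model_b, winner). The source column names here are
# hypothetical; check the actual Hugging Face dataset schemas before use.

def to_mf_record(row: dict, idx: int) -> dict:
    """Normalize a single preference battle into the MF training shape."""
    winner = "model_a" if row["winner_model_a"] == 1 else "model_b"
    return {
        "idx": idx,
        "model_a": row["model_a"],
        "model_b": row["model_b"],
        "winner": winner,
    }

# Illustrative row (hypothetical values, not from the real datasets).
example = {"model_a": "gpt-4", "model_b": "llama-2-70b", "winner_model_a": 1}
print(to_mf_record(example, 0))
```

Rows from both the Arena and Nectar datasets could then be passed through the same function, with a running counter supplying idx.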

aoezis commented 1 month ago

Alright, got it. Thanks a lot!