Give users the ability to compare models' outputs for a given task.

huggingface / hub-docs

Docs of the Hugging Face Hub

Apache License 2.0

271 stars 233 forks source link

🗒️ Motivation

When a user selects a specific task on the Hugging Face Hub - for example, image-to-text:

That user is shown a series of models, with no guidance as to which model might be state of the art, or which might be the most performant for their use case.

To test the capabilities and behavior of each model, the user must:

Open the link to each model in a new browser tab.
Read through each model's model card.
Test out each model with an image, if the model has a Spaceor a Colab notebook available (not every model does).
Cross-reference each model with the state-of-the-art leaderboards on Papers with Code.

🙏 Desired Behavior

The user should be able to:

Select a given task (for example, image-to-text).
Select one or more models to test, for that use case.
Input a piece of data, to test (an image, in the case of image-to-text).
View the output of each model, given that input data -- side by side, to compare the performance and behavior.

This would be so cool, I really like the user story you made with the side-by-side benchmark!

We were talking about integrating something like this but into tasks pages, so the workflow would be:

Select a task
Read a bit about the task and discover some SOTA models (we tried to editorialize that a bit by writing an explanatory text about each task and associating a note to each hand-curated model).
Compute multiple widgets with a single input to compare the outputs (today you only get one curated model to test the task).

And maybe when you select a particular task on /models we could add a link to the task page:

This is probably the simplest way of doing it but I understand that it's not the same as having it directly integrated into the /models page.

So maybe we want to go further and do it exactly like you said: integrate it directly into the /models page: you drag/input a picture/audio/text and all the visible models on the page compute and it switches to "benchmark mode" (that could be gamechanger 🤯). That will of course be a lot of work and I'm not even sure that we can hold that many computations at the same time 👀 (edit: we will find a way 👍 ).

huggingface / hub-docs

Give users the ability to compare models' outputs for a given task. #56

🗒️ Motivation

🙏 Desired Behavior