huggingface / hub-docs

Docs of the Hugging Face Hub
http://hf.co/docs/hub

Exploratory Analysis of Models on Hub #61

Open lewtun opened 3 years ago

lewtun commented 3 years ago

Using the huggingface_hub library, I was able to collect some statistics on the 9,984 models that are currently hosted on the Hub. The main goal of this exercise was to answer the questions addressed in the sections below.

Number of models per dimension

Without applying any filters on the architecture names, the number of models per criterion is shown in the table below:

| Criterion | Number of models |
|---|---|
| Has architecture | 8129 |
| Has dataset | 1241 |
| Has metric | 359 |

These numbers include models for which a task may not be easily inferred from the architecture alone. For example, BertModel would presumably be associated with a feature-extraction task, but such models are not simple to evaluate.

By filtering for architecture names that contain any of "For", "MarianMTModel" (translation), or "LMHeadModel" (language modelling), we arrive at the following table:

| Criterion | Number of models |
|---|---|
| Has task | 7452 |
| Has dataset | 1150 |
| Has metric | 337 |

Architecture frequencies

Some models either have no architecture (e.g. the info is missing from the config.json file or the model belongs to another library like Flair), or multiple ones:

| Number of architectures | Number of models |
|---|---|
| 0 | 1755 |
| 1 | 8125 |
| 2 | 1 |
| 3 | 3 |

Based on these counts, it makes sense to focus only on models with a single architecture.

Number of models per task

For models with a single architecture, I extract the task names from the architecture name according to a set of suffix-based mappings (sketched below):
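
A minimal sketch of the idea, assuming the task is simply the suffix after "For" plus special cases such as MarianMTModel and LMHeadModel (the actual mapping has more rules than shown here):

```python
# Illustrative only: special cases that do not follow the ...For<Task> naming scheme.
SPECIAL_CASES = {
    "MarianMTModel": "Translation",
    "LMHeadModel": "LanguageModeling",
}


def task_from_architecture(arch: str) -> str:
    """Guess a task name from a transformers architecture name."""
    for suffix, task in SPECIAL_CASES.items():
        if arch.endswith(suffix):
            return task
    if "For" in arch:
        # e.g. BertForSequenceClassification -> SequenceClassification
        return arch.split("For", maxsplit=1)[1]
    # e.g. BertModel -> no task-specific head
    return "Model"
```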

The resulting frequency counts are shown below:

LanguageModeling                    3250
Translation                         1354
SequenceClassification               829
ConditionalGeneration                766
Model                                655
QuestionAnswering                    364
CTC                                  318
TokenClassification                  286
PreTraining                          163
MultipleChoice                        37
MultiLabelSequenceClassification      17
ImageClassification                   15
MultiLabelClassification              11
Generation                             7
ImageClassificationWithTeacher         4

Fun stuff

We can visualise which tasks are connected to which datasets as a graph. Here we show the top 10 tasks (measured by node connectivity), with the top 20 datasets marked in orange.

(Figure: tasks2datasets, a graph connecting tasks to datasets)
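
For reference, a rough sketch of how such a graph can be built with networkx; the task_dataset_pairs list here is a hypothetical stand-in for the (task, dataset) pairs extracted above:

```python
import networkx as nx

# Hypothetical input: one (task, dataset) pair per model that declares a dataset.
task_dataset_pairs = [
    ("question-answering", "squad"),
    ("text-classification", "glue"),
    # ...
]

G = nx.Graph()
for task, dataset in task_dataset_pairs:
    G.add_node(task, kind="task")
    G.add_node(dataset, kind="dataset")
    G.add_edge(task, dataset)

# Rank tasks by connectivity (degree) and keep the top 10.
tasks = [n for n, data in G.nodes(data=True) if data["kind"] == "task"]
top_tasks = sorted(tasks, key=G.degree, reverse=True)[:10]
```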

lewtun commented 3 years ago

thanks to a tip from @osanseviero and @julien-c, i can improve the analysis by making use of ModelInfo.pipeline_tag to infer the tasks. i'll update the analysis with this improved mapping

lewtun commented 3 years ago

Using the ModelInfo approach suggested by @osanseviero and @julien-c makes the analysis much simpler :)

Breakdown by task

First, the pipeline_tag already contains the task information and provides a more realistic grouping of the model architectures:

pipeline_tag        number_of_models
====================================
unknown                         2394
text-generation                 2286
translation                     1373
fill-mask                        958
text-classification              860
text2text-generation             748
question-answering               368
automatic-speech-recognition     329
token-classification             324
summarization                    228
conversational                    32
image-classification              22
audio-source-separation           19
table-question-answering          19
text-to-speech                    17
zero-shot-classification          17
feature-extraction                 8
object-detection                   5
voice-activity-detection           3
image-segmentation                 3
Semantic Similarity                2
sentence-similarity                2

We can see there are two similar-looking tasks: Semantic Similarity and sentence-similarity. By looking at the corresponding model IDs in the table below, we can see that they appear to be models produced using sentence-transformers:

model_id
Sahajtomar/french_semantic
Sahajtomar/sts-GBERT-de
osanseviero/full-sentence-distillroberta2
osanseviero/full-sentence-distillroberta3

Suggestion: rename Semantic Similarity to sentence-similarity to match the naming convention of pipeline tags

Drilling down on the unknown pipeline tags

We can see 2,394 models are currently missing a pipeline tag, which is about 24% of all the models currently on the Hub:

| has_pipeline_tag | num_models |
|---|---|
| True | 7623 |
| False | 2394 |

Of the models without a pipeline tag, we can drill down further by asking how many of them have a config.json file:

(Screenshot: counts of models without a pipeline tag, split by whether a config.json file is present)

Interestingly, the list of model IDs without a pipeline tag but with a config.json file includes models like distilbert-base-uncased, for which an architectures field probably did not exist in the config when the model was trained.

A list of the model IDs is attached:

2021-06-04_models-without-pipeline-tag-with-config.csv

Code snippet to pull metadata

```python
import pandas as pd
from huggingface_hub import HfApi


def get_model_metadata():
    # Fetch metadata for every model on the Hub (full=True also returns the
    # repo files as `siblings`).
    all_models = HfApi().list_models(full=True)
    metadata = []

    for model in all_models:
        has_readme = False
        has_config = False
        has_pipeline_tag = False
        pipeline_tag = "unknown"

        if model.pipeline_tag:
            pipeline_tag = model.pipeline_tag
            has_pipeline_tag = True

        # Check which files are present in the repo
        for sibling in model.siblings:
            if sibling.rfilename == "README.md":
                has_readme = True
            if sibling.rfilename == "config.json":
                has_config = True

        metadata.append(
            (
                model.modelId,
                pipeline_tag,
                model.tags,
                has_pipeline_tag,
                has_config,
                has_readme,
            )
        )

    df = pd.DataFrame(
        metadata,
        columns=[
            "model_id",
            "pipeline_tag",
            "tags",
            "has_pipeline_tag",
            "has_config",
            "has_readme",
        ],
    )
    return df
```
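
The tables above can then be reproduced from the resulting dataframe along these lines:

```python
df = get_model_metadata()

# Breakdown by task (the pipeline_tag table above)
print(df["pipeline_tag"].value_counts())

# Models with / without a pipeline tag, and the share of untagged models
print(df["has_pipeline_tag"].value_counts())
print((~df["has_pipeline_tag"]).mean())

# Drill-down: of the untagged models, how many have a config.json?
print(df.loc[~df["has_pipeline_tag"], "has_config"].value_counts())
```
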
osanseviero commented 3 years ago

Thanks for the analysis!

re: Semantic similarity. The user overrode the pipeline tag in the METADATA some months ago. I agree that they should be sentence-similarity, which is a fairly recent task.

Some brainstorm ideas for further analysis:

re: If I understand correctly, we have 2394 repos without a pipeline tag and we might want to put some effort on those. From those repos:

julien-c commented 3 years ago

we currently assume (see code in ModelInfo.ts, it's fairly short to read) that models that have a config.json file and no library name in their tags are transformers models.

Lots of model repos are empty (or WIPs) so I wouldn't aim to classify all models.

julien-c commented 3 years ago

PS/ fixed the metadata for one model in https://huggingface.co/Sahajtomar/french_semantic/commit/2392beb954ae32dafa587e03f278a0158d1da7b5

julien-c commented 3 years ago

and the other in https://huggingface.co/Sahajtomar/sts-GBERT-de/commit/935c5217fd8f03de0ccd9e6e3f34e21651573e84

julien-c commented 3 years ago

Finally, cc'ing model author @Sahajtomar for visibility. Let us know if any issue 🙂

osanseviero commented 3 years ago

we currently assume (see code in ModelInfo.ts, it's fairly short to read) that models that have a config.json file and no library name in their tags are transformers models.

Yes, I was suggesting that maybe we should make it more explicit and add the transformers tag to those. As we intend to expand our usage to more libraries, longer term I think we should reduce the magic that happens on our side and have transformers as an explicit tag. (related PR https://github.com/huggingface/moon-landing/pull/746)

julien-c commented 3 years ago

Yes agreed, probably not short term but when we start adding more validation to the yaml block in models we can 1/ add this rule 2/ update all updateable models on the hub

lewtun commented 3 years ago

while checking for suitable datasets for model evaluation, i discovered several models have typos / non-conventional naming for the datasets: tag.

using some fuzzy string matching i compiled a list of (model, dataset, closest_dataset_match), where the closest match to a canonical dataset in datasets was based (arbitrarily) on whether the Levenshtein similarity score is > 85.
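
for reference, a minimal sketch of this matching using rapidfuzz (the scorer and the way the canonical dataset ids are fetched here are assumptions on my part):

```python
from huggingface_hub import HfApi
from rapidfuzz import fuzz, process

# canonical dataset ids hosted on the Hub
canonical_datasets = [d.id for d in HfApi().list_datasets()]


def closest_dataset(name: str, threshold: int = 85):
    """Return the closest canonical dataset id, or None if nothing scores above the threshold."""
    match = process.extractOne(name, canonical_datasets, scorer=fuzz.ratio)
    if match and match[1] > threshold:
        return match[0]
    return None
```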

i wonder whether it would be useful to include some sort of validation in the yaml, along the lines of "did you mean dataset X?" where X is one of the datasets hosted on the hub?

non-canonical-datasets.csv

julien-c commented 3 years ago

i wonder whether it would be useful to include some sort of validation in the yaml, along the lines of "did you mean dataset X?" where X is one of the datasets hosted on the hub?

Yes 👍

Also check out https://observablehq.com/@huggingface/kaggle-dataset-huggingface-modelhub from @severo which looks great (cc @gary149)