Exploratory Analysis of Models on Hub

lewtun commented 3 years ago

Using the huggingface_hub library, I was able to collect some statistics on the 9,984 models that are currently hosted on the Hub. The main goal of this exercise was to find answers to the following questions:

How many model architectures can be mapped to tasks that we wish to evaluate? For example a model with the BertForSequenceClassification architecture is likely to be about text classification; similarly for the other ModelNameForXxx architectures.
How many models have an architecture, dataset, and metric in their metadata?
Which tasks are most common?

Number of models per dimension

Without applying any filters on the architecture names, the number of models per criterion is shown in the table below:

Has architecture	Has dataset	Has metric	Number of models
✅	❌	❌	8129
✅	✅	❌	1241
✅	✅	✅	359

These numbers include models for which a task may not be easily inferred from the architecture alone. For example BertModel would presumably be associated with a feature-extraction task, but these are not simple to evaluate.

By applying a filter on the architecture name to contain any of "For", "MarianMTModel" (translation), or "LMHeadModel" (language modelling), we arrive at the following table:

Has task	Has dataset	Has metric	Number of models
✅	❌	❌	7452
✅	✅	❌	1150
✅	✅	✅	337

Architecture frequencies

Some models either have no architecture (e.g. the info is missing from the config.json file or the model belongs to another library like Flair), or multiple ones:

Number of architectures	Number of models
0	1755
1	8125
2	1
3	3

Based on these counts, it thus makes sense to just focus on models with a single architecture.

Number of models per task

For models with a single architecture, I extract the task names from the architecture name according to the following mappings:

"MarianMTModel" => "Translation"
architectures containing "LMHeadModel", "LMHead", "MaskedLM", "CausalLM" => "LanguageModeling"
architectures containing "Model", "DPR", "Encoder" => "Model"

The resulting frequency counts are shown below:

LanguageModeling                    3250
Translation                         1354
SequenceClassification               829
ConditionalGeneration                766
Model                                655
QuestionAnswering                    364
CTC                                  318
TokenClassification                  286
PreTraining                          163
MultipleChoice                        37
MultiLabelSequenceClassification      17
ImageClassification                   15
MultiLabelClassification              11
Generation                             7
ImageClassificationWithTeacher         4

Fun stuff

We can visualise which tasks are connected to which datasets as a graph. Here we show the top 10 tasks (measured by node connectivity) with the top 20 datasets marked in orange

tasks2datasets

lewtun commented 3 years ago

thanks to a tip from @osanseviero and @julien-c, i can improve the analysis by making use of ModelInfo.pipeline_tag to infer the tasks. i'll update the analysis with this improved mapping

lewtun commented 3 years ago

Using the ModelInfo approach suggested by @osanseviero and @julien-c makes the analysis much simpler :)

Breakdown by task

First, the pipeline_tag already contains the task information and provides a more realistic grouping of the model architectures:

pipeline_tag        number_of_models
====================================
unknown                         2394
text-generation                 2286
translation                     1373
fill-mask                        958
text-classification              860
text2text-generation             748
question-answering               368
automatic-speech-recognition     329
token-classification             324
summarization                    228
conversational                    32
image-classification              22
audio-source-separation           19
table-question-answering          19
text-to-speech                    17
zero-shot-classification          17
feature-extraction                 8
object-detection                   5
voice-activity-detection           3
image-segmentation                 3
Semantic Similarity                2
sentence-similarity                2

We can see there are two similar-looking tasks: `Semantic Similarity` and `sentence-similarity`. By looking at the corresponding model IDs in the table below, we can see that they appear to be models produced using `sentence-transformers`	model_id
Sahajtomar/french_semantic
Sahajtomar/sts-GBERT-de
osanseviero/full-sentence-distillroberta2
osanseviero/full-sentence-distillroberta3

Suggestion: rename Sentence Similarity to sentence-similarity to match the naming convention of pipeline tags

Drilling down on the `unknown` pipeline tags

We can see 2,394 models are currently missing a pipeline tag, which is about 24% of all the models currently on the Hub:

has_pipeline_tag	num_models
True	7623
False	2394

Of the models without a pipeline tag, we can drill down further by asking how many of them have a config.json file:

Interestingly, the list of model IDs with a pipeline tag but no config.json file includes models like distilbert-base-uncased, for which an architecture field probably did not exist when this model was trained.

A list of the model IDs is attached:

2021-06-04_models-without-pipeline-tag-with-config.csv

Code snippet to pull metadata

import pandas as pd
from huggingface_hub import HfApi

def get_model_metadata():

    all_models = HfApi().list_models(full=True)
    metadata = []

    for model in all_models:
        has_readme = False
        has_config = False
        has_pipeline_tag = False
        pipeline_tag = "unknown"

        if model.pipeline_tag:
            pipeline_tag = model.pipeline_tag
            has_pipeline_tag = True

        for sibling in model.siblings:
            if sibling.rfilename == "README.md":
                has_readme = True
            if sibling.rfilename == "config.json":
                has_config = True

        metadata.append(
            (
                model.modelId,
                pipeline_tag,
                model.tags,
                has_pipeline_tag,
                has_config,
                has_readme,
            )
        )

    df = pd.DataFrame(
        metadata,
        columns=[
            "model_id",
            "pipeline_tag",
            "tags",
            "has_pipeline_tag",
            "has_config",
            "has_readme",
        ],
    )
    return df

osanseviero commented 3 years ago

Thanks for the analysis!

re: Semantic similarity. The user overrode the pipeline tag in the METADATA some months ago. I agree that they should be sentence-similarity, which is a fairly recent task.

Some brainstorm ideas for further analysis:

re: If I understand correctly, we have 2394 repos without a pipeline tag and we might want to put some effort on those. From those repos:

1253 have a config.json. Is there anything we can obtain from the config to understand what they are? (unrelated: I think we assume that these are Transformer based, so maybe it also makes sense to add transformers tag, which we currently don't use at the moment.)
Not necessarily useful, but it might also be worth classifying the ones without a config and without a pipeline tag. We could check if 1. the repos are empty; 2 the repo has a file that could correspond to another library; etc. This might be more challenging though.

julien-c commented 3 years ago

we currently assume (see code in ModelInfo.ts, it's fairly short to read) that models that have a config.json file and no library name in their tags is a transformers model.

Lots of model repos are empty (or WIPs) so I wouldn't aim to classify all models.

julien-c commented 3 years ago

PS/ fixed the metadata for one model in https://huggingface.co/Sahajtomar/french_semantic/commit/2392beb954ae32dafa587e03f278a0158d1da7b5

julien-c commented 3 years ago

and the other in https://huggingface.co/Sahajtomar/sts-GBERT-de/commit/935c5217fd8f03de0ccd9e6e3f34e21651573e84

pytorch unneeded as it's inferred from the files
task name can be in tags and will be inferred as the pipeline type
Fixed the library name (can also be in tags)

julien-c commented 3 years ago

Finally, cc'ing model author @Sahajtomar for visibility. Let us know if any issue 🙂

osanseviero commented 3 years ago

we currently assume (see code in ModelInfo.ts, it's fairly short to read) that models that have a config.json file and no library name in their tags is a transformers model.

Yes, I was suggesting that maybe we should make it more explicit and add the transformers tag to those. As we intend to expand our usage to more libraries, longer-term I think we should reduce the magic that happens in our side, and have transformers as an explicit tag. (related PR https://github.com/huggingface/moon-landing/pull/746)

julien-c commented 3 years ago

Yes agreed, probably not short term but when we start adding more validation to the yaml block in models we can 1/ add this rule 2/ update all updateable models on the hub

lewtun commented 3 years ago

while checking for suitable datasets for model evaluation, i discovered several models have typos / non-conventional naming for the datasets: tag.

using some fuzzy string matching i compiled a list of (model, dataset, closest_dataset_match) where the closest match to a canonical dataset in datasets was based (arbitrarily) on whether the levenstein distance is > 85.

i wonder whether it would be useful to include some sort of validation in the yaml, along the lines of "did you mean dataset X?" where X is one of the datasets hosted on the hub?

non-canonical-datasets.csv

julien-c commented 3 years ago

i wonder whether it would be useful to include some sort of validation in the yaml, along the lines of "did you mean dataset X?" where X is one of the datasets hosted on the hub?

Yes 👍

Also check out https://observablehq.com/@huggingface/kaggle-dataset-huggingface-modelhub from @severo which looks great (cc @gary149)

huggingface / hub-docs