lewtun opened 3 years ago
thanks to a tip from @osanseviero and @julien-c, i can improve the analysis by making use of ModelInfo.pipeline_tag to infer the tasks. i'll update the analysis with this improved mapping.
Using the ModelInfo approach suggested by @osanseviero and @julien-c makes the analysis much simpler :)

First, the pipeline_tag already contains the task information and provides a more realistic grouping of the model architectures:
| pipeline_tag | number_of_models |
|---|---|
| unknown | 2394 |
| text-generation | 2286 |
| translation | 1373 |
| fill-mask | 958 |
| text-classification | 860 |
| text2text-generation | 748 |
| question-answering | 368 |
| automatic-speech-recognition | 329 |
| token-classification | 324 |
| summarization | 228 |
| conversational | 32 |
| image-classification | 22 |
| audio-source-separation | 19 |
| table-question-answering | 19 |
| text-to-speech | 17 |
| zero-shot-classification | 17 |
| feature-extraction | 8 |
| object-detection | 5 |
| voice-activity-detection | 3 |
| image-segmentation | 3 |
| Semantic Similarity | 2 |
| sentence-similarity | 2 |
We can see there are two similar-looking tasks: Semantic Similarity and sentence-similarity. Looking at the corresponding model IDs in the table below, they appear to be models produced using sentence-transformers:

| model_id |
|---|
| Sahajtomar/french_semantic |
| Sahajtomar/sts-GBERT-de |
| osanseviero/full-sentence-distillroberta2 |
| osanseviero/full-sentence-distillroberta3 |
Suggestion: rename Semantic Similarity to sentence-similarity to match the naming convention of pipeline tags.
Unknown pipeline tags

We can see 2,394 models are currently missing a pipeline tag, which is about 24% of all the models currently on the Hub:
| has_pipeline_tag | num_models |
|---|---|
| True | 7623 |
| False | 2394 |
Of the models without a pipeline tag, we can drill down further by asking how many of them have a config.json file:
Interestingly, the list of model IDs without a pipeline tag but with a config.json file includes models like distilbert-base-uncased, for which an architectures field probably did not exist in the config when this model was trained.
A list of the model IDs is attached:
2021-06-04_models-without-pipeline-tag-with-config.csv
```python
import pandas as pd
from huggingface_hub import HfApi


def get_model_metadata():
    # Fetch the full list of models on the Hub, including tags and file listings.
    all_models = HfApi().list_models(full=True)
    metadata = []

    for model in all_models:
        has_readme = False
        has_config = False
        has_pipeline_tag = False
        pipeline_tag = "unknown"

        if model.pipeline_tag:
            pipeline_tag = model.pipeline_tag
            has_pipeline_tag = True

        # Check which files are present in the repo.
        for sibling in model.siblings:
            if sibling.rfilename == "README.md":
                has_readme = True
            if sibling.rfilename == "config.json":
                has_config = True

        metadata.append(
            (
                model.modelId,
                pipeline_tag,
                model.tags,
                has_pipeline_tag,
                has_config,
                has_readme,
            )
        )

    df = pd.DataFrame(
        metadata,
        columns=[
            "model_id",
            "pipeline_tag",
            "tags",
            "has_pipeline_tag",
            "has_config",
            "has_readme",
        ],
    )
    return df
```
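As a usage sketch, the counts shown earlier in this comment can be reproduced from the resulting dataframe (column names as defined above):

```python
# Usage sketch: reproduce the counts discussed above from the collected metadata.
df = get_model_metadata()

# Models per pipeline tag (first table in this comment).
print(df["pipeline_tag"].value_counts())

# How many models have a pipeline tag at all.
print(df["has_pipeline_tag"].value_counts())

# Of the models without a pipeline tag, how many also ship a config.json.
print(df.loc[~df["has_pipeline_tag"], "has_config"].value_counts())
```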
Thanks for the analysis!
re: Semantic similarity. The user overrode the pipeline tag in the metadata some months ago. I agree that they should be sentence-similarity, which is a fairly recent task.
Some brainstorm ideas for further analysis:
re: If I understand correctly, we have 2394 repos without a pipeline tag and we might want to put some effort on those. From those repos, some have a config.json. Is there anything we can obtain from the config to understand what they are? (Unrelated: I think we assume that these are Transformer-based, so maybe it also makes sense to add a transformers tag, which we don't use at the moment.)

We currently assume (see the code in ModelInfo.ts, it's fairly short to read) that models that have a config.json file and no library name in their tags are transformers models.
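For illustration only, that heuristic amounts to something like the Python sketch below (the actual logic lives in ModelInfo.ts; the tag set and function name here are made up):

```python
# Rough sketch of the heuristic described above; the real implementation is in
# ModelInfo.ts. KNOWN_LIBRARY_TAGS and infer_library are illustrative names only.
KNOWN_LIBRARY_TAGS = {"flair", "sentence-transformers"}  # illustrative subset

def infer_library(tags, filenames):
    # An explicit library tag in the model metadata always wins.
    for tag in tags:
        if tag in KNOWN_LIBRARY_TAGS:
            return tag
    # Otherwise, a repo with a config.json and no library tag is assumed to be transformers.
    if "config.json" in filenames:
        return "transformers"
    return None

print(infer_library(["de"], ["config.json", "pytorch_model.bin"]))  # -> "transformers"
```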
Lots of model repos are empty (or WIPs) so I wouldn't aim to classify all models.
PS/ fixed the metadata for one model in https://huggingface.co/Sahajtomar/french_semantic/commit/2392beb954ae32dafa587e03f278a0158d1da7b5 and the other in https://huggingface.co/Sahajtomar/sts-GBERT-de/commit/935c5217fd8f03de0ccd9e6e3f34e21651573e84

Finally, cc'ing model author @Sahajtomar for visibility. Let us know if there's any issue 🙂
> We currently assume (see the code in ModelInfo.ts, it's fairly short to read) that models that have a config.json file and no library name in their tags are transformers models.
Yes, I was suggesting that maybe we should make it more explicit and add the transformers tag to those. As we intend to expand our usage to more libraries, longer term I think we should reduce the magic that happens on our side and have transformers as an explicit tag. (Related PR: https://github.com/huggingface/moon-landing/pull/746)
Yes, agreed. Probably not short term, but when we start adding more validation to the YAML block in models we can 1/ add this rule, 2/ update all updateable models on the Hub.
while checking for suitable datasets for model evaluation, i discovered several models have typos / non-conventional naming in their datasets tags. using some fuzzy string matching, i compiled a list of (model, dataset, closest_dataset_match) tuples, where the closest match to a canonical dataset in datasets was based (arbitrarily) on whether the Levenshtein similarity score is > 85.
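For reference, a minimal sketch of that matching step using difflib from the standard library (the dataset names below are placeholders, and difflib's 0–1 ratio stands in for the > 85 score cutoff above):

```python
import difflib

# Placeholder inputs: canonical dataset ids on the Hub and the dataset tags
# declared in model cards.
canonical_datasets = ["squad", "glue", "xsum", "conll2003"]
model_dataset_tags = [("some-org/some-model", "conll03"), ("another/model", "glue")]

rows = []
for model_id, dataset_tag in model_dataset_tags:
    # cutoff=0.85 mirrors the (arbitrary) > 85 threshold mentioned above.
    close = difflib.get_close_matches(dataset_tag, canonical_datasets, n=1, cutoff=0.85)
    if close and close[0] != dataset_tag:
        rows.append((model_id, dataset_tag, close[0]))

print(rows)  # list of (model, dataset, closest_dataset_match) tuples
```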
i wonder whether it would be useful to include some sort of validation in the yaml, along the lines of "did you mean dataset X?" where X is one of the datasets hosted on the hub?
> i wonder whether it would be useful to include some sort of validation in the yaml, along the lines of "did you mean dataset X?" where X is one of the datasets hosted on the hub?
Yes 👍
Also check out https://observablehq.com/@huggingface/kaggle-dataset-huggingface-modelhub from @severo which looks great (cc @gary149)
Using the huggingface_hub library, I was able to collect some statistics on the 9,984 models that are currently hosted on the Hub. The main goal of this exercise was to answer a few questions about what is currently on the Hub. The assumption is that a model with a BertForSequenceClassification architecture is likely to be about text classification; similarly for the other ModelNameForXxx architectures.

Number of models per dimension
Without applying any filters on the architecture names, the number of models per criterion is shown in the table below:
These numbers include models for which a task may not be easily inferred from the architecture alone. For example, BertModel would presumably be associated with a feature-extraction task, but these are not simple to evaluate.

By applying a filter on the architecture name to contain any of "For", "MarianMTModel" (translation), or "LMHeadModel" (language modelling), we arrive at the following table:
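A minimal sketch of that filter, with a few illustrative architecture names (in the analysis they come from each model's config.json):

```python
# Keep only architectures whose task can plausibly be inferred from the name.
TASK_HINTS = ("For", "MarianMTModel", "LMHeadModel")

architectures = [
    "BertForSequenceClassification",  # "For..." -> task inferable
    "MarianMTModel",                  # translation
    "GPT2LMHeadModel",                # language modelling
    "BertModel",                      # bare encoder, task not easily inferable
]

filtered = [arch for arch in architectures if any(hint in arch for hint in TASK_HINTS)]
print(filtered)  # "BertModel" is dropped
```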
Architecture frequencies
Some models either have no architecture (e.g. the info is missing from the config.json file or the model belongs to another library like Flair), or multiple ones:

Based on these counts, it makes sense to focus only on models with a single architecture.
Number of models per task
For models with a single architecture, I extract the task names from the architecture name according to the following mappings:
The resulting frequency counts are shown below:
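A sketch of how such a mapping can be applied; the entries below are illustrative rather than the full mapping used for the counts:

```python
# Illustrative architecture-suffix -> task mapping (not the complete list).
ARCHITECTURE_TO_TASK = {
    "ForSequenceClassification": "text-classification",
    "ForQuestionAnswering": "question-answering",
    "ForTokenClassification": "token-classification",
    "MarianMTModel": "translation",
    "LMHeadModel": "text-generation",
}

def infer_task(architecture: str) -> str:
    for suffix, task in ARCHITECTURE_TO_TASK.items():
        if architecture.endswith(suffix):
            return task
    return "unknown"

print(infer_task("BertForSequenceClassification"))  # text-classification
print(infer_task("GPT2LMHeadModel"))                # text-generation
```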
Fun stuff
We can visualise which tasks are connected to which datasets as a graph. Here we show the top 10 tasks (measured by node connectivity) with the top 20 datasets marked in orange.
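A minimal sketch of how such a graph could be assembled with networkx (the (task, dataset) pairs below are placeholders for the ones extracted from the model metadata):

```python
import networkx as nx
import matplotlib.pyplot as plt

# Placeholder (task, dataset) pairs; in the analysis these come from the model metadata.
edges = [
    ("question-answering", "squad"),
    ("text-classification", "glue"),
    ("summarization", "xsum"),
    ("question-answering", "natural_questions"),
]

graph = nx.Graph()
graph.add_edges_from(edges)

# Rank nodes by degree to pick out the most connected tasks and datasets.
top_nodes = sorted(graph.degree, key=lambda pair: pair[1], reverse=True)
print(top_nodes)

# Draw the graph; in the original figure only the top datasets were highlighted in orange.
nx.draw_networkx(graph, with_labels=True)
plt.show()
```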