huggingface / hub-docs

Docs of the Hugging Face Hub
http://hf.co/docs/hub
Apache License 2.0

Improve the tagging of models #32

Closed lbourdois closed 2 years ago

lbourdois commented 2 years ago

Hi,

First of all, I hope I am opening this issue in the right place (I hesitated to post in this repo, but it seems less active with only issues in 1 year: https://github.com/huggingface/model_card).

I'll illustrate my observation by talking about French models, but the logic applies to any language.

By chance, I found the following model on the Hub: https://huggingface.co/dbmdz/electra-base-french-europeana-cased-generator. An ELECTRA model in French, released more than a year ago, and I had never heard of it? How is that possible? I realized it was simply because it was not referenced correctly (no "fr" tag). This probably explains why it was downloaded only 12 times last month. I think it's a shame.

So I did a little more research to see if I had missed any other French models that were not referenced and here is the list I came up with:

This represents 24 models. Compared with what the "fr" filter reports (https://huggingface.co/models?language=fr), that is about 7.4% (24/(24+300)) of the French models that are not referenced. So I think it would be important to improve the referencing.
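The proportion above can be checked with a few lines of Python. This is only a sketch of the arithmetic: `share_percent` is a hypothetical helper name, and the two counts are the ones quoted in this comment (24 untagged models, roughly 300 returned by the "fr" filter).

```python
def share_percent(part, total):
    """Return `part` as a percentage of `total`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * part / total


untagged = 24   # French models found without the "fr" tag
tagged = 300    # approximate count reported by the "fr" filter

# Untagged models as a share of all French models (tagged + untagged).
print(f"{share_percent(untagged, untagged + tagged):.1f}%")  # 7.4%
```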

I have two ideas to submit:

A slightly different but related topic is multilingual models. Should multilingual models be tagged with all the languages they contain or not? This approach has been adopted for the Helsinki-NLP models (an example: https://huggingface.co/Helsinki-NLP/opus-mt-af-fr, tagged with "af" and "fr"). But this is not the case for the Geotrend models (an example: https://huggingface.co/Geotrend/bert-base-en-fr-cased, which has neither "en" nor "fr") or for T-Systems (an example: https://huggingface.co/T-Systems-onsite/cross-en-fr-roberta-sentence-transformer, which has neither "en" nor "fr"). I haven't checked the datasets, but I guess the same problem applies there too. So I think this is a point to harmonize.

Have a nice day :)

julien-c commented 2 years ago

Hi @lbourdois thanks for opening this issue.

We are thinking of implementing a simple-to-use kind of Pull Request workflow that would make sense on models, datasets, and spaces.

We don't want to make it as complex/feature-rich as GitHub PRs for instance, as we want to build the most specific set of tools for ML.

We won't ship this in the super short-term though. In the meantime, what we suggest is to reach out to model authors (can be in a GitHub issue for instance, or on our Forum on discuss.huggingface.co) and ask them to update their metadata. It is a bit tedious, so let us know if we can help automate this 🙂

lbourdois commented 2 years ago

Hi @julien-c

What you are planning seems to address the problem.

Until this feature is available, would it be possible to display a generic message to users pushing a new model to the Hub, reminding them to fill in the tags (languages in this case, but also, for example, the task handled by the model) and perhaps the model card as well? I don't think it would be very time-consuming, and this simple reminder would limit future proofreading work. I say this because of a quick test I just did: looking at the last 30 models added to the Hub (https://huggingface.co/models?sort=modified), out of the 28 NLP models, 23 did not have a language tag.

For the French models I listed, I will, as you suggest, try to contact the authors, and I will come back to you in 10-14 days to indicate which ones have not responded.

lbourdois commented 2 years ago

Hi everyone,

I mention you here because I can't include more than two URL links and mention more than 2 people on the Hugging Face forum.

This topic aims to add a "fr" tag to the French models that don't have one at the moment, so that they are visible to as many people as possible via the Hub (see above for more information).

As I can't add it myself, I'm trying to get in touch with the authors of the concerned models to update their metadata.

Thank you in advance for your cooperation,

stefan-it commented 2 years ago

Hi @lbourdois ,

sorry for that! I've uploaded the model cards for our @dbmdz models incl. the correct language tag :)

stefan-it commented 2 years ago

@PhilipMay could maybe introduce the language tag for the `T-Systems-onsite/cross-en-fr-roberta-sentence-transformer` model.

PhilipMay commented 2 years ago

> @PhilipMay could maybe introduce the language tag for the `T-Systems-onsite/cross-en-fr-roberta-sentence-transformer` model.

Thanks for the hint. Done.

abhilash1910 commented 2 years ago

Hi @lbourdois, I have added the French tag for https://huggingface.co/abhilash1910/french-roberta

elishowk commented 2 years ago

Hate-speech-CNERG, can you add the "fr" tag to the metadata of the following model, please?

This organisation is composed of @punyajoy and @debjoy10; pinging them to check this out.

lbourdois commented 2 years ago

Hi @julien-c,

So it's been two weeks since the attempt to contact the authors of the models. Some tags have been added, but the majority of the models are still not listed. I don't know whether there is any option other than waiting for the tool you mentioned.

Since my last post, I also noticed a point about datasets and tags: there can be several tags for one language. For example, there are 4 datasets tagged "fr-FR" (https://huggingface.co/datasets?languages=languages:fr-FR), which are about French but are not found if you filter the datasets with the "fr" tag (https://huggingface.co/datasets?languages=languages:fr). This would have to be investigated, but the same phenomenon can also be found with other examples:

I think it is worth keeping these subtags, since they account for the different variants of a language. They make it possible to build extremely specific models (a model for French from France, for example) or, on the contrary, models covering as many varieties as possible (a French model covering the most varieties possible: https://en.wikipedia.org/wiki/French_language#Varieties). This would also give better visibility to sign-language models and datasets. We could then imagine a simple button that displays a drop-down menu on the languages page (https://huggingface.co/languages) when we want to see all the possible variants of a given language:

image

We would then have the counts per subtag (the xx), and their sum would equal the number displayed for a given language. However, care should be taken to count a dataset with several variants of the same language only once.
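To make the counting concern concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the dataset names and their tags are made up, `base_language` and `count_by_language` are hypothetical helpers, and the base language is taken to be the first part of a tag such as "fr-FR". It is not how the Hub computes its numbers.

```python
from collections import defaultdict


def base_language(tag):
    """Return the primary subtag of a language tag, e.g. 'fr-FR' -> 'fr'."""
    return tag.replace("_", "-").split("-")[0].lower()


def count_by_language(datasets):
    """Count datasets per base language and per exact tag.

    `datasets` maps a dataset name to its list of language tags.
    Sets are used so that a dataset is counted once per language,
    even when it carries several variants of that language.
    """
    per_language = defaultdict(set)
    per_subtag = defaultdict(set)
    for name, tags in datasets.items():
        for tag in tags:
            per_language[base_language(tag)].add(name)
            per_subtag[tag].add(name)
    totals = {lang: len(names) for lang, names in per_language.items()}
    subtotals = {tag: len(names) for tag, names in per_subtag.items()}
    return totals, subtotals


# Hypothetical data, for illustration only.
datasets = {
    "dataset-a": ["fr"],
    "dataset-b": ["fr-FR"],
    "dataset-c": ["fr-FR", "fr-CA"],
    "dataset-d": ["en"],
}
totals, subtotals = count_by_language(datasets)
print(totals)     # {'fr': 3, 'en': 1}
print(subtotals)  # {'fr': 1, 'fr-FR': 2, 'fr-CA': 1, 'en': 1}
```

In this made-up example the subtag counts for French add up to 4 while only 3 distinct datasets exist, because "dataset-c" carries two variants. That is the double counting the language total would need to avoid.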

I don't know what you think about this idea.

Have a nice day :)

punyajoy commented 2 years ago

> Hate-speech-CNERG can you add the tag "fr" to the metadata of the following model please ?
>
> This organisation is composed of @punyajoy, @Debjoy10, pinging them to check this out

Added the language tag 👍

elishowk commented 2 years ago

Hi @ydshieh, if you get this message, could you add the "fr" tag to the metadata of the following model, please?

Thanks

elishowk commented 2 years ago

Hi @lbourdois, I didn't find a GitHub user handle for the user WikinewsSum; it seems to be an anonymous system user. @osanseviero, do you happen to know who the maintainer is?

elishowk commented 2 years ago

Hi there, I just reached out by mail to the last user of your list, WikinewsSum. Let's wait and see. Regards.

julien-c commented 2 years ago

@lbourdois I saw you started using the Hub PR feature on hf.co to fix those. Thank you so much!

Please let us know of any improvements we can make to keep this as easy as possible.

lbourdois commented 2 years ago

@julien-c I posted my first feedback here: https://huggingface.co/spaces/huggingface/HuggingDiscussions/discussions/1

There is one point I am wondering about, though. Updating the tags of datasets is no problem: it is enough to open a PR that modifies the README file. However, to update the tags of models, a PR on the README doesn't seem to be enough, according to the feedback I've just received: https://huggingface.co/stanfordnlp/corenlp-french/discussions/1. So I wonder whether Hub PRs allow changes to model tags or not. If yes, what have I misunderstood? If not, is there any way to add this feature?

Edit: It worked well for https://huggingface.co/Felix92/doctr-dummy-tf-sar-resnet31/discussions/1, so are there models where this is possible and others where it is not? 🤔
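For context, tags on the Hub are read from the YAML metadata block at the top of a repo's README, under the `language` key, which is why a README PR is normally enough. Below is a minimal sketch of checking a README for declared languages. `declared_languages` is a hypothetical helper and the sample card is made up; the parser is deliberately simplistic (it handles only `language: fr`, `language: [en, fr]` and a `- fr` list), so a real YAML library would be the safer choice in practice.

```python
def declared_languages(readme_text):
    """Return the language codes declared in a README's YAML front matter."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no metadata block at all
    try:
        # Index of the line that closes the metadata block.
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        return []  # block never closed, treat as no metadata
    block = lines[1:end]
    langs = []
    for i, line in enumerate(block):
        if not line.startswith("language:"):
            continue
        value = line[len("language:"):].strip()
        if value:
            # Inline form: `language: fr` or `language: [en, fr]`
            langs = [v.strip().strip("'\"") for v in value.strip("[]").split(",")]
        else:
            # Block form: following lines shaped like `- fr`
            for item in block[i + 1:]:
                stripped = item.strip()
                if not stripped.startswith("- "):
                    break
                langs.append(stripped[2:].strip().strip("'\""))
        break
    return [lang for lang in langs if lang]


# Made-up model card, for illustration only.
card = "---\nlanguage:\n- en\n- fr\nlicense: mit\n---\n# My model\n"
print(declared_languages(card))           # ['en', 'fr']
print(declared_languages("# My model"))   # []
```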

osanseviero commented 2 years ago

@lbourdois these model repos are automatically generated from a Stanford repository, so in this case they need to fix the script that creates the repos:

https://github.com/stanfordnlp/huggingface-models

lbourdois commented 2 years ago

Thank you @osanseviero for enlightening me on this topic

lbourdois commented 2 years ago

All the models listed are now correctly tagged, or a Hub PR has been submitted to tag them. I am therefore closing this issue. Thank you all :)

osanseviero commented 2 years ago

Thank you for the contribution!! :fire: :fire: