Support for datacards/datasheets

davanstrien commented 2 years ago

Thanks for this library -- I've just started playing with this, and it looks like it is going to be super useful :)

Are there any plans for also supporting the creation of datacard/datasheets in this library?

I think this could be quite useful for a few use cases. In particular, being able to template out some standard information might be useful for organizations that want to standardize parts of a datacard. For example, in https://huggingface.co/datasets/BritishLibraryLabs/EThOS-PhD-metadata, we may want to be able to pass in a list of names or ORCID iDs to go under https://huggingface.co/datasets/BritishLibraryLabs/EThOS-PhD-metadata#dataset-curators.

This could end up looking something like:

datacard = DataCard.from_template(
    card_data=DataCardData(  # Card metadata object that will be converted to YAML block
        license='mit',
        tags=['image-classification'],
        ...
    ),
    template_path='my_data_template.md',  # The template we just wrote!
    dataset_id='cool-model',  # Jinja template kwarg
    external_url='data.bl....',  # Jinja template kwarg
    curators=['name1', 'name2'],
)

I think this could also be useful for organizations/users using the hub to store data that is actively being developed/annotated. They could then use this feature to automagically create some key stats about the dataset i.e. number of instances, label frequency breakdowns, annotator agreement scores etc. and keep that documentation in sync with a changing dataset. I had planned to add something like this to https://github.com/davanstrien/hugit-cli/ but would rather piggyback on something else!

nateraw commented 2 years ago

Absolutely! This was why I originally set up ModelCard to inherit from RepoCard. RepoCard currently inits a CardData object though, which is specific to models (which isn't really right). It would be great if we:

- Add this feature, along with a default dataset card based on the one here.
- Figure out a better way of instantiating the card data using the correct object (CardData/DataCardData).
- Perhaps also rename CardData -> ModelCardData to avoid confusion?

As for this:

> They could then use this feature to automagically create some key stats about the dataset i.e. number of instances, label frequency breakdowns, annotator agreement scores etc. and keep that documentation in sync with a changing dataset

Right now, once the card is written, it's just text. So there's no way of automagically updating the card's text itself without recreating the card. Recreating the card would be easy though, as you'd just pass the updated values to the from_template fn again and re-push the new card to overwrite the old. Does that work for you?
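To make the "recreate with updated values" idea concrete, here is a minimal sketch using only the Python standard library. It is not the modelcards API: `CARD_TEMPLATE`, `render_card`, and the placeholder names are all hypothetical, and `string.Template` stands in for the Jinja template that from_template would use. The point is only that the stats are recomputed and the whole card text is regenerated each time the data changes.

```python
from collections import Counter
from string import Template

# Hypothetical template; the placeholder names are illustrative only.
CARD_TEMPLATE = Template(
    "# Dataset card for $dataset_id\n"
    "\n"
    "Number of instances: $num_instances\n"
    "\n"
    "Label frequencies:\n"
    "$label_table\n"
)


def render_card(dataset_id, labels):
    """Recompute the stats from the labels and render a fresh card text."""
    counts = Counter(labels)
    # One bullet per label, sorted by label name for a stable output.
    label_table = "\n".join(
        f"- {label}: {count}" for label, count in sorted(counts.items())
    )
    return CARD_TEMPLATE.substitute(
        dataset_id=dataset_id,
        num_instances=len(labels),
        label_table=label_table,
    )


# First version of the dataset.
first = render_card("my-dataset", ["cat", "dog", "cat"])

# After more annotation, render again from scratch with the new values.
updated = render_card("my-dataset", ["cat", "dog", "cat", "dog", "bird"])
```

In this sketch the updated text would then be pushed to overwrite the old card; the push step is left out because it depends on the library and on hub credentials.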

CC: @lhoestq @mariosasko - This might be nice for folks who are creating their own datasets programmatically.

nateraw commented 2 years ago

Small update here - currently, you can abuse ModelCard.from_template and CardData to upload data sheets/data cards.

I'm doing that here. None of the fields in CardData are required, so you can just pass whatever you want in the YAML header data. When pushing, just make sure to supply repo_type="dataset" and it'll validate the YAML you create against the dataset YAML block schema (which actually doesn't have any required fields...it just requires that the YAML block isn't empty).
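For anyone unsure what ends up in the card, here is a rough standard-library sketch of the resulting text: a YAML header block followed by the Markdown body. The helpers `to_yaml_block` and `build_card_text` are hypothetical and are not part of modelcards; the serializer only handles flat string and list values, and the empty-metadata check merely imitates the "YAML block isn't empty" rule described above rather than reproducing the hub's validation.

```python
def to_yaml_block(metadata):
    """Serialize a flat dict (strings and lists only) as a YAML header block."""
    if not metadata:
        # Imitates the rule that the YAML block must not be empty.
        raise ValueError("YAML block must not be empty")
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines)


def build_card_text(metadata, body):
    """Join the YAML header and the Markdown body into one card text."""
    return to_yaml_block(metadata) + "\n\n" + body.strip() + "\n"


card_text = build_card_text(
    {"license": "mit", "tags": ["image-classification"]},
    "# Dataset card\n\nCurators: name1, name2",
)
```

With the real library, CardData and ModelCard.from_template take care of this step, and the repo_type="dataset" argument at push time selects the dataset schema for validation.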

davanstrien commented 2 years ago

Thanks! I was planning to play around with this a bit more tomorrow -- I'll let you know how I get on.

nateraw / modelcards

Support for datacards/datasheets #36