Open davanstrien opened 2 years ago
Absolutely! This was why I originally set up ModelCard to inherit from RepoCard. RepoCard currently inits a CardData object though, which is specific to models (which isn't really right). It would be great if we:
CardData/DataCardData). CardData -> ModelCardData to avoid confusion?
As for this:
They could then use this feature to automagically create some key stats about the dataset i.e. number of instances, label frequency breakdowns, annotator agreement scores etc. and keep that documentation in sync with a changing dataset
Right now, once the card is written, it's just text. So there's no way of automagically updating the card's text itself without recreating the card. Recreating the card would be easy though, as you'd just pass the updated values to the from_template fn again and re-push the new card to overwrite the old. Does that work for you?
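As a rough illustration of that "recreate rather than patch" flow, here is a minimal sketch using only the standard library. `CARD_TEMPLATE` and `render_card` are hypothetical names, and `string.Template` merely stands in for the library's own templating; pushing the result to the Hub is not shown.

```python
from string import Template

# Hypothetical card template. string.Template is only a stand-in for
# whatever template format the card library actually uses.
CARD_TEMPLATE = Template(
    "# $name\n\n"
    "Number of instances: $num_instances\n"
)


def render_card(values: dict) -> str:
    """Build the full card text from scratch from the given values."""
    return CARD_TEMPLATE.substitute(values)


# First version of the card.
card_v1 = render_card({"name": "my-dataset", "num_instances": 100})

# The dataset grew: re-render the whole card with the updated values
# instead of trying to edit the old text in place.
card_v2 = render_card({"name": "my-dataset", "num_instances": 250})
```

The new text would then be pushed to overwrite the old card, so the only state that needs to be kept around is the template and the current values.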
CC: @lhoestq @mariosasko - This might be nice for folks who are creating their own datasets programmatically.
Small update here - currently, you can abuse ModelCard.from_template and CardData to upload data sheets/data cards.
I'm doing that here. None of the fields in CardData are required, so you can just pass whatever you want in the yaml header data. When pushing, just make sure to supply repo_type="dataset" and it'll validate the yaml you create against the dataset YAML block schema (which actually doesn't have any required fields...it just requires that the YAML block isn't empty).
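To make the shape of such a card concrete, here is a small standard-library sketch of a card with a free-form YAML header. `build_card_text` is a hypothetical helper, not part of the library; it only handles flat string values and lists of strings, and the field names used are purely illustrative. The non-empty check mirrors the one constraint described above.

```python
def build_card_text(metadata: dict, body: str) -> str:
    """Prepend a YAML header block to a markdown body.

    Not a general YAML serializer: only flat string values and
    lists of strings are supported.
    """
    if not metadata:
        # Mirrors the one rule mentioned above: the block must not be empty.
        raise ValueError("the YAML block must not be empty")
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body


# Illustrative metadata; any keys could be passed here.
card_text = build_card_text(
    {"license": "mit", "language": ["en"]},
    "# My dataset\n",
)
```

In practice the library builds this header from the CardData object, so a helper like this is only useful for seeing what ends up at the top of the README.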
Thanks! I was planning to play around with this a bit more tomorrow -- I'll let you know how I get on.
Thanks for this library -- I've just started playing with this, and it looks like it is going to be super useful :)
Are there any plans for also supporting the creation of datacard/datasheets in this library?
I think this could be quite useful for a few use cases. In particular, being able to template out some standard information could help organizations that want to standardize parts of a datacard. For example, in https://huggingface.co/datasets/BritishLibraryLabs/EThOS-PhD-metadata, we may want to be able to pass in a list of names or ORCID iDs to go under https://huggingface.co/datasets/BritishLibraryLabs/EThOS-PhD-metadata#dataset-curators.
This could end up looking something like:
I think this could also be useful for organizations/users using the hub to store data that is actively being developed/annotated. They could then use this feature to automagically create some key stats about the dataset i.e. number of instances, label frequency breakdowns, annotator agreement scores etc. and keep that documentation in sync with a changing dataset? I had planned to add something like this to https://github.com/davanstrien/hugit-cli/ but would rather piggyback on something else!
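A minimal sketch of the kind of stats that could be computed and templated into a card, assuming the labels are available as a plain list of strings. `dataset_stats` and `stats_section` are hypothetical helper names, and annotator agreement is left out for brevity.

```python
from collections import Counter


def dataset_stats(labels: list) -> dict:
    """Compute the number of instances and per-label counts."""
    return {
        "num_instances": len(labels),
        "label_counts": dict(Counter(labels)),
    }


def stats_section(stats: dict) -> str:
    """Render the stats as a short text section for a data card."""
    lines = [
        f"Number of instances: {stats['num_instances']}",
        "",
        "Label frequencies:",
    ]
    # Sort by label name so the output is stable between runs.
    for label, count in sorted(stats["label_counts"].items()):
        share = count / stats["num_instances"]
        lines.append(f"- {label}: {count} ({share:.0%})")
    return "\n".join(lines)


# Toy labels standing in for a real annotated dataset.
labels = ["positive", "negative", "positive", "neutral"]
section = stats_section(dataset_stats(labels))
```

Re-running this whenever the dataset changes and passing `section` into the card template would keep the documented numbers in step with the data.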