huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Add a metadata field for when source data was produced #3625

Open davanstrien opened 2 years ago

davanstrien commented 2 years ago

Is your feature request related to a problem? Please describe.

The current problem is that information about when source data was produced is not easily visible. Though there are a variety of metadata fields available in the dataset viewer, time period information is not included. This feature request suggests making metadata relating to the time that the underlying source data was produced more prominent and outlines why this specific information is of particular importance, both in domain-specific historic research and more broadly.

Describe the solution you'd like

There are a variety of metadata fields exposed in the dataset viewer (license, task categories, etc.). These fields make this metadata more prominent, both for human users and as potentially machine-actionable information (for example, through the API). I would propose adding a metadata field that records when the underlying source data was produced. For example, a dataset would be labelled as containing data produced between 1800 and 1900.

Describe alternatives you've considered

This information is sometimes available in the Datacard or a paper describing the dataset. However, it's often not that easy to identify or extract this information, particularly if you want to use this field as a filter to identify relevant datasets.
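As a rough illustration of the proposal and the filtering use case, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `source_period` field name, the `SourcePeriod` class, the `filter_by_period` helper, and the example dataset names and values are hypothetical and are not part of the existing Hub metadata or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourcePeriod:
    """Hypothetical metadata field: years in which the source data was produced."""

    start_year: int
    end_year: int

    def overlaps(self, other: "SourcePeriod") -> bool:
        # Two inclusive year ranges overlap if each starts before the other ends.
        return self.start_year <= other.end_year and other.start_year <= self.end_year


# Made-up dataset metadata; names and values are illustrative only.
CARDS = {
    "historic_newspapers": {"license": "cc0-1.0", "source_period": SourcePeriod(1800, 1900)},
    "recent_tweets": {"license": "mit", "source_period": SourcePeriod(2019, 2021)},
    "undated_corpus": {"license": "apache-2.0", "source_period": None},
}


def filter_by_period(cards, wanted):
    """Return the sorted names of datasets whose source period overlaps `wanted`."""
    return sorted(
        name
        for name, meta in cards.items()
        # Datasets without the field are skipped rather than guessed at.
        if meta.get("source_period") is not None
        and meta["source_period"].overlaps(wanted)
    )


if __name__ == "__main__":
    print(filter_by_period(CARDS, SourcePeriod(1850, 1860)))
```

A start/end year pair is only one possible representation; coarser labels or full dates might suit some datasets better, and datasets without the field are simply excluded from filtered results here.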

Additional context

I believe this feature is relevant for a number of reasons:

open questions

This is a slightly amorphous feature request - I would be happy to discuss further/try and propose a more concrete solution if this seems like something that could be worth considering. I realise this might also touch on other parts of the 🤗 hubs ecosystem.

severo commented 2 years ago

A question to the datasets maintainers: is there a policy about how the set of allowed metadata fields is maintained and expanded?

Metadata are very important, but defining the standard is always a struggle between allowing exhaustivity without being too complex. Archivists have Dublin Core, open data has https://frictionlessdata.io/, geo has ISO 19139 and INSPIRE, etc. and it's always a mess! I'm not sure we want to dig too much into it, but I'm curious to know if there has been some work on the metadata standard.

davanstrien commented 2 years ago

> Metadata are very important, but defining the standard is always a struggle between allowing exhaustivity without being too complex. Archivists have Dublin Core, open data has frictionlessdata.io, geo has ISO 19139 and INSPIRE, etc. and it's always a mess! I'm not sure we want to dig too much into it, but I'm curious to know if there has been some work on the metadata standard.

I thought this might be a potential issue with adding this field, since it could be hard to define what is general enough to be useful for most data vs what becomes very domain-specific. Potentially, adding one extra field leads to more and more fields in the future.

Another issue is that there are some metadata standards around data, e.g. DataCite, but not many aimed explicitly at ML data afaik. Some of the discussions around metadata for ML are also more focused on versioning/managing data in production environments. My thinking is that there, some reference to the time of production would also often be tracked/relevant, e.g. for triggering model training, so having this information available in the hub would also help address this use case.

davanstrien commented 2 years ago

Adding a relevant paper related to this topic: TimeLMs: Diachronic Language Models from Twitter

severo commented 2 years ago

Related: https://github.com/huggingface/datasets/issues/3877

severo commented 2 years ago

Also related: the Data Catalog Vocabulary (DCAT) standard will be discussed in a new Working Group at the W3C: https://www.w3.org/2022/06/dx-wg-charter.html