huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
18.71k stars 2.58k forks

Meta-datasets (GLUE/XTREME/...) – Special care to attributions and citations #153

Open thomwolf opened 4 years ago

thomwolf commented 4 years ago

Meta-datasets are useful as standardized benchmarks, but they need special handling, in particular for attribution and authorship. It's very important that each dataset inside a meta-dataset is properly referenced, and that its own citation, homepage, etc. are clearly visible and accessible, not only the generic citation of the meta-dataset itself.

Let's take GLUE as an example:

The configuration includes the citation for each dataset (e.g. here), but it should be copied into the dataset info so that, when people access dataset.info.citation, they get both the citation for GLUE and the citation for the specific datasets inside GLUE that they have loaded.
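
To make the idea concrete, here is a minimal sketch of combining a per-dataset citation with the meta-dataset citation. It is only an illustration: merge_citations, COLA_CITATION and GLUE_CITATION are hypothetical names, not part of the library, and the BibTeX entries are abbreviated to key, title and year.

```python
# Hypothetical helper for illustration; not part of the nlp library's API.
def merge_citations(config_citation: str, meta_citation: str) -> str:
    """Return the contained dataset's BibTeX followed by the meta-dataset's."""
    entries = [c.strip() for c in (config_citation, meta_citation) if c and c.strip()]
    return "\n".join(entries)


# Abbreviated entries (key, title, year only) to keep the example short.
COLA_CITATION = """
@article{warstadt2018neural,
  title={Neural Network Acceptability Judgments},
  year={2018}
}
"""
GLUE_CITATION = """
@inproceedings{wang2019glue,
  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
  year={2019}
}
"""

# The specific dataset comes first, then the benchmark it belongs to.
combined = merge_citations(COLA_CITATION, GLUE_CITATION)
print(combined)
```

Putting the specific dataset first keeps its authors from being buried under the benchmark citation. An empty per-dataset citation simply falls back to the meta-dataset one.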

dseddah commented 4 years ago

As @yoavgo suggested, it should be possible to call a function like nlp.bib that outputs all the BibTeX references for the datasets and models actually used, and possibly nlp.bib.forreadme, which would output the same info plus version numbers so they can be included in a readme.md file.
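
A rough sketch of what such a collector might look like, assuming each loaded resource reports its name, version and BibTeX entries. Everything here is hypothetical: CitationRegistry, register, bib and for_readme are made-up names, and the resource names, versions and BibTeX keys are placeholders rather than real references.

```python
from typing import Dict, List, Tuple


class CitationRegistry:
    """Hypothetical collector; nothing like this is confirmed to exist in nlp."""

    def __init__(self) -> None:
        # resource name -> (version, list of BibTeX entries)
        self._used: Dict[str, Tuple[str, List[str]]] = {}

    def register(self, name: str, version: str, bibtex_entries: List[str]) -> None:
        """Record a dataset or model that was actually loaded."""
        self._used[name] = (version, [e.strip() for e in bibtex_entries])

    def bib(self) -> str:
        """All BibTeX entries, in first-use order, without exact duplicates."""
        seen = set()
        unique = []
        for _version, entries in self._used.values():
            for entry in entries:
                if entry not in seen:
                    seen.add(entry)
                    unique.append(entry)
        return "\n\n".join(unique)

    def for_readme(self) -> str:
        """Resource names with version numbers, followed by the BibTeX."""
        header = [f"- {name} (version {version})" for name, (version, _) in self._used.items()]
        return "\n".join(header) + "\n\n" + self.bib()


# Placeholder resources: two datasets that share one benchmark citation.
registry = CitationRegistry()
registry.register("dataset_a", "1.0.0", ["@misc{dataset_a}", "@misc{benchmark}"])
registry.register("dataset_b", "1.0.0", ["@misc{dataset_b}", "@misc{benchmark}"])
print(registry.for_readme())
```

The deduplication only catches entries that are identical as strings, so a shared benchmark citation is printed once even when several of its datasets are loaded. Matching on the BibTeX key would be more robust, but is left out to keep the sketch short.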

thomwolf commented 4 years ago

Actually, double checking with @mariamabarham, we already have this feature I think.

It's like this currently:

>>> from nlp import load_dataset
>>> 
>>> dataset = load_dataset('glue', 'cola', split='train')
>>> print(dataset.info.citation)
@article{warstadt2018neural,
  title={Neural Network Acceptability Judgments},
  author={Warstadt, Alex and Singh, Amanpreet and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1805.12471},
  year={2018}
}
@inproceedings{wang2019glue,
  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
  note={In the Proceedings of ICLR.},
  year={2019}
}

Note that each GLUE dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

What do you think @dseddah?

dseddah commented 4 years ago

Looks good, but why would the reference in the source differ from the one that gets printed?

thomwolf commented 4 years ago

Yes, I think we should remove this warning @mariamabarham.

It's probably a relic of tfds (TensorFlow Datasets), which didn't have the same way to access citations.