Open thomwolf opened 4 years ago
As @yoavgo suggested, it should be possible to call a function like nlp.bib that outputs all the BibTeX references for the datasets and models actually used, and eventually nlp.bib.forreadme, which would output the same info plus version numbers so it can be included in a readme.md file.
Actually, double checking with @mariamabarham, we already have this feature I think.
It's like this currently:
>>> from nlp import load_dataset
>>>
>>> dataset = load_dataset('glue', 'cola', split='train')
>>> print(dataset.info.citation)
@article{warstadt2018neural,
title={Neural Network Acceptability Judgments},
author={Warstadt, Alex and Singh, Amanpreet and Bowman, Samuel R},
journal={arXiv preprint arXiv:1805.12471},
year={2018}
}
@inproceedings{wang2019glue,
title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
note={In the Proceedings of ICLR.},
year={2019}
}
Note that each GLUE dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
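For what it's worth, the nlp.bib idea from the top of the thread could be approximated on top of dataset.info.citation. Below is a minimal sketch, not an existing API: collect_citations and bib are hypothetical helper names, and the only assumption about a dataset object is that it exposes .info.citation as a string. Stand-in objects with placeholder citations are used so the snippet runs without downloading anything.

```python
from types import SimpleNamespace


def collect_citations(datasets):
    """Return the distinct, non-empty citation strings, in first-seen order.

    Hypothetical helper: it only assumes each object has `.info.citation`.
    """
    seen = set()
    citations = []
    for ds in datasets:
        citation = (getattr(ds.info, "citation", None) or "").strip()
        if citation and citation not in seen:
            seen.add(citation)
            citations.append(citation)
    return citations


def bib(datasets):
    """Join the collected citations into a single BibTeX-style string."""
    return "\n\n".join(collect_citations(datasets))


def fake_dataset(citation):
    """Build a stand-in for a loaded dataset (placeholder data, not real)."""
    return SimpleNamespace(info=SimpleNamespace(citation=citation))


a = fake_dataset("@misc{placeholder_a}")
a_dup = fake_dataset("@misc{placeholder_a}\n")  # same entry, trailing newline
empty = fake_dataset("")                         # dataset without a citation
b = fake_dataset("@misc{placeholder_b}")

print(bib([a, a_dup, empty, b]))
```

With real datasets, the list passed to bib would simply be the objects returned by load_dataset. Deduplicating on the stripped string is a deliberately simple choice; it will not merge entries that differ only in formatting.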
What do you think @dseddah?
Looks good but why would there be a difference between the ref in the source and the one to be printed?
Yes, I think we should remove this warning @mariamabarham.
It's probably a relic of tfds which didn't have the same way to access citations.
Meta-datasets are interesting as standardized benchmarks, but they also have specific behaviors, in particular around attribution and authorship. It's very important that each dataset inside a meta-dataset is properly referenced, and that its citation, homepage, etc. are clearly visible and accessible, not only the generic citation of the meta-dataset itself.
Let's take GLUE as an example:
The configuration includes the citation for each dataset (e.g. here), but it should be copied into the dataset info so that, when people access dataset.info.citation, they get both the citation for GLUE and the citation for the specific datasets inside GLUE that they have loaded.
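To make the proposal concrete, here is a rough sketch of the merge step, assuming the per-config citation and the benchmark-level citation are both available as plain strings. merge_citations is a hypothetical name, not a function in the library, and the citation strings are placeholders.

```python
def merge_citations(config_citation, benchmark_citation):
    """Combine a sub-dataset citation with its meta-dataset citation.

    Hypothetical helper: the specific dataset's citation comes first,
    followed by the benchmark's, skipping empty or duplicate entries.
    """
    parts = []
    for citation in (config_citation, benchmark_citation):
        citation = (citation or "").strip()
        if citation and citation not in parts:
            parts.append(citation)
    return "\n".join(parts)


# Placeholder strings standing in for real BibTeX entries.
cola_citation = "@article{placeholder_cola}"
glue_citation = "@inproceedings{placeholder_glue}"

print(merge_citations(cola_citation, glue_citation))
```

The result could then be stored as the citation in the dataset info at build time, so users never have to look up the per-config citation separately.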