Closed: kbenoit closed this issue 5 years ago
I just saw this issue now. Two people have already asked me about this corpus, so it would be nice to have it in quanteda.corpora. If you want, I could give it a go and add the entire corpus to the package, including a man page with variable descriptions and links to the R&P paper and the Dataverse.
The .rds file of the entire corpus object with all texts between 1970 and 2017 is 46MB (too large for the package). But we could offer this corpus through download() and add a smaller version (maybe all documents from 2017) to the repo?
data_corpus_ungd2017: stored in the data folder
data_corpus_ungd: stored in a Dropbox (like data_corpus_guardian), accessible through quanteda.corpora::download(data_corpus_ungd)
Great! What about this: you let me know when you have added the 2018 GD and I add the 2018 debate to the package, including the continent of each country (using countrycode) and an economic variable, for instance GDP per capita.
From a teaching perspective, this would be a great corpus to study how to apply the _subset and _group functions and the scaling models.
And if students want to work with the entire corpus, they can download it from your repo and import it with readtext.
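A minimal sketch of the teaching use case mentioned above, assuming the _subset and _group functions refer to quanteda's corpus_subset() and dfm_group(). The toy texts, document names, and the hard-coded continent variable are invented for illustration and are not taken from the actual UNGD corpus; in the real data the continent would come from a lookup such as countrycode.

```r
library(quanteda)

# Toy stand-in for the 2017 UNGD corpus. Texts and docvars are invented
# for illustration only; they are not real General Debate speeches.
toy <- data.frame(
  doc_id    = c("DEU_2017", "FRA_2017", "BRA_2017", "ARG_2017"),
  text      = c("Peace and security in Europe.",
                "Climate change and security.",
                "Development and climate change.",
                "Development and trade."),
  country   = c("DEU", "FRA", "BRA", "ARG"),
  continent = c("Europe", "Europe", "Americas", "Americas"),
  stringsAsFactors = FALSE
)

# Remaining columns (country, continent) become document variables
corp <- corpus(toy, docid_field = "doc_id", text_field = "text")

# corpus_subset(): keep only documents whose docvars meet a condition
corp_europe <- corpus_subset(corp, continent == "Europe")

# dfm_group(): sum feature counts within each level of a grouping variable
dfmat <- dfm(tokens(corp, remove_punct = TRUE))
dfmat_continent <- dfm_group(dfmat, groups = docvars(dfmat, "continent"))

ndoc(corp_europe)          # number of European documents
docnames(dfmat_continent)  # one row per continent
```

Passing the grouping vector via docvars() rather than a bare variable name is a deliberate choice here, since the way dfm_group() accepts docvar names has changed between quanteda versions.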
Sounds good!
That should be okay. quanteda.corpora is designed to be a source of demonstration datasets used in articles, rather than a source of replication materials, but if you think the UN corpus would be good for demonstrating methodological issues and if it's not massive, I see no problem adding it.
A good general solution would be to create a new repo that contains a package template for corpus datasets, with test templates, that people could fork to create their own packages around a dataset, and park that on GitHub. We could even supply a vignette template, and suggest that authors use the vignette for replication or demonstration materials.
@sjankin, I was wondering whether the 2018 UNGD speeches are now available? If not, I could create the corpus object we discussed above using the 2017 UNGD speeches and add the country metadata for 2017.
@stefan-mueller let's do 2017. With the move to Berlin I am way behind on anything that is not related to German state bureaucracy. Let me know if you need any additional information from me or I can help in any way. Thanks!
Totally understandable, I went through the same around one month ago. I'll prepare the 2017 UNGD corpus and get back to you if I need your help.
@sjankin, you might have a look at #8 and check whether you would like to add more information to the documentation of the corpus or edit some details. Feel free to make changes.
@stefan-mueller, looks good. Thanks for putting it together! YUG/SRB point - Serbia is treated as a legal successor state to Yugoslavia, hence kept the latter ISO code. But it's a small point and changes make sense for GDP data you added.
UN general debate texts, from Mikhaylov, Slava; Baturo, Alexander; Dasandi, Niheer, (2017) doi:10.7910/DVN/0TJX8Y, Harvard Dataverse, V4:
From @sjankin: