quanteda / quanteda.corpora

A collection of corpora for quanteda

Add UN general debate corpus from Mikhaylov, Baturo, and Dasandi (2017) #6

Closed kbenoit closed 5 years ago

kbenoit commented 6 years ago

UN general debate texts, from Mikhaylov, Slava; Baturo, Alexander; Dasandi, Niheer, (2017) doi:10.7910/DVN/0TJX8Y, Harvard Dataverse, V4:

From @sjankin:

would you be interested in adding UN speeches to quanteda data?

They are currently all on Dataverse:

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/0TJX8Y

So either full corpus or a subset that may be of interest. It’s the speeches by heads of state or government at the opening of each annual session in September. This year, for example, it was The Donald’s inaugural speech. From the Dataverse it’s only the plain text files that would be useful (UNGDC 1970-2017.zip). Here’s a direct link to that archive (~60Mb):

https://www.dropbox.com/s/l9jrnqwyip9x5a3/UNGDC%201970-2017.zip?dl=0

I guess you don’t need any auxiliary files like raw PDFs.

The data are structured in separate folders by Year (from 1970 to 2017), with a separate .txt file for each country's speech.

There are 7,897 speeches in plain text format (UTF-8). Speeches are structured by Year (Session). Each speech is named using the following convention: ISO 3166-1 alpha-3 country code, followed by the UN Session number, followed by year. E.g. USA_72_2017.txt is the US speech from 2017 (the 72nd annual UN Session).
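
A minimal sketch of how that naming convention can be unpacked in base R. The helper name `parse_ungd_filename` is hypothetical, and the second file name is only an illustrative name following the same pattern:

```r
# Hypothetical helper: split a UNGD file name such as "USA_72_2017.txt"
# into country code, session number, and year.
parse_ungd_filename <- function(filenames) {
  # Drop any directory part and the ".txt" extension
  stem <- tools::file_path_sans_ext(basename(filenames))
  parts <- strsplit(stem, "_", fixed = TRUE)
  # Fail early if a name does not have exactly three "_"-separated parts
  stopifnot(all(lengths(parts) == 3))
  data.frame(
    Country = vapply(parts, `[`, character(1), 1),
    Session = as.integer(vapply(parts, `[`, character(1), 2)),
    Year    = as.integer(vapply(parts, `[`, character(1), 3)),
    stringsAsFactors = FALSE
  )
}

ungd_meta <- parse_ungd_filename(c("USA_72_2017.txt", "FRA_25_1970.txt"))
```

This does by hand roughly what `docvarsfrom = "filenames"` with `dvsep = "_"` does in readtext, which can be handy for checking file names before ingesting the whole archive.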

That was my ingestion routine (with change of directory) to get the quanteda corpus object:

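library(readtext)
library(quanteda)
# Directory holding the unzipped speeches
DATA_DIR <- "~/Dropbox/Research/UN Data/"
# Read the files and split names such as "USA_72_2017" into docvars
ungd_files <- readtext(paste0(DATA_DIR, "Converted sessions/*"),
                       docvarsfrom = "filenames",
                       dvsep = "_",
                       docvarnames = c("Country", "Session", "Year"))
# Build the quanteda corpus from the "text" column
ungd_corpus <- corpus(ungd_files, text_field = "text")
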
stefan-mueller commented 5 years ago

Just saw this issue now. Two people have already asked me about this corpus, so it would be nice to have it in quanteda.corpora. If you want, I could give it a go and add the entire corpus to the package, including a man page with variable descriptions and links to the R&P paper and the Dataverse.

stefan-mueller commented 5 years ago

The .rds file of the entire corpus object with all texts between 1970 and 2017 is 46 MB (too large for the package). But we could offer this corpus through download() and add a smaller version (maybe all documents from 2017) to the repo?

sjankin commented 5 years ago

I put some bits and pieces on the UNGD corpus on GitHub and also uploaded the full corpus archive to GitHub here.

I think adding 2017 General Debate to the repo is fine. I should also have the 2018 GD data ready in the next month (and will add it to the corpus).

stefan-mueller commented 5 years ago

Great! What about this: you let me know when you have added the 2018 GD, and I'll add the 2018 debate to the package, including the continent of each country (using countrycode) and an economic variable, for instance GDP per capita.
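
A rough sketch of the continent lookup, assuming the countrycode package is installed. The three-row data frame is toy metadata standing in for the corpus docvars, and the GDP variable is left out because its source is not specified here:

```r
library(countrycode)

# Toy metadata (an assumption); in practice this would come from the
# docvars of the corpus.
ungd_meta <- data.frame(
  Country = c("USA", "FRA", "JPN"),
  Year = 2017L,
  stringsAsFactors = FALSE
)

# Map ISO 3166-1 alpha-3 codes to continent names
ungd_meta$Continent <- countrycode(ungd_meta$Country,
                                   origin = "iso3c",
                                   destination = "continent")
```

Codes that countrycode cannot match in the `iso3c` scheme come back as `NA` with a warning, so historical or non-standard codes would need to be recoded by hand first.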

From a teaching perspective, this would be a great corpus for showing how to apply the _subset and _group functions (e.g. corpus_subset() and dfm_group()) and the scaling models.
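
To illustrate the kind of exercise this would allow, here is a small sketch, assuming quanteda 3.x. The texts and docvars are made up and only mimic the UNGD naming scheme:

```r
library(quanteda)

# Made-up toy texts and docvars standing in for real UNGD speeches
toy_corpus <- corpus(
  c(USA_72_2017 = "We seek peace and prosperity.",
    FRA_72_2017 = "Climate change demands joint action.",
    DEU_72_2017 = "Europe stands for multilateral cooperation.",
    USA_71_2016 = "Our nations must work together."),
  docvars = data.frame(
    Country   = c("USA", "FRA", "DEU", "USA"),
    Year      = c(2017L, 2017L, 2017L, 2016L),
    Continent = c("Americas", "Europe", "Europe", "Americas")
  )
)

# Keep only the 2017 speeches
corp_2017 <- corpus_subset(toy_corpus, Year == 2017)

# Tokenise, build a document-feature matrix, and pool counts by continent
dfmat_2017 <- dfm(tokens(corp_2017, remove_punct = TRUE))
dfmat_continent <- dfm_group(dfmat_2017, groups = Continent)
```

The grouped matrix has one row per continent, which could then serve as input to further analysis.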

And if students want to work with the entire corpus, they can download it from your repo and import it with readtext.

sjankin commented 5 years ago

Sounds good!

kbenoit commented 5 years ago

That should be okay. The quanteda.corpora package is designed to be a source of demonstration datasets used in articles, rather than a source of replication materials, but if you think the UN corpus would be good for demonstrating methodological issues and if it's not massive, I see no problem adding it.

A good general solution would be to create a new repo that contains a package template for corpus datasets, with test templates, that people could fork to create their own packages around a dataset, and park that on GitHub. We could even supply a vignette template, and suggest that authors use the vignette for replication or demonstration materials.

stefan-mueller commented 5 years ago

@sjankin, I was wondering whether the 2018 UNGD speeches are now available? If not, I could create the corpus object we discussed above using the 2017 UNGD speeches and add the country metadata for 2017.

sjankin commented 5 years ago

@stefan-mueller let's do 2017. With the move to Berlin I am way behind on anything that is not related to German state bureaucracy. Let me know if you need any additional information from me or I can help in any way. Thanks!

stefan-mueller commented 5 years ago

Totally understandable, I went through the same around one month ago. I'll prepare the 2017 UNGD corpus and get back to you if I need your help.

stefan-mueller commented 5 years ago

@sjankin, you might have a look at #8 and check whether you would like to add more information to the documentation of the corpus or edit some details. Feel free to make changes.

sjankin commented 5 years ago

@stefan-mueller, looks good. Thanks for putting it together! On the YUG/SRB point: Serbia is treated as a legal successor state to Yugoslavia, hence it kept the latter's ISO code. But it's a small point, and the changes make sense for the GDP data you added.