kbenoit / sophistication

R package associated with Benoit, Munger and Spirling (2017) paper(s)

Code for use in measuring the sophistication of political text

This package accompanies the paper “Measuring and Explaining Political Sophistication Through Textual Complexity” by Kenneth Benoit, Kevin Munger, and Arthur Spirling. It is built on quanteda.

How to install

Using the devtools package:

devtools::install_github("kbenoit/sophistication")

If installation with devtools fails, check that conda or miniconda is already installed and that you are using the correct version of spacyr. Then try installing sophistication with the following steps:

devtools::install_github("quanteda/spacyr", build_vignettes = FALSE)
library("spacyr")
spacy_install()
spacy_initialize()
devtools::install_github("kbenoit/sophistication")

For more information, see the spacyr documentation: https://cran.r-project.org/web/packages/spacyr/readme/README.html
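Before running the steps above, it can help to see which of the required packages are not yet available. The following is a minimal base R sketch; `missing_packages()` is a hypothetical helper written for this illustration and is not part of sophistication or spacyr.

```r
# Hypothetical helper (not part of sophistication or spacyr):
# return the names of packages whose namespace cannot be loaded.
missing_packages <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

# Packages named in the installation steps above.
needed <- c("devtools", "spacyr", "quanteda")
print(missing_packages(needed))
```

An empty result means all three packages can be loaded. It does not confirm that conda or a spaCy language model is set up; `spacy_initialize()` reports that separately.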

Included Data

New name                     Original name      Description
data_corpus_fifthgrade       fifthCorpus        Fifth-grade reading texts
data_corpus_crimson          crimsonCorpus      Editorials from the Harvard Crimson
data_corpus_partybroadcast   partybcastCorpus   UK political party broadcasts
data_corpus_presdebates      presDebateCorpus   US presidential debates 2016

How to use

library("sophistication")
## Loading required package: quanteda
## Package version: 2.1.9000
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
## spacy python option is already set, spacyr will use:
##  condaenv = "spacy_condaenv"
## successfully initialized (spaCy Version: 2.3.2, language model: en_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")

# make snippets of one sentence each, between 150 and 250 characters in length
data(data_corpus_sotu, package = "quanteda.corpora")
snippetData <- snippets_make(data_corpus_sotu, nsentence = 1, minchar = 150, maxchar = 250)
# clean up the snippets
snippetData <- snippets_clean(snippetData)
## Cleaning 20,662 snippets...
##    removed 1,166 snippets containing numbers of at least 1,000
##    removed 273 snippets containing ALL CAPS titles
##    ...finished.

# randomly sample five snippets
set.seed(10)
testData <- snippetData[sample(1:nrow(snippetData), 5), ]

# generate pairs for a minimum spanning tree
(snippetPairsMST <- pairs_regular_make(testData))
##           docID1 snippetID1
## 1   Madison-1813    2500042
## 2 Roosevelt-1938   14900134
## 3     Grant-1872    8400141
## 4   Johnson-1966   18100222
##                                                                                                                                                                                                                                                    text1
## 1                                                                         The minister plenipotentiary of the United States at Paris had not been enabled by proper opportunities to press the objects of his mission as prescribed by his instructions.
## 2 We have but to talk with hundreds of small bankers throughout the United States to realize that irrespective of local conditions, they are compelled in practice to accept the policies laid down by a small number of the larger banks in the Nation.
## 3                        Ten additional stations have been established in the United States, and arrangements have been made for an exchange of reports with Canada, and a similar exchange of observations is contemplated with the West India Islands.
## 4                                                                                    We will respond if others reduce their use of force, and we will withdraw our soldiers once South Vietnam is securely guaranteed the right to shape its own future.
##           docID2 snippetID2
## 1 Roosevelt-1938   14900134
## 2     Grant-1872    8400141
## 3   Johnson-1966   18100222
## 4   Clinton-1998   21900127
##                                                                                                                                                                                                                                                    text2
## 1 We have but to talk with hundreds of small bankers throughout the United States to realize that irrespective of local conditions, they are compelled in practice to accept the policies laid down by a small number of the larger banks in the Nation.
## 2                        Ten additional stations have been established in the United States, and arrangements have been made for an exchange of reports with Canada, and a similar exchange of observations is contemplated with the West India Islands.
## 3                                                                                    We will respond if others reduce their use of force, and we will withdraw our soldiers once South Vietnam is securely guaranteed the right to shape its own future.
## 4                                      And I think we should say to all the people we're trying to represent here that preparing for a far-off storm that may reach our shores is far wiser than ignoring the thunder till the clouds are just overhead.
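The output above links the five sampled snippets through four pairs, each snippet paired with the next, so every snippet is connected to every other one through a chain of comparisons. The base R sketch below reproduces only that shape. `chain_pairs()` is a hypothetical helper written for this illustration, not the implementation of `pairs_regular_make()`.

```r
# Hypothetical helper (not part of sophistication):
# pair each element with the one that follows it, giving n - 1 pairs.
chain_pairs <- function(ids) {
  stopifnot(length(ids) >= 2)
  data.frame(
    docID1 = head(ids, -1),
    docID2 = tail(ids, -1),
    stringsAsFactors = FALSE
  )
}

# Document IDs in the order they appear in the output above.
ids <- c("Madison-1813", "Roosevelt-1938", "Grant-1872",
         "Johnson-1966", "Clinton-1998")
pairs <- chain_pairs(ids)
print(pairs)
```

With n snippets this yields n - 1 comparisons, the fewest that still connect all of them.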

The package can also generate “gold” questions, pairs of snippets whose Flesch readability scores differ sharply, using pairs_gold_make():

# make a lot of candidate pairs
snippetPairsAll <- pairs_regular_make(snippetData[sample(1:nrow(snippetData), 1000), ])
# make 10 gold questions from these
pairs_gold_make(snippetPairsAll, n.pairs = 10)
## Starting the creation of gold questions...
##    computing Flesch readability measure
##    selecting top different 10 pairs
##    applying min.diff.quantile thresholds of 2.89, 34.57
##    creating gold_reason text
##    ...finished.
##              docID1 snippetID1
## 1         Taft-1910   12200029
## 2        Grant-1872    8400202
## 3         Polk-1846    5800321
## 4        Obama-2010   23100392
## 5       Monroe-1818    3000043
## 6       Hoover-1929   14100290
## 7  Eisenhower-1953b   16600327
## 8      Carter-1979b   19800517
## 9       Arthur-1884    9600167
## 10       Nixon-1971   18600034
##                                                                                                                                                                                                                                              text1
## 1                                                            In completion of this work, the regulations agreed upon require congressional legislation to make them effective and for their enforcement in fulfillment of the treaty stipulations.
## 2                                                                          The work which in some of them for some years has been in arrears has been brought down to a recent date, and in all the current business is being promptly dispatched.
## 3                                                        The reasons which induced me to recommend the measure at that time still exist, and I again submit the subject for your consideration and suggest the importance of early action upon it.
## 4                                                                 And it lives on in all the Americans who've dropped everything to go some place they've never been and pull people they've never known from rubble, prompting chants of "U.S.A.!
## 5                                                                 Even if the territory had been exclusively that of Spain and her power complete over it, we had a right by the law of nations to follow the enemy on it and to subdue him there.
## 6                                                                           Any other attitude by the Federal Government will undermine one of the most precious possessions of the American people; that is, local and individual responsibility.
## 7  I shall shortly send you specific recommendations for establishing such an appropriate commission, together with a reorganization plan defining new administrative status for all Federal activities in health, education, and social security.
## 8                         I recently announced my intention to submit legislation to Congress protecting the rights of the press, and others preparing materials for publication, from searches and seizures undertaken without judicial approval.
## 9                       The Secretary of War submits the report of the Chief of Engineers as to the practicability of protecting our important cities on the seaboard by fortifications and other defenses able to repel modern methods of attack.
## 10                                                                                     Over the next 2 weeks, I will call upon Congress to take action on more than 35 pieces of proposed legislation on which action was not completed last year.
##             docID2 snippetID2
## 1   Cleveland-1888   10000309
## 2    Coolidge-1927   13900428
## 3  Eisenhower-1960   17400074
## 4        Taft-1912   12400227
## 5     Carter-1978b   19600273
## 6       Obama-2016   23700349
## 7       Grant-1870    8200062
## 8   Roosevelt-1936   14700068
## 9     Lincoln-1861    7300176
## 10    Carter-1980b   20000194
##                                                                                                                                                                                                                                               text2
## 1                                                                                         It remains to make the most of it, and when that shall be done the curse will be lifted, the Indian race saved, and the sin of their oppression redeemed.
## 2                                                                                 Stimson, former Secretary of War, was sent there to cooperate with our diplomatic and military officers in effecting a settlement between the contending parties.
## 3                                                              These qualities of determination are particularly essential because of the fact that the process of improvement will necessarily be gradual and laborious rather than revolutionary.
## 4  The good offices which the commissioners were able to exercise were instrumental in bringing the contending parties together and in furnishing a basis of adjustment which it is hoped will result in permanent benefit to the Dominican people.
## 5                                                                                         This year we will continue our deregulatory efforts in the legislative and administrative areas in order to reduce anti-competitive practices and abuses.
## 6                               I see it in the elderly woman who will wait in line to cast her vote as long as she has to, the new citizen who casts his vote for the first time, the volunteers at the polls who believe every vote should count.
## 7                                                                                   Its possession by us will in a few years build up a coastwise commerce of immense magnitude, which will go far toward restoring to us our lost merchant marine.
## 8                                                                  In March, 1933, I appealed to the Congress of the United States and to the people of the United States in a new effort to restore power to those to whom it rightfully belonged.
## 9                                                             In a storm at sea no one on board can wish the ship to sink, and yet not unfrequently all go down together because too many will direct and no single mind can be allowed to control.
## 10                                                              If unemployment should dramatically increase, I will be prepared to consider actions to counter that increase, consistent with our overriding concern about accelerating inflation.
##        read1     read2  readdiff _golden easier_gold
## 1   14.49885 72.045000 -57.54615    TRUE           2
## 2   71.24875  9.750000  61.49875    TRUE           1
## 3   39.52375 -8.044000  47.56775    TRUE           1
## 4   60.76500 14.649211  46.11579    TRUE           1
## 5   50.44500 -4.101304  54.54630    TRUE           1
## 6    5.49200 58.347727 -52.85573    TRUE           2
## 7  -18.63875 51.958621 -70.59737    TRUE           2
## 8    4.36500 55.377941 -51.01294    TRUE           2
## 9   12.84500 59.528649 -46.68365    TRUE           2
## 10  57.79310  2.700000  55.09310    TRUE           1
##                                                                                                                                                                                                      easier_gold_reason
## 1  Text B is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 2  Text A is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 3  Text A is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 4  Text A is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 5  Text A is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 6  Text B is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 7  Text B is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 8  Text B is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 9  Text B is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 10 Text A is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
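In the output above, `readdiff` equals `read1 - read2`, and `easier_gold` is 1 when the first text has the higher score and 2 otherwise. The sketch below reproduces that relationship in base R from the first three rows. The `flesch()` function assumes the standard Flesch Reading Ease formula; whether the package computes it in exactly this way is an assumption, and the names `flesch`, `gold`, and `easier` are hypothetical.

```r
# Standard Flesch Reading Ease formula from raw counts; higher means easier.
# Assumption: shown for orientation only, not taken from the package source.
flesch <- function(n_words, n_sentences, n_syllables) {
  206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
}

# Scores copied from the first three rows of the output above.
gold <- data.frame(
  read1 = c(14.49885, 71.24875, 39.52375),
  read2 = c(72.045, 9.75, -8.044)
)

# Difference in readability, and which text (1 or 2) is the easier one.
gold$readdiff <- gold$read1 - gold$read2
gold$easier <- ifelse(gold$readdiff > 0, 1L, 2L)
print(gold)
```

Because a higher Flesch score indicates easier text, a negative `readdiff` marks the second text as the easier one, matching the `easier_gold` column in those rows.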

The package offers more than is shown here. Our documentation will improve as we develop the package toward an eventual CRAN release.