liserman / handcodeR

R-Package to fascilitate the annotation of textdata by hand in R
18 stars 1 forks source link

handcodeR

codecov CRAN_Status_Badge CRAN_latest_release_date Downloads Downloads DOI

handcodeR

R-Package to facilitate the annotation of text data by hand in R.

The goal of the handcodeR package is to provide an easy to use app to annotate text data by hand. Often times when we work with text data, we rely on hand coded annotations of texts either as unit of analysis in itself, or as training and test samples for supervised machine learning tools to classify text data. handcodeR offers a Shiny-App that can be run within R to annotate individual texts one by one in up to six different variables. To do so, the package uses the function handcode():

I present a short step-by-step guide as well as the functions in more detail below.

How to cite this package

To cite the handcodeR package, you can use:

Isermann, Lukas. (2023). handcodeR: Text annotation app. R package version 0.1.2. http://doi.org/10.5281/zenodo.8075100.

You can also access the preferred citation as well as the bibtex entry for the handcodeR Package via R:

citation("handcodeR")
#> To cite handcodeR in publications, please use:
#> 
#>   Isermann, Lukas. 2023. handcodeR: Text annotation app. R package
#>   version 0.1.2. https://doi.org/10.5281/zenodo.8075100
#> 
#> Ein BibTeX-Eintrag für LaTeX-Benutzer ist
#> 
#>   @Misc{,
#>     title = {handcodeR: Text annotation app},
#>     author = {Lukas Isermann},
#>     year = {2023},
#>     note = {R package version 0.1.2},
#>     doi = {10.5281/zenodo.8075100},
#>     url = {https://github.com/liserman/handcodeR},
#>   }

Installation

A stable version of handcodeR can be directly accessed on CRAN:

install.packages("handcodeR", force = TRUE)

To install the latest development version of handcodeR directly from GitHub use:

library(devtools) # Tools to Make Developing R Packages Easier
devtools::install_github("liserman/handcodeR", force = TRUE)

How to use this package

First, load the package

library(handcodeR) # classify texts by hand in R

In the following, we are going to exemplify the workflow of the package using a minimal working example.

The workflow of the package follows a simple rule:

  1. If you start the coding process, initialize the coding with handcode() by providing a text vector of texts you wish to annotate as data input, and up to three named character vectors of categories you want to code. Hand code as much data as you would like and return the output data frame via the save and exit-button.

  2. If you want to resume coding that you have already been working on, continue the coding with handcode() by providing the data frame you received as output from your last call of handcode() as data input.

handcode

The main function of the handcodeR package is handcode(). handcode() takes either a vector of texts and up to 6 named character vectors with classification categories, or a data frame already initialized by handcode() as input. The function allows users to annotate texts using the pre-defined categories in an interactive Shiny-App and returns a data frame of the texts with their annotations.

In order to demonstrate the functionality of handcode(), we first use the R-package archiveRetriever to download a New York Times article on the presidential debate between Joe Biden and Donald Trump in the 2020 American presidential campaign. We split the article in individual sentences which we can then annotate with handcode().

# Install pacman if not already installed
if(!require(pacman)) install.packages("pacman")

# Use pacman to install and load archiveRetriever and stringr
pacman::p_load(archiveRetriever,
               stringr)

# Use the archiveRetriever to download article
nytimes_article <- scrape_urls(Urls = "http://web.archive.org/web/20201001004918/https://www.nytimes.com/2020/09/30/opinion/biden-trump-2020-debate.html",
                               Paths = c(title = "//h1[@itemprop='headline']",
                                         author = "//span[@itemprop='name']",
                                         date = "//time//text()",
                                         article = "//section[@itemprop='articleBody']//p"))

# Split up the article in different sentences
sentences <- unlist(str_split(nytimes_article$article, pattern = "(?<=(?<!Mr)[\\.!?])\\s"))

head(sentences)
#> [1] "I wasn’t in the crowd of people who believed Joe Biden shouldn’t deign to debate President Trump, but put me in the crowd that believes he shouldn’t debate him again."                                                                                                          
#> [2] "Not after Tuesday night’s horror show: a disgrace to the format, an insult to the country, a nearly pointless 90 minutes."                                                                                                                                                       
#> [3] "And, I should add, a degradation of the presidency itself, which Trump had degraded so thoroughly already."                                                                                                                                                                      
#> [4] "He put on a performance so contemptuous, so puerile, so dishonest and so across-the-board repellent that the moderator, Chris Wallace, morphed into some amalgam of elementary-school principal, child psychologist, traffic cop and roadkill."                                  
#> [5] "No matter how Wallace pleaded with Trump or admonished him, he couldn’t make him behave."                                                                                                                                                                                        
#> [6] "But then why should Wallace have an experience any different from that of Trump’s chiefs of staff, of all the other former administration officials who have fled for the hills, of the Republican lawmakers who just threw up their hands and threw away any scruples they had?"

We can now use these sentences as input in handcode() to annotate the individual sentences of the New York Times article. We will annotate two variables measuring the candidate a sentence talks about and the sentiment of the statement.

annotated <- handcode(data = sentences, 
                      candidate = c("Joe Biden", "Donald Trump"),
                      sentiment = c("positive", "negative"))

If we want to see not only the sentence we are currently coding, but also the surrounding sentences, we can use the option context = TRUE. This gives us our current sentence alongside its previous and following sentence. To not generate any confusion about which sentence is currently evaluated, the surrounding sentences are shown in gray.

annotated <- handcode(data = sentences, 
                      candidate = c("Joe Biden", "Donald Trump"),
                      sentiment = c("positive", "negative"),
                      context = TRUE)

If our text vector does not form a continuous text, but you nonetheless want to provide a previous and next sentence as context, you can also specify a vector with all previous and a vector with all next sentences as pre and post inputs.

# Vectors of all previous and all subsequent sentences
previous <- c("", sentences[2:length(sentences)])
subsequent <- c(sentences[2:length(sentences)-1])

annotated <- handcode(data = sentences,
                      candidate = c("Joe Biden", "Donald Trump"),
                      sentiment = c("positive", "negative"),
                      context = TRUE,
                      pre = previous,
                      post = subsequent)

We can stop the annotation process at any point by clicking on the button save and exit. Once we click this button, the app will close and the function returns a data frame with our texts and annotations.

annotated
#> # A tibble: 60 × 3
#>    texts                                                     candidate sentiment
#>    <chr>                                                     <fct>     <fct>    
#>  1 I wasn’t in the crowd of people who believed Joe Biden s… "Joe Bid… "negativ…
#>  2 Not after Tuesday night’s horror show: a disgrace to the… "Not app… "negativ…
#>  3 And, I should add, a degradation of the presidency itsel… ""        ""       
#>  4 He put on a performance so contemptuous, so puerile, so … ""        ""       
#>  5 No matter how Wallace pleaded with Trump or admonished h… ""        ""       
#>  6 But then why should Wallace have an experience any diffe… ""        ""       
#>  7 Trump runs roughshod over everyone and everything, and o… ""        ""       
#>  8 Almost from the start, he talked over Biden, taunting hi… ""        ""       
#>  9 He interrupted him and interrupted him and then interrup… ""        ""       
#> 10 “Mr. President, I’m the moderator of this debate, and I … ""        ""       
#> # ℹ 50 more rows

We can resume the annotation process at any point by using the returned data frame from our last execution of handcode() as input to a new handcode() command. By default, the function will resume the annotation at the first text that has not been annotated yet.

annotated <- handcode(data = annotated,
                      context = TRUE)

To facilitate the classification process, handcode() takes the keyboard shortcuts space for previous and enter for next. If you go back to already coded lines of your data, the app automatically displays your previous coding, if you go to new lines of your data, the default values for your variables always are ““. If the last row of your data is reached, next automatically leads to the saving of the data and exit from the Shiny-App.

Beyond the basics

By default, handcode uses the first line in the input data that has not been annotated yet as start value. However, the option start allows users to specify with which observation they want to start their coding process. If we have not yet annotated lines of data that lie between already coded lines of data, you can also specify start = "all_empty" to annotate all lines that have not been coded yet in the order in which they appear.

Sometimes, we explicitly want to display texts in a random order to rule out that the context of a text within the larger body of texts influences our annotations. If we want to randomize the order in which texts are displayed, we can set the option randomize = TRUE. This will, however, not influence the order in which texts are sorted in the resulting output.

As a default, handcode will display one missing category “Not applicable”. If you want a different, or more than one missing category, you can provide a character vector of missing categories you would like to have displayed as missing. Missing categories will automatically be displayed in gray. In the output these values will be returned with a leading and trailing _.

annotated <- handcode(data = sentences, 
                      candidate = c("Joe Biden", "Donald Trump"),
                      sentiment = c("positive", "negative"),
                      missing = c("Not applicable", "Undecided"))