doc_id column should be used from csv if this exists

krystian8207 commented 5 years ago

Example file: Inaugural addresses_good.zip. When importing it without specifying text_field, doc_id is created from the file name, the text column holds the original doc_id values, and the texts column contains the document content.

library(readtext)
text <- readtext("Inaugural addresses_good.csv")
head(text)
readtext object consisting of 6 documents and 4 docvars.
# data.frame [6 × 6]
  doc_id                  text            texts                                                                                                                                                Year President FirstName
* <chr>                   <chr>           <chr>                                                                                                                                               <int> <chr>     <chr>    
1 Inaugural addresses_go… "\"1789-Washi\… "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could have filled me with …  1789 Washingt… George   
2 Inaugural addresses_go… "\"1793-Washi\… "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for…  1793 Washingt… George   
3 Inaugural addresses_go… "\"1797-Adams\… "When it was first perceived, in early times, that no middle course for America remained between unlimited submission to a foreign legislature and…  1797 Adams     John     
4 Inaugural addresses_go… "\"1801-Jeffe\… "Friends and Fellow Citizens:\n\nCalled upon to undertake the duties of the first executive office of our country, I avail myself of the presence …  1801 Jefferson Thomas   
5 Inaugural addresses_go… "\"1805-Jeffe\… "Proceeding, fellow citizens, to that qualification which the Constitution requires before my entrance on the charge again conferred on me, it is …  1805 Jefferson Thomas   
6 Inaugural addresses_go… "\"1809-Madis\… "Unwilling to depart from examples of the most revered authority, I avail myself of the occasion now presented to express the profound impression …  1809 Madison   James

When text_field is set to the exact name of the column holding the document content, we get another odd result: doc_id is still built from the file name, the original doc_id column is renamed to doc_id.1, and the document content is stored correctly in the text column.

library(readtext) 
text <- readtext("Inaugural addresses_good.csv", text_field = "texts") 
head(text)
readtext object consisting of 6 documents and 4 docvars.
# data.frame [6 × 6]
  doc_id                         text                doc_id.1         Year President  FirstName
* <chr>                          <chr>               <chr>           <int> <chr>      <chr>    
1 Inaugural addresses_good.csv.1 "\"Fellow-Cit\"..." 1789-Washington  1789 Washington George   
2 Inaugural addresses_good.csv.2 "\"Fellow cit\"..." 1793-Washington  1793 Washington George   
3 Inaugural addresses_good.csv.3 "\"When it wa\"..." 1797-Adams       1797 Adams      John     
4 Inaugural addresses_good.csv.4 "\"Friends an\"..." 1801-Jefferson   1801 Jefferson  Thomas   
5 Inaugural addresses_good.csv.5 "\"Proceeding\"..." 1805-Jefferson   1805 Jefferson  Thomas   
6 Inaugural addresses_good.csv.6 "\"Unwilling \"..." 1809-Madison     1809 Madison    James

The best option so far is to build the corpus from the second result, like this:

corp <- corpus(text, docid_field = "doc_id.1")
head(summary(corp))
             Text Types Tokens Sentences                         doc_id Year  President FirstName
1 1789-Washington   625   1540        23 Inaugural addresses_good.csv.1 1789 Washington    George
2 1793-Washington    96    147         4 Inaugural addresses_good.csv.2 1793 Washington    George
3      1797-Adams   826   2578        37 Inaugural addresses_good.csv.3 1797      Adams      John
4  1801-Jefferson   717   1927        41 Inaugural addresses_good.csv.4 1801  Jefferson    Thomas
5  1805-Jefferson   804   2381        45 Inaugural addresses_good.csv.5 1805  Jefferson    Thomas
6    1809-Madison   535   1263        21 Inaugural addresses_good.csv.6 1809    Madison     James

But we definitely don't want the file-based doc_id stored as a docvar here.

I think the readtext function should be fixed so that it does not create a doc_id column if one already exists in the source file.
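To make the request concrete, here is a minimal base-R sketch of the behavior I have in mind. The helper pick_doc_id() and its docid_field parameter are hypothetical names made up for illustration; this is not readtext's code, only an assumption about how the choice could be made.

```r
# Hypothetical helper, not part of readtext: choose document ids for a csv.
pick_doc_id <- function(df, filename, docid_field = "doc_id") {
  if (docid_field %in% names(df)) {
    # Reuse the ids supplied in the file.
    as.character(df[[docid_field]])
  } else {
    # Fall back to <filename>.<row number>, as in the output above.
    paste(filename, seq_len(nrow(df)), sep = ".")
  }
}

# Two tiny stand-ins for the csv contents: one with a doc_id column, one without.
with_id <- data.frame(
  doc_id = c("1789-Washington", "1793-Washington"),
  texts  = c("Fellow-Citizens ...", "Fellow citizens ..."),
  stringsAsFactors = FALSE
)
without_id <- with_id["texts"]

pick_doc_id(with_id, "Inaugural addresses_good.csv")
pick_doc_id(without_id, "Inaugural addresses_good.csv")
```

With a rule like this, the file-based ids would only appear when the csv has no id column of its own.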

kbenoit commented 5 years ago

The issue here is that by default, readtext() for a .csv input does two things:

1. It creates a doc_id variable as a serial number from the document name, with no way to override this, as for instance the docid_field argument of quanteda::corpus.data.frame(x, docid_field = "...") allows.
2. It assigns the first column as the text column, which in the case of this file is actually headed "doc_id". This can be overridden, however, by specifying text_field = "texts".

For reading single csv files, however, there are a number of alternative solutions that can be combined with, say, quanteda::corpus.data.frame() to get the data into a corpus easily.

library("readtext")
library("quanteda")
## Package version: 1.4.3

# using readtext
rtxt <- readtext("~/Downloads/Inaugural addresses_good.csv", text_field = "texts")
corp1 <- corpus(rtxt, docid_field = "doc_id.1")
summary(corp1, n = 6)
## Corpus consisting of 58 documents, showing 6 documents:
## 
##             Text Types Tokens Sentences                         doc_id
##  1789-Washington   625   1540        23 Inaugural addresses_good.csv.1
##  1793-Washington    96    147         4 Inaugural addresses_good.csv.2
##       1797-Adams   826   2578        37 Inaugural addresses_good.csv.3
##   1801-Jefferson   717   1927        41 Inaugural addresses_good.csv.4
##   1805-Jefferson   804   2381        45 Inaugural addresses_good.csv.5
##     1809-Madison   535   1263        21 Inaugural addresses_good.csv.6
##  Year  President FirstName
##  1789 Washington    George
##  1793 Washington    George
##  1797      Adams      John
##  1801  Jefferson    Thomas
##  1805  Jefferson    Thomas
##  1809    Madison     James
## 
## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpYH9BGE/reprex1f1b66ddb539/* on x86_64 by kbenoit
## Created: Mon Jun 24 16:35:28 2019
## Notes:
head(docnames(corp1))
## [1] "1789-Washington" "1793-Washington" "1797-Adams"      "1801-Jefferson" 
## [5] "1805-Jefferson"  "1809-Madison"

# using read.csv
rcsv <- read.csv("~/Downloads/Inaugural addresses_good.csv", stringsAsFactors = FALSE)
corp2 <- corpus(rcsv, docid_field = "doc_id", text_field = "texts")
summary(corp2, n = 6)
## Corpus consisting of 58 documents, showing 6 documents:
## 
##             Text Types Tokens Sentences Year  President FirstName
##  1789-Washington   625   1538        23 1789 Washington    George
##  1793-Washington    96    147         4 1793 Washington    George
##       1797-Adams   826   2578        37 1797      Adams      John
##   1801-Jefferson   717   1927        41 1801  Jefferson    Thomas
##   1805-Jefferson   804   2381        45 1805  Jefferson    Thomas
##     1809-Madison   535   1263        21 1809    Madison     James
## 
## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpYH9BGE/reprex1f1b66ddb539/* on x86_64 by kbenoit
## Created: Mon Jun 24 16:35:29 2019
## Notes:
head(docnames(corp2))
## [1] "1789-Washington" "1793-Washington" "1797-Adams"      "1801-Jefferson" 
## [5] "1805-Jefferson"  "1809-Madison"

# using data.table::fread (faster)
rdt <- data.table::fread("~/Downloads/Inaugural addresses_good.csv")
corp3 <- corpus(rdt, docid_field = "doc_id", text_field = "texts")
summary(corp3, n = 6)
## Corpus consisting of 58 documents, showing 6 documents:
## 
##             Text Types Tokens Sentences Year  President FirstName
##  1789-Washington   625   1540        23 1789 Washington    George
##  1793-Washington    96    147         4 1793 Washington    George
##       1797-Adams   826   2578        37 1797      Adams      John
##   1801-Jefferson   717   1927        41 1801  Jefferson    Thomas
##   1805-Jefferson   804   2381        45 1805  Jefferson    Thomas
##     1809-Madison   535   1263        21 1809    Madison     James
## 
## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpYH9BGE/reprex1f1b66ddb539/* on x86_64 by kbenoit
## Created: Mon Jun 24 16:35:29 2019
## Notes:
head(docnames(corp3))
## [1] "1789-Washington" "1793-Washington" "1797-Adams"      "1801-Jefferson" 
## [5] "1805-Jefferson"  "1809-Madison"
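
If the file-based doc_id should not survive as a docvar (it still does in corp1), one option might be to tidy the readtext result before building the corpus. The sketch below uses a hand-built data frame, rtxt_like, as a stand-in for the first two rows of the readtext output; it assumes the real object can be modified like an ordinary data.frame, which I have not verified here.

```r
# Hand-built stand-in for the readtext() result shown above (first two rows only).
rtxt_like <- data.frame(
  doc_id   = c("Inaugural addresses_good.csv.1", "Inaugural addresses_good.csv.2"),
  text     = c("Fellow-Citizens of the Senate ...",
               "Fellow citizens, I am again called upon ..."),
  doc_id.1 = c("1789-Washington", "1793-Washington"),
  Year     = c(1789L, 1793L),
  stringsAsFactors = FALSE
)

# Overwrite the file-based ids with the ids that came from the csv,
# then drop the now redundant doc_id.1 column.
rtxt_like$doc_id <- rtxt_like$doc_id.1
rtxt_like$doc_id.1 <- NULL

names(rtxt_like)
```

The tidied data frame could then presumably be passed to corpus() with docid_field = "doc_id", as in the read.csv example.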

krystian8207 commented 5 years ago

@kbenoit Thank you for the suggestions above. I think for now we'll use the one with read.csv.

quanteda / readtext

doc_id column should be used from csv if this exists #155