Closed krystian8207 closed 5 years ago
The issue here is that by default, readtext() for a .csv input does two things:
1. It creates a doc_id variable as a serial number from the document name, with no way to override this, as for instance the docid_field argument of quanteda::corpus.data.frame(x, docid_field = "...") allows.
2. It takes as the text column one that, in the case of this file, is actually headed "doc_id". This can be overridden, however, by specifying text_field = "texts".
For reading single csv files, however, there are a number of alternative solutions that can be combined with, say, quanteda::corpus.data.frame() to get this easily into a corpus.
library("readtext")
library("quanteda")
## Package version: 1.4.3
# using readtext
rtxt <- readtext("~/Downloads/Inaugural addresses_good.csv", text_field = "texts")
corp1 <- corpus(rtxt, docid_field = "doc_id.1")
summary(corp1, n = 6)
## Corpus consisting of 58 documents, showing 6 documents:
##
## Text Types Tokens Sentences doc_id
## 1789-Washington 625 1540 23 Inaugural addresses_good.csv.1
## 1793-Washington 96 147 4 Inaugural addresses_good.csv.2
## 1797-Adams 826 2578 37 Inaugural addresses_good.csv.3
## 1801-Jefferson 717 1927 41 Inaugural addresses_good.csv.4
## 1805-Jefferson 804 2381 45 Inaugural addresses_good.csv.5
## 1809-Madison 535 1263 21 Inaugural addresses_good.csv.6
## Year President FirstName
## 1789 Washington George
## 1793 Washington George
## 1797 Adams John
## 1801 Jefferson Thomas
## 1805 Jefferson Thomas
## 1809 Madison James
##
## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpYH9BGE/reprex1f1b66ddb539/* on x86_64 by kbenoit
## Created: Mon Jun 24 16:35:28 2019
## Notes:
head(docnames(corp1))
## [1] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson"
## [5] "1805-Jefferson" "1809-Madison"
# using read.csv
rcsv <- read.csv("~/Downloads/Inaugural addresses_good.csv", stringsAsFactors = FALSE)
corp2 <- corpus(rcsv, docid_field = "doc_id", text_field = "texts")
summary(corp2, n = 6)
## Corpus consisting of 58 documents, showing 6 documents:
##
## Text Types Tokens Sentences Year President FirstName
## 1789-Washington 625 1538 23 1789 Washington George
## 1793-Washington 96 147 4 1793 Washington George
## 1797-Adams 826 2578 37 1797 Adams John
## 1801-Jefferson 717 1927 41 1801 Jefferson Thomas
## 1805-Jefferson 804 2381 45 1805 Jefferson Thomas
## 1809-Madison 535 1263 21 1809 Madison James
##
## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpYH9BGE/reprex1f1b66ddb539/* on x86_64 by kbenoit
## Created: Mon Jun 24 16:35:29 2019
## Notes:
head(docnames(corp2))
## [1] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson"
## [5] "1805-Jefferson" "1809-Madison"
# using data.table::fread (faster)
rdt <- data.table::fread("~/Downloads/Inaugural addresses_good.csv")
corp3 <- corpus(rdt, docid_field = "doc_id", text_field = "texts")
summary(corp3, n = 6)
## Corpus consisting of 58 documents, showing 6 documents:
##
## Text Types Tokens Sentences Year President FirstName
## 1789-Washington 625 1540 23 1789 Washington George
## 1793-Washington 96 147 4 1793 Washington George
## 1797-Adams 826 2578 37 1797 Adams John
## 1801-Jefferson 717 1927 41 1801 Jefferson Thomas
## 1805-Jefferson 804 2381 45 1805 Jefferson Thomas
## 1809-Madison 535 1263 21 1809 Madison James
##
## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpYH9BGE/reprex1f1b66ddb539/* on x86_64 by kbenoit
## Created: Mon Jun 24 16:35:29 2019
## Notes:
head(docnames(corp3))
## [1] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson"
## [5] "1805-Jefferson" "1809-Madison"
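For the readtext route, where the file-based doc_id still ends up as a document variable, the column layout can be mimicked and tidied in base R. This is only a sketch under assumptions: the two-row data frame, its placeholder texts, and the names csv, imported, and cleaned are made up for illustration, and the renaming step imitates the doc_id.1 result rather than reproducing readtext's actual internals.

```r
# Hypothetical stand-in for the CSV contents (placeholder texts, not the real file)
csv <- data.frame(
  doc_id = c("1789-Washington", "1793-Washington"),
  texts  = c("first placeholder text", "second placeholder text"),
  stringsAsFactors = FALSE
)

# Mimic an importer that prepends a file-based doc_id and then makes the
# column names unique (assumption: this imitates, not reproduces, readtext)
imported <- data.frame(
  doc_id = paste0("Inaugural addresses_good.csv.", seq_len(nrow(csv))),
  csv,
  check.names = FALSE,
  stringsAsFactors = FALSE
)
names(imported) <- make.unique(names(imported))
names(imported)

# Restore the original identifiers: drop the file-based column and
# rename doc_id.1 back to doc_id
cleaned <- imported[, names(imported) != "doc_id", drop = FALSE]
names(cleaned)[names(cleaned) == "doc_id.1"] <- "doc_id"
cleaned$doc_id
```

A data frame cleaned this way could then be passed to corpus() with docid_field = "doc_id" and text_field = "texts", as in the read.csv example.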
@kbenoit Thank you for the above suggestions. I think for now we'll use the one with read.csv.
Example file: Inaugural addresses_good.zip
While trying to import it without specifying text_field, doc_id is created based on the file name, the text column becomes the original doc_id, and the column texts contains the document content.
When specifying text_field by passing the exact name of the column with the document content, we got another weird behavior: doc_id is built based on the file name, the original doc_id is now renamed to doc_id.1, and finally the text is stored in the correct variable, Text.
The best option is to use the second situation to create a corpus like this:
But we're sure we don't want to store doc_id in here.
I think the readtext function should be fixed to not create a doc_id column if it already exists in the sourced file.
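The requested behavior could look roughly like the following base R sketch. The helper add_doc_id_if_missing() is hypothetical and not part of readtext, and the file name docs.csv and the sample data frames are assumptions made only for illustration.

```r
# Hypothetical helper, not part of readtext: add a file-based doc_id only
# when the source data does not already provide one
add_doc_id_if_missing <- function(df, filename) {
  if (!"doc_id" %in% names(df)) {
    df <- data.frame(
      doc_id = paste0(basename(filename), ".", seq_len(nrow(df))),
      df,
      check.names = FALSE,
      stringsAsFactors = FALSE
    )
  }
  df
}

# Sample inputs (made up): one with its own doc_id, one without
with_id <- data.frame(
  doc_id = c("a", "b"),
  texts  = c("x", "y"),
  stringsAsFactors = FALSE
)
without_id <- data.frame(texts = c("x", "y"), stringsAsFactors = FALSE)

res_with <- add_doc_id_if_missing(with_id, "docs.csv")
res_without <- add_doc_id_if_missing(without_id, "docs.csv")

names(res_with)     # existing doc_id is kept, nothing is renamed
res_without$doc_id  # file-based ids are generated only here
```

With this rule, an existing doc_id column is left untouched, so no doc_id.1 column would appear.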