quanteda / readtext

an R package for reading text files
https://readtext.quanteda.io

Sourcing doc_id does not work for 1-row tabular files #160

Open krystian8207 opened 4 years ago

Let's create example files:

csv1 <- data.frame(
  doc_id = c("doc1", "doc2"),
  text = c("Lorem ipsum", "dolor sit amet"),
  docvar1 = c("A", "B"),
  docvar2 = c("C", "D"),
  stringsAsFactors = FALSE
)
csv2 <- csv1[1, ]
write.csv(csv1, file = "/tmp/csv1.csv", row.names = FALSE)
write.csv(csv2, file = "/tmp/csv2.csv", row.names = FALSE)

For csv1.csv, doc_id and text are sourced correctly:

> readtext::readtext("/tmp/csv1.csv", docid_field = "doc_id", text_field = "text")
readtext object consisting of 2 documents and 2 docvars.
# Description: df[,4] [2 × 4]
  doc_id text                docvar1 docvar2
  <chr>  <chr>               <chr>   <chr>  
1 doc1   "\"Lorem ipsu\"..." A       C      
2 doc2   "\"dolor sit \"..." B       D  

For csv2.csv, which has a single row, doc_id is taken from the filename instead of the doc_id column:

> readtext::readtext("/tmp/csv2.csv", docid_field = "doc_id", text_field = "text")
readtext object consisting of 1 document and 2 docvars.
# Description: df[,4] [1 × 4]
  doc_id   text                docvar1 docvar2
  <chr>    <chr>               <chr>   <chr>  
1 csv2.csv "\"Lorem ipsu\"..." A       C
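
As a sanity check, here is a minimal sketch using base R only (no readtext) suggesting that the one-row file itself still carries the doc_id value, so the expected doc_id in the output above would be "doc1" rather than "csv2.csv". The temporary path from tempfile() is an assumption made so the snippet is self-contained; the report above uses /tmp/csv2.csv.

```r
# Recreate the one-row file from the report in a temporary location.
# (The tempfile() path is an assumption; the report writes to /tmp/csv2.csv.)
csv2 <- data.frame(
  doc_id = "doc1",
  text = "Lorem ipsum",
  docvar1 = "A",
  docvar2 = "C",
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".csv")
write.csv(csv2, file = path, row.names = FALSE)

# Read the file back with base R to inspect what is actually stored on disk.
raw <- read.csv(path, stringsAsFactors = FALSE)

# The doc_id column survives the round trip for a single-row file.
print(raw$doc_id)
```

If this prints "doc1", the doc_id column is present in the file, which would point to the one-row handling inside readtext rather than to how the CSV was written.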