quanteda / readtext

an R package for reading text files
https://readtext.quanteda.io

Performance gains using readr::read_file() #74

Open lmullen opened 7 years ago

lmullen commented 7 years ago

readtext is great. My students will thank you.

For reading in a directory of plain text files, you can get substantial time savings (roughly 30x on my machine) by using readr::read_file() to read each file in a single call, instead of reading it with read_lines() and then pasting the lines together.

Benchmarks for a smallish corpus:

library(readtext)
library(microbenchmark)

files <- Sys.glob("~/dev/ats-corpus/corpus/*")
length(files)
#> [1] 641

get_texts_readr <- function(files) {
  texts <- vapply(files, readr::read_file, character(1))
  out <- data.frame(text = texts, stringsAsFactors = FALSE)
  class(out) <- c("readtext", "data.frame")
  out
}

microbenchmark(
  readtext_corpus <- readtext(files),
  readr_corpus <- get_texts_readr(files),
  times = 5
)
#> Unit: milliseconds
#>                                    expr        min         lq      mean
#>      readtext_corpus <- readtext(files) 36156.4825 37310.8008 38114.868
#>  readr_corpus <- get_texts_readr(files)   903.6408   906.0704  1041.474
#>      median        uq       max neval cld
#>  38350.7976 38865.321 39890.937     5   b
#>    912.5825  1153.461  1331.615     5  a

str(readtext_corpus)
#> Classes 'readtext' and 'data.frame': 641 obs. of  1 variable:
#>  $ text: chr  "HISTORICAL SKETCH OF THE AMERICAN TRACT SOCIETY. \nThis institution was organized in the year 18U, four years later than the Am"| __truncated__ "P VffDLff^ \n\n\n\n\nLETTERS 5 \n\n\\\\\\ \\ -Si fi;om \n\n) A SENIOR \n\nTO \n\n(4{\\ A JUNIOR PHYSICIAN, \n\n\n\nf \n\n\n\nTH"| __truncated__ "M<g*j§Ylft'gj. \n\n\n\n\n••^aA*)^ \n\n\n\nWHAT \n\n\n\nSHALL I DRINK? \n\n\n\nREUBEN D. MUSSEY, M.D., LLJ). \n\n\n\n\n\n\n\n\n\"| __truncated__ "1854 \n\n\n\n\n\n\n>< \n\n\n\n7^ \n\n\n\n* \n• \n* \n* \n\n\n\n\nSTEPHEN J. W. TABOR. \n\n\n\nOTIUM mm: UTERIS MORS ESI . \n\n\"| __truncated__ ...

str(readr_corpus)
#> Classes 'readtext' and 'data.frame': 641 obs. of  1 variable:
#>  $ text: chr  "HISTORICAL SKETCH OF THE AMERICAN TRACT SOCIETY. \nThis institution was organized in the year 18U, four years later than the Am"| __truncated__ "P VffDLff^ \n\n\n\n\nLETTERS 5 \n\n\\\\\\ \\ -Si fi;om \n\n) A SENIOR \n\nTO \n\n(4{\\ A JUNIOR PHYSICIAN, \n\n\n\nf \n\n\n\nTH"| __truncated__ "M<g*j§Ylft'gj. \n\n\n\n\n••^aA*)^ \n\n\n\nWHAT \n\n\n\nSHALL I DRINK? \n\n\n\nREUBEN D. MUSSEY, M.D., LLJ). \n\n\n\n\n\n\n\n\n\"| __truncated__ "1854 \n\n\n\n\n\n\n>< \n\n\n\n7^ \n\n\n\n* \n• \n* \n* \n\n\n\n\nSTEPHEN J. W. TABOR. \n\n\n\nOTIUM mm: UTERIS MORS ESI . \n\n\"| __truncated__ ...

If you're willing to take a dependency on readr, then I would be happy to send a PR. What do you think?

kbenoit commented 7 years ago

Hi @lmullen, just getting back to this now that I have time. We're also preparing a CRAN release.

I'd love to gain 30x more performance on the most commonly read type of file (text). I have no problem with adding a readr import. If you want to issue a PR with this change, by all means go ahead!

I wonder, however, how much of the performance gap is caused by the extra processing in readtext(), versus the slower readLines(). The benchmark above mostly compares a low-level reader with a high-level wrapper around (among other things) the readLines() reader. The only way to tell would be to write a parallel function and compare head-to-head, before killing the slower one off. (There can be only one ⚔️ )
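A head-to-head comparison of just the two low-level readers might look like the sketch below. The helper names `read_with_readLines()` and `read_with_read_file()` and the toy corpus are made up for illustration and are not part of readtext. The sketch also assumes LF line endings, so the two results differ only by the trailing newline that readr::read_file() preserves.

```r
library(readr)

# Hypothetical helper: base R reader, one readLines() call per file,
# with the lines pasted back together into a single string.
read_with_readLines <- function(files) {
  vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
}

# Hypothetical helper: readr reader, one read_file() call per file.
read_with_read_file <- function(files) {
  vapply(files, readr::read_file, character(1), USE.NAMES = FALSE)
}

# Build a small toy corpus in a temporary directory (illustrative data only).
corpus_dir <- tempfile("corpus")
dir.create(corpus_dir)
for (i in 1:20) {
  writeLines(
    rep(paste("document", i, "line"), 500),
    file.path(corpus_dir, sprintf("doc%02d.txt", i))
  )
}
files <- list.files(corpus_dir, full.names = TRUE)

base_texts  <- read_with_readLines(files)
readr_texts <- read_with_read_file(files)

# read_file() keeps the final newline; readLines() + paste() drops it.
same_content <- identical(sub("\n$", "", readr_texts), base_texts)

# Time each reader on its own, without any higher-level processing.
timing_base  <- system.time(read_with_readLines(files))
timing_readr <- system.time(read_with_read_file(files))
print(rbind(readLines = timing_base, read_file = timing_readr)[, "elapsed"])
```

On a corpus this small the timings will be noisy; microbenchmark(), as used earlier in the thread, would give a steadier picture on real data.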

kbenoit commented 7 years ago

I experimented with this in a branch, and it's trickier than it looks. Yes, readr::read_file() is faster, but setting the encoding file by file eats up much of the speed gain (it is still about 2x faster). The more difficult problem, however, is that we are then caught between the base R encodings (from file()) and the stringi encodings, which are not the same set and do not use the same names. Solving this will involve rebasing the code in a more significant way, also addressing #37.
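To make the two points above concrete, here is a minimal sketch, not the code from the branch. It reads one file with an explicit encoding through readr's `locale` argument, then compares the encoding names known to base R (`iconvlist()`) with those known to stringi (`stri_enc_list()`). The sample file and its contents are invented for illustration, and the name comparison only runs if stringi is installed.

```r
library(readr)

# Write a small ISO-8859-1 (Latin-1) file: the bytes for "caf" plus 0xE9 (e-acute).
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9)), path)

# Per-file encoding: pass a locale carrying the encoding to read_file().
# The result is converted to UTF-8.
txt <- read_file(path, locale = locale(encoding = "ISO-8859-1"))

# Compare the encoding names each system knows about (case-insensitively).
base_names <- toupper(iconvlist())
if (requireNamespace("stringi", quietly = TRUE)) {
  icu_names <- toupper(stringi::stri_enc_list(simplify = TRUE))
  only_in_base    <- setdiff(base_names, icu_names)
  only_in_stringi <- setdiff(icu_names, base_names)
  # The counts depend on the platform's iconv and on the ICU version.
  cat("names only in iconvlist():     ", length(only_in_base), "\n")
  cat("names only in stri_enc_list(): ", length(only_in_stringi), "\n")
}
```

The exact overlap varies by operating system, so an encoding name that is accepted on one side may need translating before it can be used on the other.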

I'm putting this on the back burner for now, but it's definitely something to address in the next revision. I also think we can remove the encoding argument and use readr::guess_encoding() instead. (Both are based on the same underlying stringi function.)
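A rough sketch of what guessing the encoding per file could look like, assuming readr and stringi are installed. The sample text and the fallback to "UTF-8" when no guess clears the confidence threshold are illustrative choices, not readtext behaviour.

```r
library(readr)

# Write a UTF-8 file containing some non-ASCII characters (illustrative text).
path <- tempfile(fileext = ".txt")
sample_text <- paste(rep("Z\u00fcrich caf\u00e9 na\u00efve.", 50), collapse = "\n")
writeBin(charToRaw(enc2utf8(sample_text)), path)

# guess_encoding() returns a data frame of candidate encodings with a
# confidence score, most likely candidate first.
guesses <- guess_encoding(path)
print(guesses)

# Take the top guess; fall back to UTF-8 if nothing was returned
# (the fallback is an assumption made for this sketch).
best <- if (nrow(guesses) > 0) guesses$encoding[1] else "UTF-8"

# Read the file using the guessed encoding.
text <- read_file(path, locale = locale(encoding = best))
```

A guessed name is not guaranteed to be one that the reader on the other side accepts, which is the naming mismatch described above, so a real implementation would likely need a mapping or validation step.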

lmullen commented 7 years ago

Thanks for the update, @kbenoit. I was just about to start work on this. Sounds like I should hold off for now, but happy to help out when you say the time is right. Looking forward to your first CRAN release.