Open lmullen opened 7 years ago
Hi @lmullen, just getting back to this now that I have time. We're also preparing a CRAN release.
I'd love to gain 30x more performance on the most commonly read type of file (text). I have no problem with adding a readr import. If you want to issue a PR with this change, by all means go ahead!
I wonder, however, how much of the performance difference is caused by the extra processing in readtext(), versus the slower readLines() itself. The comparison above is really between a low-level reader and a high-level wrapper around (among other things) the readLines() reader. The only way to tell would be to write a parallel function and compare the two head-to-head, before killing the slower one off. (There can be only one ⚔️ )
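A head-to-head comparison along those lines could be sketched roughly as below. This is only an illustration under assumptions: the throwaway corpus, the helper names read_by_lines() and read_whole(), and the use of base R's readChar() as a stand-in for a single-call reader are all made up for this example and are not readtext code. The readr timing only runs if readr happens to be installed.

```r
# Build a small throwaway corpus of plain-text files (sizes are arbitrary).
corpus_dir <- file.path(tempdir(), "readtext_bench")
dir.create(corpus_dir, showWarnings = FALSE)
for (i in 1:50) {
  writeLines(rep(paste("line", i, "of some filler text"), 2000),
             file.path(corpus_dir, sprintf("doc%02d.txt", i)))
}
files <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)

# Candidate A (hypothetical helper): read line by line, then paste back together.
read_by_lines <- function(f) paste(readLines(f, warn = FALSE), collapse = "\n")

# Candidate B (hypothetical helper): read the whole file in one call, base R only.
read_whole <- function(f) readChar(f, nchars = file.size(f), useBytes = TRUE)

time_a <- system.time(a <- vapply(files, read_by_lines, character(1)))
time_b <- system.time(b <- vapply(files, read_whole, character(1)))

elapsed <- c(readLines = time_a[["elapsed"]], readChar = time_b[["elapsed"]])

# Candidate C: readr::read_file(), timed only when readr is available.
if (requireNamespace("readr", quietly = TRUE)) {
  time_c <- system.time(cc <- vapply(files, readr::read_file, character(1)))
  elapsed <- c(elapsed, read_file = time_c[["elapsed"]])
}

print(elapsed)
```

Timing the bare readers like this, separately from the full readtext() call, is what would separate reader cost from wrapper cost. The numbers will vary by machine and corpus, so none are shown here.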
I experimented with this in a branch, and it's trickier than it looks. Yes, readr::read_file() is faster, but handling the encoding file by file eats into the speed gains considerably (it is still about 2x faster). The more difficult problem, however, is that we are then caught between the base R encodings (from file()) and the stringi encodings, which are not the same set and do not use the same names. Solving this will involve rebasing the code in a more significant way, also addressing #37.
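To make the two encoding routes concrete, here is a minimal sketch, not the code from the branch. It writes one Latin-1 file, converts it with base R's iconv(), and, only if stringi is installed, converts the same bytes with stringi::stri_encode() and lists a few encoding names that base R knows but stringi does not report. The sample text and variable names are invented for the example.

```r
# Write a short Latin-1 encoded file (sample text is arbitrary).
txt <- "caf\u00e9 na\u00efve"
path <- tempfile(fileext = ".txt")
writeBin(iconv(txt, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]], path)

# Read the raw bytes once; each route below converts them to UTF-8.
raw_bytes <- readBin(path, what = "raw", n = file.size(path))

# Route 1: base R conversion, using the iconv name for the encoding.
via_iconv <- iconv(list(raw_bytes), from = "latin1", to = "UTF-8")

# Route 2: stringi (ICU) conversion, guarded because stringi may be absent.
if (requireNamespace("stringi", quietly = TRUE)) {
  via_stringi <- stringi::stri_encode(raw_bytes, from = "ISO-8859-1", to = "UTF-8")

  # The two libraries expose different sets of encoding names.
  only_in_base <- setdiff(iconvlist(), stringi::stri_enc_list(simplify = TRUE))
  print(head(only_in_base))
}
```

Any per-file encoding argument would have to be translated between these two name sets, or one route would have to be dropped, which is presumably why this needs more than a drop-in replacement of the reader.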
I'm putting this on the back burner for now, but it is definitely something to address in the next revision. I also think we can remove the encoding() argument and use readr::guess_encoding() instead. (Both are based on the same underlying stringi function.)
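As a rough idea of what guessing the encoding per file might look like, the sketch below wraps readr::guess_encoding() and readr::read_file() in a hypothetical helper, read_with_guess(). The helper name, the fallback to UTF-8, and the sample file are assumptions made for this example, and detection on short texts can easily be wrong.

```r
# Write a Latin-1 encoded sample file (text is arbitrary).
sample_txt <- "Les \u00e9l\u00e8ves ont \u00e9t\u00e9 tr\u00e8s occup\u00e9s cet \u00e9t\u00e9."
path <- tempfile(fileext = ".txt")
writeBin(iconv(sample_txt, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]], path)

# Hypothetical helper: guess the encoding, then read the whole file with it.
# Falls back to `fallback` when no guess clears the confidence threshold.
read_with_guess <- function(path, fallback = "UTF-8") {
  guesses <- readr::guess_encoding(path)  # columns: encoding, confidence
  enc <- if (nrow(guesses) > 0) guesses$encoding[1] else fallback
  readr::read_file(path, locale = readr::locale(encoding = enc))
}

# Only run when readr is installed.
if (requireNamespace("readr", quietly = TRUE)) {
  print(readr::guess_encoding(path))
  result <- read_with_guess(path)
}
```

A user-supplied encoding would still be needed as an override for files where the guess is wrong, so this is better read as a default than as a full replacement.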
Thanks for the update, @kbenoit. I was just about to start work on this. Sounds like I should hold off for now, but happy to help out when you say the time is right. Looking forward to your first CRAN release.
readtext is great. My students will thank you.
For reading in a directory of plain text files, you can get substantial time savings (roughly 30x on my machine) by using readr::read_file() instead of read_lines() and then pasting the lines together. Benchmarks for smallish corpus:
If you're willing to take a dependency on readr, then I would be happy to send a PR. What do you think?