Closed dpastoor closed 9 years ago
@ronkeizer just updated PKPDmisc to include it. Function is read_nonmem()
Let me know if you/anyone in your group has a chance to use it.
Nice. Did a quick test: one dataset was indeed a lot faster than Perl; I'll do a more quantitative test later. Another dataset threw an error, though. I'll try to figure out how the second one differs.
After loading the lib from GitHub it worked out of the box for me. But I already have the Boost C++ libraries installed; not sure if this will install as easily on a blank system?
Q: This seems to be implemented as a pre-parser before hadley's read_csv(), right? Wouldn't it be even faster to fork read_csv() and put the string formatting somewhere in that code? Then in principle you could loop over the lines just once. Not sure how read_csv is organized, though, or how much work it would be.
BTW I found a bug: I was replacing all double whitespaces with ",", but negative values messed that up (the minus sign consumes one of the pad spaces, so the gap is narrower). I just pushed a new version that handles it properly. That could be the cause of your error dataset.
Regarding the fork, I think it'd be a massive amount of work for only a small gain. The true 'fastest' implementation would be to destructively parse the dataset the first time, clearing out all the TABLE statements etc. and overwriting the existing data, then on subsequent calls use fread from data.table directly (hadley said fread will always be faster due to how the two are implemented).
The current approach is the most reasonable conservative one: it avoids modifying the original data and avoids writing a complete custom parser, given how messy these files are.
I pinged hadley here to see if he has any suggestions.
If the dataset that threw an error still fails with the updated code, post a snippet and I'll try to track it down.
Oh, and to directly answer the pre-parser question: yes. read_csv() can take either a csv file or a single string with line separators. Since we have to read_lines() the file anyway to handle the TABLE and column-header rows, this is the fastest way I can think of to feed an existing parser without writing a temp file, which would negate most of the speed gain. Using the temp-file method or a text connection, it's approximately the same speed as the Perl implementation (which is itself an issue, since a lot of Windows users don't have Perl installed if they're running NONMEM on a cluster).
Oh, and anyone who has Rcpp installed will almost certainly have Boost. I believe dplyr also uses Boost a bit, so as long as a person has dplyr installed, this shouldn't introduce any additional dependencies.
OK, just checked on a Windows computer, fixed more bugs, and tested on enough datasets to consider it ready for prime time. Also added a Header argument in case someone uses the NOTITLE option.
@ronkeizer one last thing: hadley just stabilized read_csv to accept only a file, not text, so update both PKPDmisc and readr for those changes or you'll probably get an error. Going to close this; if you have any other questions, just open an issue on PKPDmisc.
Hey Ron,
Just wanted to give you an FYI: I've been experimenting with a lot of ways of getting data as quickly (and painlessly) as possible into R. I tried Julia/Python and other solutions, then finally bit the bullet and went with C++ via Rcpp to reduce dependencies. It's ~4x faster than the Perl implementation, and ~13x faster than the native xpose solution (the non-Perl implementation).
That was for an ~20 MB file; for an ~120 MB file it comes out to ~4 seconds vs ~17 seconds.
I will try to clean up and get the associated libs correct (Rcpp needs some setup to work in a package vs just sourcing the C++ file interactively to test), and would really appreciate it if you/your group could give it a run.