ronkeizer / vpc

R library to create visual predictive checks (VPC)

nonmem read speed benchmarks #10

Closed dpastoor closed 9 years ago

dpastoor commented 9 years ago

I did a benchmark of my read_nonmem() and read_nonmem_sed() functions, your Perl-based read_table_nm(), and the fastread read_csv() and data.table fread() functions: http://rpubs.com/dpastoor/benchmark-nm-read
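
Roughly the shape of the comparison, for reference (a sketch only, not the exact code from the rpubs post; sdtab001.csv stands in for a comma-separated NONMEM table):

```r
library(microbenchmark)

# compare a base-R read against the two fast parsers on the same table
microbenchmark(
  base_r   = read.csv("sdtab001.csv"),
  fread    = data.table::fread("sdtab001.csv"),
  fastread = fastread::read_csv("sdtab001.csv"),
  times    = 10
)
```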

As you'll see there (though maybe it was my computer), the Perl version was actually still quite slow, even compared to the optimized R-only version (read_nonmem()). Regardless, the fastread functions are 'where it's at' going forward.

My proposal is to drop the Perl and sed implementations and stick to a pure fastread implementation, with the optimized R version as a fallback for users who don't have that package. The question is where the read functions should live: the vpc package, PKPDmisc, or a new package. I would say PKPDmisc, since we can use it to collect the miscellaneous functions we want to share across packages (or, as in this case, use outside of just creating VPCs), and keep specific packages (a la vpc) as targeted as possible.

ronkeizer commented 9 years ago

Great. Yes, the read_table_nm version I had in here was just a placeholder; I needed something faster than readLines(), and Perl already gave me quite an improvement. But I'm sure this can be improved further, as you showed. Nice to see the comparison with sed and other tools. If fastread can do the parsing on the fly to remove the (possibly) multiple header rows in NONMEM tables, then we should definitely go with that. I'm fine with importing an improved table-read function from PKPDmisc, at least if the intention is to keep that package lean and focused (which it seems to be).

dpastoor commented 9 years ago

I'll have to look into the parsing a bit more. One thing I know is that both fastread and fread hated NONMEM's funky default two-space output, so I specified comma separation for the NONMEM tables and they behave nicely.

The second thing I figured out is that we don't actually "need" to worry about parsing if the user is kind enough to use NOTITLE: that still keeps the header columns for each subproblem but drops the non-uniform TABLE 1 line, which is what wreaks the most havoc.
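
Something like this, for example (a sketch; the variable list and file name are placeholders, and the FORMAT option is what requests the comma separation mentioned above):

```
$TABLE ID TIME DV IPRED PRED NOPRINT NOTITLE FORMAT=,1PE11.4 FILE=sdtab001
```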

Then you can either read everything in as character, strip out the extra subproblem header rows, and you're off and running, or force numeric from the get-go: at least fastread::read_csv then automatically replaces the header rows with rows of all 0's, which are easy to filter out as well.
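
A rough sketch of the first route (assuming a comma-separated table written with NOTITLE, so the only non-data lines are the repeated header rows; the file name is a placeholder):

```r
tab <- read.csv("sdtab001", stringsAsFactors = FALSE)
# coerce every column to numeric; the repeated header rows become NA
num <- suppressWarnings(as.data.frame(lapply(tab, as.numeric)))
tab <- num[complete.cases(num), ]  # header repeats drop out in one pass
```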

However, all said, there is also the 'dirty' read_lines route, so we could do the parsing line by line, but I haven't yet checked how much slower that would be.
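
That route might look something like this (a sketch, with base readLines standing in for fastread's read_lines, and assuming a comma-separated table whose header rows start with ID; the file name is a placeholder):

```r
lns <- readLines("sdtab001")
lns <- lns[!grepl("^TABLE", lns)]          # drop "TABLE NO. x" banner lines
hdr <- which(grepl("^ *ID", lns))          # locate the repeated header rows
if (length(hdr) > 1) lns <- lns[-hdr[-1]]  # keep only the first header
tab <- read.csv(text = paste(lns, collapse = "\n"))
```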

My thought is to give the user two options: use the very fast implementation and jump through the small hoop of some minor additions to the $TABLE statement, or be 'satisfied' with the generic read_nonmem, which is slower but happier to strip out the table headers etc.

And as for PKPDmisc - yes, it's supposed to be 'only' miscellaneous functions for PKPD-type work that may apply across multiple packages. Any time I'd do something larger, I'd go for a specific package.

So (as you can see) it has a grab-bag of smaller things like reading/writing, report folder structure, a gradient checker, etc. Feel free to dump any other extraneous functions you have in there as well.

ronkeizer commented 9 years ago

OK, I wasn't aware of the NOTITLE option. I see your point, but I'd really prefer a solution that works out of the box for all tables, including those specified without NOTITLE. Why don't we do it like this:

```
try fastread() on the table {
  if (fastread works) { perfect, just strip off some stuff and done }
  if (fastread fails) { fall back to the slower, more forgiving reader }
}
```

Then the user doesn't actively have to choose. It should be easy to catch a fastread failure, and I'm guessing that if it fails, it will fail fast, not after 5 seconds or so.
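
In R terms, something like this (a sketch only; fast_read stands in for whichever fastread-based reader we settle on, with read_nonmem as the slower fallback):

```r
read_nm_auto <- function(file) {
  tryCatch(
    fast_read(file),                       # hypothetical fast path
    error = function(e) read_nonmem(file)  # fall back to the forgiving reader
  )
}
```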

Sure, I might add a few things to PKPDmisc from my scripts once I have time; there are a few things that could be useful.

dpastoor commented 9 years ago

Yeah, I definitely get your point, and completely agree - I think the big 'kicker' will be benchmarking how much time the error checking takes. Worst case, if it takes a while, we can wrap the whole thing in a check_errors=T argument, so someone who 'knows what they're doing' and has the proper specifications can turn off the error checking.
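
That flag could just sit on top of the fallback sketch above (names are placeholders, not a final API):

```r
read_nm_table <- function(file, check_errors = TRUE) {
  if (!check_errors) return(fast_read(file))  # user vouches for their $TABLE setup
  tryCatch(fast_read(file), error = function(e) read_nonmem(file))
}
```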

I'll play around with read_lines and some other possibilities for error checking, and see the various ways I can break read_csv based on how NONMEM is set up. The biggest hurdle right now is that neither fastread implementation plays well (for the moment at least) with the space separation that NONMEM defaults to.

One thing I'll try to work on as the package develops is good documentation, so we can better inform users from the get-go about how to get the best performance.