I was looking into the speed of `get_download()` as it seemed to take a while to process the object returned from the Neotoma query: this was taking as long as, if not longer than, the query itself.
The main speed issue was the two calls to `l*ply()` used to process the query data into a long data frame containing all the count information (`sample.data` in the code). **plyr** is not known for its speed; rather, it is a user-friendly interface to the split-apply-combine idiom.
I was able to replace those two `l*ply()` calls with some base R code that is significantly quicker. This also does away with the `suppressMessages()` call, which is dangerous as it zaps *all* messages, not just the ones you want to hide from `melt()`.
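To illustrate the pattern only (the `samples` object below is made-up stand-in data, not the structure actually returned by the Neotoma API), the base R replacement builds each piece with `lapply()` and then binds once:

```r
## Made-up stand-in data, not the real API payload
samples <- list(
  list(taxon = c("Pinus", "Quercus"), count = c(10, 5)),
  list(taxon = c("Pinus", "Alnus"),   count = c(7, 3))
)

## plyr version (schematic): sample.data <- plyr::ldply(samples, as.data.frame)

## base R version: build the pieces, then bind them once
pieces <- lapply(seq_along(samples), function(i) {
  data.frame(sample = i,
             taxon  = samples[[i]]$taxon,
             count  = samples[[i]]$count,
             stringsAsFactors = FALSE)
})
sample.data <- do.call(rbind, pieces)
```

Binding once with `do.call(rbind, ...)` avoids both the plyr overhead and any need to silence messages.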
I also changed some of the code producing the other data frames returned:
First, I removed the `with()` statement and replaced it with direct indexing. It is generally best to avoid these sugar functions in package code as they are designed for use at the top level (the global environment), and weird things can happen if the evaluation frame isn't where you expect it to be.
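The two styles side by side, on a small made-up data frame:

```r
df <- data.frame(depth = c(10, 20), age = c(100, 200))

## Avoid: the result depends on where the expression gets evaluated
# ratio <- with(df, depth / age)

## Prefer direct indexing; no non-standard evaluation involved
ratio <- df$depth / df$age
```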
I simplified the creation of `site.data` to a single call.
I also simplified the creation of `sample.meta`, reducing the six `sapply()` or `ldply()` calls to three equivalent calls. The advantages here are the reduced number of calls, and that `sample.name` can be processed in a vectorised fashion rather than with a lot of `if ... else` calls.
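The vectorised idiom looks roughly like this (the field names here are hypothetical, purely for illustration):

```r
## Hypothetical inputs: sample names with gaps to fill from another field
sample.names   <- c("Core-A", NA, "Core-C")
analysis.units <- c("AU1", "AU2", "AU3")

## One vectorised ifelse() replaces a per-element if/else loop
sample.name <- ifelse(is.na(sample.names), analysis.units, sample.names)
```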
Whilst I was at this, I also addressed #21 by changing the `xtabs()` call to a call to `reshape2::dcast()`, which produces a data frame to return as `counts` instead of the `"xtabs"`-classed object.
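A minimal sketch of the difference, using made-up long-format counts standing in for the real `sample.data`:

```r
library(reshape2)

## Made-up long-format counts
long <- data.frame(sample = c(1, 1, 2),
                   taxon  = c("Pinus", "Quercus", "Pinus"),
                   count  = c(10, 5, 7))

## xtabs() returns an object of class c("xtabs", "table"):
# counts <- xtabs(count ~ sample + taxon, data = long)

## dcast() returns a plain data frame, one row per sample:
counts <- dcast(long, sample ~ taxon, value.var = "count", fill = 0)
```

Returning a plain data frame means downstream code needn't know how to handle the `"xtabs"` class.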
Other fixes:
You don't need to process the function call to check for the presence etc. of arguments. Here, as there is only a single main argument, it is much easier to just use `missing()`. `datasetid` is required, otherwise the function fails, so I check whether it is missing and return an error message if it is. If it is supplied, I then do the checks as before, adding one for multiple values supplied in `datasetid`.
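A minimal sketch of these checks (the function name is made up, and the real `get_download()` does more validation than this):

```r
## Hypothetical sketch of the argument checks, not the real function
get_download_sketch <- function(datasetid, verbose = TRUE) {
  if (missing(datasetid))
    stop("'datasetid' must be provided.")
  if (length(datasetid) > 1)
    stop("'datasetid' must be a single value.")
  datasetid
}
```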
There was a bug in the creation of `sample.meta`: a line in the `data.frame()` call was addressing the wrong component, `AnalysisUnitName`, when it should have been addressing `AnalysisUnitThickness`.
At the end of the original version you had

```r
if (class(aa) == 'try-error') aa <- aa
```
which is redundant, as `aa` will still be of class `"try-error"` because the two previous `if` clauses would not have been entered. Now I simply check immediately whether there was an error and bail out.
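The early bail-out pattern, in a hypothetical wrapper (note `inherits()` is also safer than comparing `class(aa) == 'try-error'` directly, since objects can have more than one class):

```r
## Hypothetical wrapper, not the actual get_download() code
fetch_sketch <- function(expr) {
  aa <- try(expr, silent = TRUE)
  if (inherits(aa, "try-error"))
    stop("Neotoma query failed: ",
         conditionMessage(attr(aa, "condition")))
  aa
}
```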
I added a `verbose` argument, defaulting to `TRUE` to maintain backwards compatibility. This allows the "API worked" message to be skipped if the user wants.
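The gate itself is just a guarded `message()` call (the function name and message text below are illustrative only):

```r
## Illustrative sketch of the verbose gate
report_api <- function(verbose = TRUE) {
  if (verbose)
    message("API call was successful.")
  invisible(TRUE)
}
```

With the default `verbose = TRUE`, existing callers see the same message as before; passing `verbose = FALSE` suppresses it without resorting to `suppressMessages()`.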