ropensci / neotoma

Programmatic R interface to the Neotoma Paleoecological Database.
https://docs.ropensci.org/neotoma

Speed-up get_download, fix #21, and a range of other improvements #22

Closed: gavinsimpson closed this pull request 11 years ago

gavinsimpson commented 11 years ago

I was looking into the speed of get_download() as it seemed to take a while to process the object returned from the Neotoma query: this was taking as long as, if not longer than, the query itself.

The main speed issue was the two calls to l*ply() used to process the query data into a long data frame containing all the count information (sample.data in the code). plyr is not known for its speed; rather, it is a user-friendly interface to the split-apply-combine idiom.

I was able to replace those two l*ply() calls with some base R code that is significantly quicker. This also does away with the suppressMessages() call, which is dangerous as it suppresses all messages, not just the ones you want to hide from melt().
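
For illustration, the general shape of the base R replacement is something like the sketch below (placeholders only; `samples`, the field names and the column names are illustrative, not the exact structures returned by the Neotoma API):

    ## build one long data frame of counts from a list of per-sample records
    ## using base R instead of two l*ply() calls (sketch, placeholder fields)
    sample.data <- do.call(rbind,
                           lapply(seq_along(samples), function(i) {
                               taxa <- samples[[i]]$SampleData   # placeholder field
                               data.frame(sample = i,
                                          taxon  = sapply(taxa, `[[`, "TaxonName"),
                                          value  = sapply(taxa, `[[`, "Value"),
                                          stringsAsFactors = FALSE)
                           }))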

I also changed some of the other code producing the other data frames returned:

  1. First, I removed the with() statement. It is generally best to avoid these sugar functions in package code as they are designed for use at the top level (the global environment), and weird things can happen if the evaluation frame isn't where you expect it to be. I replaced it with direct indexing.
  2. Simplified the creation of site.data to a single call.
  3. Simplified the creation of sample.meta, reducing the six sapply() or ldply() calls to three equivalent ones. The advantages are fewer calls, and the processing of sample.name can be done in a vectorised fashion rather than via a lot of if/else calls (see the sketch after this list).
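
To illustrate the vectorised idea for sample.name (a minimal sketch; the field name and fallback label are hypothetical, not necessarily what the package uses):

    ## extract the per-sample name, falling back to NA when absent
    unit.name <- sapply(samples, function(x) {
        nm <- x$SampleUnitName                 # hypothetical field name
        if (is.null(nm)) NA_character_ else nm
    })
    ## one vectorised ifelse() replaces a chain of per-element if/else calls
    sample.name <- ifelse(is.na(unit.name), "Unnamed", unit.name)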

Whilst I was at this I also addressed #21 by changing the xtabs() call to a call to reshape2::dcast(), which returns counts as a plain data frame instead of an object of class "xtabs".
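
Continuing the placeholder column names from the earlier sketch, the reshape is roughly:

    library("reshape2")
    ## dcast() returns an ordinary data frame, unlike xtabs()
    counts <- dcast(sample.data, sample ~ taxon, value.var = "value", fill = 0)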

Other fixes:

  1. You don't need to process the function call to check for the presence of arguments. Here, as there is only a single main argument, it is much easier to just use missing(). datasetid is required, otherwise the function fails, so I check whether it is missing and return an error message if it is. If supplied, I then do the checks as before, adding one for multiple values supplied in datasetid. (These checks are sketched in the code after this list.)
  2. There was a bug in the creation of sample.meta. This line:

    thickness = sapply(samples, function(x) x$AnalysisUnitName)

    in the data.frame() call was addressing the wrong component AnalysisUnitName. It should have been addressing AnalysisUnitThickness.

  3. At the end of the original version you had

    if(class(aa) == 'try-error') aa <- aa

    which is redundant: aa will still be of class "try-error" because the two previous if clauses would not have been entered. Now I simply check immediately whether there was an error and bail out.

  4. I added a verbose argument, defaulting to TRUE to maintain backwards compatibility. This allows the "API worked" message to be skipped if the user wants.
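
As a rough outline of points 1, 3 and 4 (hypothetical names; some_api_call() stands in for the actual query code and is not part of the package):

    get_download_sketch <- function(datasetid, verbose = TRUE) {
        if (missing(datasetid))
            stop("'datasetid' must be provided.")
        if (length(datasetid) > 1)
            stop("'datasetid' must be a single value.")
        ## bail out immediately if the query failed
        aa <- try(some_api_call(datasetid), silent = TRUE)  # placeholder for the real query
        if (inherits(aa, "try-error"))
            stop("Neotoma API request failed.")
        if (verbose)
            message("API call was successful.")
        aa
    }
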
SimonGoring commented 11 years ago

Thanks Gavin, looks great!