ncss-tech / aqp

Algorithms for Quantitative Pedology
http://ncss-tech.github.io/aqp/

Phase-in of data.table support in SPC objects and functions returning SPC objects #157

Open dylanbeaudette opened 4 years ago

dylanbeaudette commented 4 years ago

Keeping track of changes / testing related to the data.table replacement of plyr / reshape functionality. Each change should be checked, and a new test added if there isn't sufficient coverage. See related issues in soilDB and sharpshootR.

plyr

- join -> base::merge(..., incomparables = NA, sort = FALSE) or a data.table join
- ddply
- llply
- ldply
- dlply

reshape

- cast -> data.table::dcast. Note that the value argument becomes value.var (see the sketch after this list).
- melt -> data.table::melt
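
A quick illustration of the value -> value.var rename, using a hypothetical long-format table:

library(data.table)
d <- data.table(id = c(1, 1, 2, 2),
                variable = c('a', 'b', 'a', 'b'),
                value = 1:4)

# reshape::cast(d, id ~ variable, value = 'value') becomes:
dcast(d, id ~ variable, value.var = 'value')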

If the object is not a data.table, then perform an on-the-fly conversion to/from data.table. The following works in plotColorQuantiles:

as.data.frame(
  data.table::melt(
    data.table::as.data.table(res$marginal),
    # the documented argument name is id.vars (not id.var)
    id.vars = c('p', 'L_colors', 'A_colors', 'B_colors', 'L_chip', 'A_chip', 'B_chip')
  )
)

However, the following warning appears with a similar conversion in evalGenHz:

In melt.data.table(as.data.table(h), id.vars = genhz, measure.vars = c(vars,  :
  'measure.vars' [prop, sil.width, ...] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'double'. All measure variables not of type 'double' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.

This warning is not a problem, and can be suppressed for now. I've added a TODO item for converting all affected columns to numeric.
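
A hedged sketch of that TODO; the measure variable names come from the warning above, and the rest is illustrative:

# coerce the measure variables to numeric before melting, which
# avoids the mixed-type coercion warning entirely
vars <- c('prop', 'sil.width')
h[vars] <- lapply(h[vars], as.numeric)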

Caveats / Notes

I'm fine with the current trajectory as long as we don't (yet) guarantee class-in == class-out for all functions in aqp.

A Plan

Related issues

brownag commented 4 years ago

I will dump out a bunch of ideas, work that has been done so far, and some of the mysteries I have yet to figure out.

Extending and testing the SoilProfileCollection object

Most S4 methods that are fundamental to the SPC class and its basic operations (including subsetting, slot access, joins, etc.) now respect the three different data.frame subclasses (data.frame, data.table, tbl_df). It is a bit difficult to draw the conceptual line where "basic operations" stop and where "complicated" operations start... but here is an attempt:

To my knowledge there are three methods that use the constructor directly: [,SoilProfileCollection-method, .initSPCfromMF/depths<-, and permute_profile. permute_profile is a relatively new one of mine, and should be converted to use depths<- or possibly refactored further. I don't think there is anything in there that warrants using the constructor rather than depths<- plus other setter methods/joins.

Methods have the ability to (see the sketch after this list):

  1. optionally check the class of the SPC with aqp_df_class(object)
  2. optionally error (e.g. data.table required but not available)
  3. optionally convert intermediates to an internal class (essentially "pick one to convert to/from")
  4. write unambiguous code, or switch on the class
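
A minimal sketch of that pattern; .exampleSPCMethod is a hypothetical name, but aqp_df_class(), horizons(), and .as.data.frame.aqp() (described below) are the real pieces:

.exampleSPCMethod <- function(object) {
  # 1. check which data.frame subclass this SPC uses
  use_class <- aqp_df_class(object)

  # 2. error if a required package is not installed
  if (use_class == "data.table" &&
      !requireNamespace("data.table", quietly = TRUE)) {
    stop("package 'data.table' is required", call. = FALSE)
  }

  # 3. convert intermediates to one internal class (here: data.frame)
  h <- as.data.frame(horizons(object))

  # 4. ... unambiguous data.frame code, or a switch on use_class ...

  # convert the result back to the class recorded in the SPC metadata
  .as.data.frame.aqp(h, use_class)
}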

There are several methods out there that do not have a method signature ensuring an SPC input. Also, there are several methods with an option to return SPCs that do not currently respect these features (they return data.frame SPCs no matter what the input is). The latter methods can only "respect" the aqp_df_class if they receive an SPC as input (i.e. they need the metadata in addition to the data) OR if the class is given as an argument / calculated internally from something else.

data.table and tbl_df awareness

Most of the work of data.table and tbl_df inheritance is done automatically by defining their "old classes" with setOldClass(). This allows things like merge(x, y) to automatically trigger merge.data.table, for instance, when x is a data.table. Much of the important code for e.g. joins inside the object can therefore be written in an unambiguous way with little effort / conceptual overhead.
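
The mechanism, sketched (these exact registrations are illustrative; aqp's may differ):

# register the S3 classes so S4 dispatch and slot assignment recognize them
setOldClass("data.table")
setOldClass(c("tbl_df", "tbl", "data.frame"))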

There are two special internal aqp methods for working with slots that may contain data.frame-like classes:

- .as.data.frame.aqp -- convert a data.frame-like object to a target class

- .data.frame.j -- subset a data.frame-like object using a vector of column names in the j-index

object <- fetchSPCSomehow()
cn <- c("column1", "column2")

# dat could contain data.frame, data.table or tbl_df class at this point
dat <- horizons(object)

# inspect
print(aqp_df_class(object))

Working with data.table j-index subsetting in a way that is unambiguous across classes requires that one rewrite:

# 1. allowed for data.frame and tbl_df but NOT data.table
dat[,cn]

as

# 2. allowed for all three classes
.data.frame.j(dat, cn, aqp_df_class(object))

Anywhere a NUMERIC j index is used should be refactored to use column names (which are much less ambiguous / prone to error). This is generally good practice even for pure data.frames, IMO, and should be easy (index + names attribute).
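
For example, a hedged sketch of the "index + names attribute" refactor (j is a hypothetical numeric index; dat and object are from the snippet above):

j <- 2:3

# fragile with data.frame (breaks silently if column order changes), and
# data.table evaluates a bare variable in j as an expression, not columns
# res <- dat[, j]

# look the names up once, then subset by name across all three classes
cn <- names(dat)[j]
res <- .data.frame.j(dat, cn, aqp_df_class(object))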

Where appropriate, we should consider defining functions that take an SPC object as input as S4 methods with an SPC method signature. I also find that methods that return SPCs are really handy for %>%-based workflows.

I don't think we should overdo it with wrapper methods, but I also think that the point of S4 is, at least in part, to dispatch methods for our special class and make sure things are handled safely (and ideally easily for the user). So if there is a major benefit to getting a particular aqp function's output into an SPC, let's define that method in aqp rather than leaving the user to figure it out.

Wrappers for SPC object

The gist below shows a basic wrapper around munsell2rgb. I want to highlight how handling of data.frames for slots within the SPC works in series with on-the-fly data.table implementations for specific functions.

https://gist.github.com/brownag/e57cb66e32ea0842b7e507a18e228906

The wrapper method could be dispatched via S4 (e.g. munsell2rgb,SoilProfileCollection-method) so one could use either the 3-vector interface or the SPC interface to munsell2rgb (same function name, different method signature).
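
A hedged sketch of that dispatch (not the gist's code; the horizon column names hue/value/chroma are assumptions):

# create a generic from the existing function; the original
# implementation becomes the default (3-vector) method
setGeneric("munsell2rgb")

# SPC interface: pull the color columns from the horizon data, convert,
# and merge the result back into the SPC (as the gist does)
setMethod("munsell2rgb", "SoilProfileCollection", function(the_hue, ...) {
  h <- horizons(the_hue)
  h$soil_color <- munsell2rgb(h$hue, h$value, h$chroma)
  horizons(the_hue) <- h
  the_hue
})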

On the fly

However, there is a ton of heavy lifting that goes on inside munsell2rgb(hue, value, chroma, ...) that could probably be optimized using data.table. In the current system, where data.table is in Suggests, one would add a requireNamespace() call in the relevant scope to trigger (or error for) the data.table code.
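
For reference, the standard pattern (the branch contents are whatever implementations exist):

# optional dependency: take the fast path when data.table is installed
if (requireNamespace("data.table", quietly = TRUE)) {
  # ... data.table implementation ...
} else {
  # ... existing base R implementation, or stop() if there is none ...
}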

It's not clear how much more effort should be put into retaining historic data.frame implementations where data.table is obviously better or dramatically simplifies the code. I have no good suggestions at this point, but munsell2rgb may pose a really great case study for how to convert a pretty complex internal method that relies heavily on the data.frame object. While I have said we should retain base R functionality (I am mostly referring to the basic methods within the SPC class), that of course has limits, as you have pointed out. I will not lose any sleep if e.g. we omit plyr::rbind.fill from union and make no attempt to rewrite that mess in base R.

The examples I have done so far have been relatively simple to retain old data.frame functionality for -- but munsell2rgb is considerably more complicated. A wholesale rewrite using the data.table (slice -> dice) approach may be most appropriate -- I need to study how it works more and think about the options that are available.

Even without internal enhancements, there is value in having the wrapper method from the gist above, to handle merging the result back into the SPC for the user and adding horizon/site IDs.

Keys

For some operations it is plainly evident that it is worth applying keys -- even on the fly. I think there is nothing wrong with setting a key on the fly for operations like dice -- where there is tremendous benefit to be gained through the course of that function call -- even if the key is "thrown out" at the end.
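
A hedged sketch of an on-the-fly key inside something like dice() (the surrounding code is illustrative; setkeyv(), idname(), and the internal helpers are real):

# convert once, then key on the profile ID for fast grouped operations
h <- data.table::as.data.table(horizons(object))
data.table::setkeyv(h, idname(object))

# ... keyed joins / grouping do the heavy lifting here ...

# the key is "thrown out" when converting back at the end
res <- .as.data.frame.aqp(h, aqp_df_class(object))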

I am cautious about broader handling of keys (i.e. within SPC methods like [), mostly because of my limited understanding.

E.g. in [ I explicitly constrain using by=. I think it may be better to set the key in addition, but I've had mixed results testing this: the two approaches seemed to perform essentially the same, and I wasn't sure whether the key was being ignored or what.

In contrast, I was very surprised to see how badly the data.table code in min/max performed, with an 8x difference between keyed and unkeyed. I wondered whether I messed something up in the DT syntax, making it way more costly -- since tapply weighed in at only 3x slower, it didn't seem possible it could be that bad (or that tapply could be that good, rather). It is important to remember that a lot of the base R stuff is highly optimized!

Clearly me "doing it wrong" is the most likely explanation.

I still need to experiment a lot more with this -- at this point I was elated to have something that worked, let alone something several fold faster across the board. Figuring out the right "data.table" way to do things will be needed to unlock the final gains in performance and solve these mysteries. Before I started this I knew next to nothing about data.table. Now I feel like I have a pretty good handle on how it is different -- but am a total noob when it comes to effectively rewriting common base R workflows. I try many things and get many verbose errors from the data.table class... thankfully the error messages are pretty informative and there are many, many guides and examples out there. It is clearly a very well thought out and powerful platform -- though IMO not for the uninitiated.

brownag commented 3 years ago

Forgot to tag this issue in recent commits adding the data.table import:

  1. https://github.com/ncss-tech/aqp/commit/f56c8af6db6ea43f1fc376db53dedfb49e1d8993
  2. https://github.com/ncss-tech/aqp/commit/e1438b0b56c83d5b7935e79644ec91612b0dd993

https://cran.r-project.org/web/packages/data.table/vignettes/datatable-importing.html

dylanbeaudette commented 3 years ago

EDIT: @brownag moved checkboxes to OP so that the progress tracking shows up in the issue

brownag commented 3 years ago

Probably related to merge.data.table downstream side effects: https://github.com/ncss-tech/soilDB/issues/158. EDIT: the above statement and the mention below were a false alarm.

brownag commented 3 years ago

moved checkboxes to OP so that the progress tracking shows up in the issue