dylanbeaudette opened 4 years ago
I will dump out a bunch of ideas, work that has been done so far, and some of the mysteries I am yet to figure out.
Most S4 methods that are fundamental to the SPC class and its basic operations (including subsetting, slot access, joins, etc.) now respect the three different classes. It is a bit difficult to draw the conceptual line where "basic operations" stop and where "complicated" operations start... but here is an attempt:

- Any method that takes SoilProfileCollection objects as input should be "aware" that the slots could be data.frame subclasses.
- Any method that builds an SPC (from `depths<-`) should be aware that the input class will set the metadata (and all other df classes) for the object.
- Any method that uses the `SoilProfileCollection()` constructor needs to make sure that the relevant values are set. (I would not encourage anyone to use the constructor; the less prevalent its use, the better.)

To my knowledge there are three that use the constructor: `[,SoilProfileCollection-method`, `.initSPCfromMF`/`depths<-`, and `permute_profile`. `permute_profile` is a relatively new one of mine, and should be converted to use `depths<-` or possibly further refactored. I don't think there is anything in there that warrants using the constructor rather than `depths<-` plus other setter methods/joins.
Methods have the ability to check which data.frame subclass an SPC is using via `aqp_df_class(object)`.
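Here is a minimal sketch of the `depths<-` path described above, assuming a version of aqp with the data.table support discussed in this issue (column names are hypothetical):

```r
library(aqp)
library(data.table)

# hypothetical horizon data stored as a data.table
h <- data.table(
  id     = c("P1", "P1", "P2"),
  top    = c(0, 25, 0),
  bottom = c(25, 50, 30),
  clay   = c(20, 28, 15)
)

# promote to SoilProfileCollection; the input class is recorded in the metadata
depths(h) <- id ~ top + bottom

aqp_df_class(h)     # "data.table"
class(horizons(h))  # the horizon slot keeps the data.table class
```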
There are several methods out there that do not have a method signature to enforce the SPC input relationship. Also, there are several methods that have an option to return SPCs but do not currently respect these features (they return data.frame-based SPCs no matter the input). The latter methods only have a mechanism to "respect" the aqp_df_class if they receive an SPC as input (i.e. they need the metadata in addition to the data) OR if the class is given as an argument / calculated internally from something else.
Most of the work of data.table and tbl_df inheritance is done automatically by defining their "old classes" using `setOldClass()`. This allows things like `merge(x, y)` to automatically trigger `merge.data.table`, for instance, if `x` is a data.table. This allows much of the important code for e.g. joins inside the object to be written in an unambiguous way with little effort / conceptual overhead.
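A quick toy illustration of that dispatch behavior (tables and column names are made up):

```r
library(data.table)

x <- data.table(id = c(1, 2, 2), top = c(0, 0, 25))
y <- data.table(id = c(1, 2),    name = c("A", "B"))

# because x is a data.table, a plain merge() call dispatches to
# merge.data.table; the same code on data.frame input uses merge.data.frame
z <- merge(x, y, by = "id", sort = FALSE)
class(z)   # "data.table" "data.frame"
```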
There are two special internal aqp methods for working with slots that may contain data.frame-like classes:

- `.as.data.frame.aqp(object, target_class)`
- `.data.frame.j(object, column_names, target_class)`

Where:

- `object` is the data.frame-like object (e.g. the contents of an SPC slot)
- `target_class` is a unit-length character vector, e.g. `"data.frame"` [default/failsafe], `"data.table"`, or `"tbl_df"`
- `column_names` is a vector of column names (as would be used in the data.frame column or "j" index)
```r
object <- fetchSPCSomehow()
cn <- c("column1", "column2")

# dat could contain data.frame, data.table or tbl_df class at this point
dat <- horizons(object)

# inspect
print(aqp_df_class(object))
```
Working with data.table j-index subsetting in an unambiguous way requires that one rewrite:

```r
# 1. allowed for data.frame and tbl_df but NOT data.table
dat[, cn]
```

as

```r
# 2. allowed for all three classes
.data.frame.j(dat, cn, aqp_df_class(object))
```
Anywhere a numeric j index is used should be refactored to use column names (which are much less ambiguous / prone to error). This is generally good practice even for pure data.frames, IMO, and should be easy (index + `names` attribute).
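For reference, here is a minimal sketch of what such a class-aware column selector could look like. This is a hypothetical helper, not the actual `.data.frame.j()` source:

```r
# hypothetical helper illustrating class-aware "j" subsetting
.select_j <- function(dat, column_names, target_class = "data.frame") {
  if (target_class == "data.table") {
    # data.table treats an unquoted j as an expression, so a character
    # vector of names must be passed with `with = FALSE`
    return(dat[, column_names, with = FALSE])
  }
  # data.frame and tbl_df both accept a character j-index directly;
  # drop = FALSE keeps single-column results rectangular
  dat[, column_names, drop = FALSE]
}

# usage mirrors the pattern above:
# .select_j(dat, cn, aqp_df_class(object))
```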
Where appropriate, we should consider defining functions that take an SPC object as input as S4 methods with an SPC method signature. I also find that methods that return SPCs are really handy for `%>%`-based workflows.

I don't think we should overdo it with wrapper methods, but I also think that the point of S4 is, at least in part, to dispatch methods for our special class and make sure things are handled safely (and ideally easily for the user). So if there is a major benefit to getting a particular aqp function's output into an SPC, let's define that method in aqp rather than leaving the user to figure it out.
The gist below shows a basic wrapper around `munsell2rgb`. I want to highlight how handling of data.frames for slots within the SPC works in series with on-the-fly data.table implementations for specific functions.

https://gist.github.com/brownag/e57cb66e32ea0842b7e507a18e228906

The wrapper method could be dispatched via S4 (e.g. `munsell2rgb,SoilProfileCollection-method`) so one could use the 3-vector or the SPC interface to `munsell2rgb` (using the same function name, different method signature).
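To make the idea concrete, here is a hedged sketch of an SPC-returning wrapper. To keep the example simple it uses a separate, hypothetical generic (`addSoilColor`) and assumes horizon columns named "hue", "value", and "chroma"; the gist above holds the actual prototype:

```r
library(aqp)

# hypothetical generic; the real design could instead add an SPC method
# directly to munsell2rgb(), as described above
setGeneric("addSoilColor", function(object, ...) standardGeneric("addSoilColor"))

setMethod("addSoilColor", signature(object = "SoilProfileCollection"),
  function(object, hue = "hue", value = "value", chroma = "chroma", ...) {
    h <- horizons(object)
    # convert the named horizon-level Munsell columns to hex colors
    object$soil_color <- munsell2rgb(h[[hue]], h[[value]], h[[chroma]], ...)
    # return the modified SPC, which keeps %>%-based workflows tidy
    object
  })
```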
However, there is a ton of heavy lifting that goes on inside `munsell2rgb(hue, value, chroma, ...)` that could probably be optimized using data.table. In the current system, where data.table is in Suggests, one would add a `requireNamespace()` call in the relevant scope to trigger (or error out of) the data.table code.
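Something like the following pattern, sketched with a hypothetical internal helper and assuming data.table stays in Suggests:

```r
# hypothetical helper; the fallback branch stands in for the existing
# base R / data.frame implementation
.fastColorLookup <- function(d) {
  if (requireNamespace("data.table", quietly = TRUE)) {
    dt <- data.table::as.data.table(d)
    # keyed join / grouped aggregation against the Munsell lookup table
    # would go here
    return(as.data.frame(dt))
  }
  # data.table not installed: fall back quietly ...
  message("data.table not available; using the data.frame implementation")
  d
  # ... or fail loudly instead:
  # stop("package 'data.table' is required for this function", call. = FALSE)
}
```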
It's not clear how much more effort should be put into retaining historic data.frame implementations where data.table is obviously better or dramatically simplifies the code. I have no good suggestions at this point, but `munsell2rgb` may pose a really great case study for how to convert over a pretty complex internal method that relies heavily on the data.frame object. While I have said we should retain base R functionality (I am mostly referring to the basic methods within the SPC class), that of course has limits, as you have pointed out. I will not lose any sleep if, e.g., we omit the `plyr::rbind.fill` from `union` and make no attempt to rewrite that mess in base R.
The examples I have done so far have made it relatively simple to retain the old data.frame functionality -- but `munsell2rgb` is considerably more complicated. A wholesale rewrite using the data.table (`slice` -> `dice`) approach may be most appropriate -- I need to study how it works more and think on the options that are available.
Even without internal enhancements, there is value in having the wrapper method from the above gist around it to handle merging of the result back into the SPC for the user and adding horizon/site IDs.
For some operations it is plainly evident that it is worth applying keys -- even on the fly. I think there is nothing wrong with setting a key on the fly for operations like `dice` -- where there is tremendous benefit to be gained through the course of that function call -- even if the key is "thrown out" at the end.
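A rough sketch of that on-the-fly pattern (illustrative only, not the actual `dice` internals):

```r
object <- fetchSPCSomehow()   # placeholder SPC, as above

# key the horizon table for the duration of one expensive operation
h <- data.table::as.data.table(horizons(object))
data.table::setkeyv(h, idname(object))   # idname() gives the profile ID column

# ... keyed joins / by-group work over depth slices would happen here ...

res <- as.data.frame(h)   # converting back effectively discards the key
```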
I am cautious about broader handling of keys... i.e. within SPC methods like `[`... mostly because of my limited understanding.

E.g. in `[` I explicitly constrain using `by=`. I think it may be better to set the key in addition, but I've had mixed results testing this... results that led me to believe that the two approaches were performing essentially the same. I wasn't sure whether the key was being ignored or what.
In contrast, I was very surprised to see how badly the data.table code in min/max performed, and at the 8x difference between keyed and unkeyed. I wondered whether I had messed something up in the DT syntax, making it way more costly -- since `tapply` weighed in at only 3x slower, it didn't seem possible it could be that bad (or that `tapply` could be that good, rather). It is important to remember that a lot of the base R stuff is highly optimized!
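For anyone who wants to poke at this, a toy comparison along those lines (made-up data, not the benchmark referenced above):

```r
library(data.table)

# fake horizon-like data: 10,000 profiles with 5 horizons each
h <- data.frame(
  id  = rep(seq_len(1e4), each = 5),
  top = runif(5e4, 0, 200)
)

# base R grouped minimum
m.base <- tapply(h$top, h$id, min)

# data.table grouped minimum, unkeyed then keyed
hd <- as.data.table(h)
m.dt <- hd[, min(top), by = id]

setkey(hd, id)
m.dtk <- hd[, min(top), by = id]

# wrap each variant in system.time() or bench::mark() to compare timings
```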
I suspect the cumulative effect of setting keys, or doing joins where a key is expected but not set (or otherwise "defined", e.g. using `by=`), could be significant across a series of function calls in a typical workflow run on a large SPC. Though this is just a suspicion, and it probably is (at least in part) wrong. This applies to the broader idea of allowing keys to be "re-used" and making sure they persist at some level within the SPC object itself -- not to on-the-fly keys.

I notice that keys sometimes aren't set when I "thought" they would be. I need to read up more on the by-reference and key-setting behavior of data.table, as I think it is the major conceptual gap I have relative to my intuition for base R. While I'm not a stranger to by-reference semantics, it's not how my R brain thinks.

My understanding is that a data.table is generally not copied -- essentially just passing around references -- but some of the error messages I have gotten in weird cases running interactively have made me question my understanding. I am not clear on which non-data.table operations may create shallow copies versus true copies of data.tables and how this interacts with setting keys, other operations, etc. Apparently unintentional copying is something the data.table authors themselves struggle with.
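A tiny illustration of the by-reference behavior in question (toy data):

```r
library(data.table)

d <- data.table(id = c(2, 1), top = c(0, 0))

f <- function(x) {
  # setkey() reorders and keys x in place, through the reference,
  # even though x was "passed into" a function
  setkey(x, id)
  invisible(x)
}

f(d)
key(d)    # "id": the original object was modified

d2 <- as.data.frame(d)   # an explicit conversion copies; d2 carries no key
```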
I still need to experiment a lot more with this -- at this point I was elated to have something that worked, let alone something several-fold faster across the board. Figuring out the right "data.table way" to do things will be needed to unlock the final gains in performance and solve these mysteries. Before I started this I knew next to nothing about data.table. Now I feel like I have a pretty good handle on how it is different -- but I am a total noob when it comes to effectively rewriting common base R workflows. I try many things and get many verbose errors from the data.table class... thankfully the error messages are pretty informative and there are many, many guides and examples out there. It is clearly a very well thought out and powerful platform -- though IMO not for the uninitiated.
Forgot to tag this issue in recent commits adding the data.table import
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-importing.html
EDIT: @brownag moved checkboxes to OP so that the progress tracking shows up in the issue
Probably related to `merge.data.table` downstream side-effects: https://github.com/ncss-tech/soilDB/issues/158

(EDIT: the above statement and the mention below were a false alarm.)
Keeping track of changes / testing related to the data.table replacement of plyr/reshape functionality. Each change should be checked and a new test added if there isn't sufficient coverage. See the related issues in soilDB and sharpshootR.

**plyr**

- `join` → `base::merge(..., incomparables = NA, sort = FALSE)` or `data.table` join
  - SoilProfileCollection-coercion
  - SoilProfileCollection-slice-methods (deprecated, in favor of `dice`)
  - munsell2rgb
- `ddply`
  - SoilProfileCollection-setters
  - aggregateColor
  - aggregateSoilDepth
  - getmlhz
  - missingDataGrid
- `llply`
  - profile_compare
- `ldply`
  - SoilProfileCollection-slice-methods (deprecated, in favor of `dice`)
  - aggregateColor
  - soilColorSignature
- `dlply`
  - aggregateColor
  - profile_compare (deprecated, in favor of `NCSP`)

**reshape**

- `cast` → `data.table::dcast`
  - evalGenHz
  - soilColorSignature
  - Note that the `value` argument becomes `value.var`.
- `melt` → `data.table::melt`
  - plotColorQuantiles
  - evalGenHz
  - slab
    - `slab` examples will need `reshape` package alternatives
  - soilColorSignature
If the object is not a data.table, then perform on-the-fly conversion to/from data.table. This works cleanly in `plotColorQuantiles`. A similar conversion in `evalGenHz` triggers a warning, which is not a problem and can be suppressed for now. I've added a TODO item for converting all affected columns to `numeric`.
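A minimal sketch of that on-the-fly conversion pattern (column names are hypothetical, and this is not the exact `plotColorQuantiles` code):

```r
library(data.table)

# hypothetical wide-format input, standing in for the real object
x <- data.frame(
  id       = c("P1", "P2"),
  m_value  = c(3, 4),
  m_chroma = c(4, 6)
)

# convert to data.table, reshape, then hand a data.frame back to the
# surrounding (data.frame-based) code
x.dt <- as.data.table(x)
x.long <- melt(x.dt, id.vars = "id", measure.vars = c("m_value", "m_chroma"))
x.long <- as.data.frame(x.long)
```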
**Caveats / Notes**

- `data.table::melt` will not preserve factor type / levels unless asked to do so explicitly via the `value.factor` argument. A reasonable solution is to set it after the reshape operation when appropriate. (See the short illustration after this list.)
- `id.vars` should not contain duplicated column names
- `eval` to dereference variables inside of `data.table[ , ... ]`
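A quick way to see the first caveat in action (toy data):

```r
library(data.table)

d <- data.table(
  id = 1:2,
  a  = factor(c("low", "high"), levels = c("low", "high")),
  b  = factor(c("high", "low"), levels = c("low", "high"))
)

m1 <- melt(d, id.vars = "id")                       # default behavior
m2 <- melt(d, id.vars = "id", value.factor = TRUE)  # keep the value column as factor

# compare the resulting value columns; per the caveat above, factor
# levels may need to be re-applied after the default reshape
str(m1$value)
str(m2$value)
```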
I'm fine with the current trajectory as long as we don't (yet) guarantee `class`-in == `class`-out for all functions in aqp.
**A Plan**

- `data.frame`, `tbl_df`, and `data.table` end-to-end support for functions using / returning `SoilProfileCollection` objects
- `data.table` usage (results are still `data.frame`) in remaining functions where it could give a significant performance gain (`slice`, `slab`, `munsell2rgb`, etc.)
- `SoilProfileCollection` objects will respect the input table class

**Related issues**
- #3
- #79
- #114
- #115
- #118
- #129
- #134
- #135