dylanbeaudette opened 4 years ago
I will dump out a bunch of ideas, work that has been done so far, and some of the mysteries I am yet to figure out.
Most S4 methods that are fundamental to the SPC class and its basic operations (including subsetting, slot access, joins, etc.) now respect the three different classes. It is a bit difficult to draw the conceptual line where "basic operations" stop and where "complicated" operations start... but here is an attempt:

- Any method that takes SoilProfileCollection objects as input should be "aware" that the slots could be data.frame subclasses.
- Any method that builds an SPC (from `depths<-`) should be aware that the input class will set the metadata (and all other df classes) for the object.
- Any method that uses the `SoilProfileCollection()` constructor needs to make sure that the relevant values are set. (I would not encourage anyone to use the constructor; the less prevalent its use, the better.)

To my knowledge there are three that use the constructor: `[,SoilProfileCollection-method`, `.initSPCfromMF`/`depths<-`, and `permute_profile`. `permute_profile` is a relatively new one of mine, and should be converted to use `depths<-` or possibly further refactored. I don't think there is anything in there that warrants using the constructor rather than `depths<-` plus other setter methods/joins.
Methods have the ability to check which data.frame subclass an SPC is using via `aqp_df_class(object)`.
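Here is a minimal sketch of the `depths<-` path described above, assuming a version of aqp with the data.table support discussed in this issue (column names are hypothetical):

```r
library(aqp)
library(data.table)

# hypothetical horizon data stored as a data.table
h <- data.table(
  id     = c("P1", "P1", "P2"),
  top    = c(0, 25, 0),
  bottom = c(25, 50, 30),
  clay   = c(20, 28, 15)
)

# promote to SoilProfileCollection; the input class is recorded in the metadata
depths(h) <- id ~ top + bottom

aqp_df_class(h)     # "data.table"
class(horizons(h))  # the horizon slot keeps the data.table class
```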
There are several methods out there that do not have a method signature to enforce the SPC input relationship. Also, there are several methods that have an option to return SPCs but do not currently respect these features (they return data.frame-based SPCs no matter the input). The latter methods only have a mechanism to "respect" the aqp_df_class if they receive an SPC as input (i.e. they need the metadata in addition to the data) OR if the class is given as an argument / calculated internally from something else.
Most of the work of data.table and tbl_df inheritance is done automatically by defining their "old classes" using `setOldClass()`. This allows things like `merge(x, y)` to automatically trigger `merge.data.table`, for instance, if `x` is a data.table. This allows much of the important code for e.g. joins inside the object to be written in an unambiguous way with little effort / conceptual overhead.
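A quick toy illustration of that dispatch behavior (tables and column names are made up):

```r
library(data.table)

x <- data.table(id = c(1, 2, 2), top = c(0, 0, 25))
y <- data.table(id = c(1, 2),    name = c("A", "B"))

# because x is a data.table, a plain merge() call dispatches to
# merge.data.table; the same code on data.frame input uses merge.data.frame
z <- merge(x, y, by = "id", sort = FALSE)
class(z)   # "data.table" "data.frame"
```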
There are two special internal aqp methods for working with slots that may contain data.frame-like classes:

- `.as.data.frame.aqp(object, target_class)`
- `.data.frame.j(object, column_names, target_class)`

Where:

- `object` is the data.frame-like object (e.g. the contents of an SPC slot)
- `target_class` is a unit-length character vector, e.g. `"data.frame"` [default/failsafe], `"data.table"`, or `"tbl_df"`
- `column_names` is a vector of column names (as would be used in the data.frame column or "j" index)
```r
object <- fetchSPCSomehow()
cn <- c("column1", "column2")

# dat could contain data.frame, data.table or tbl_df class at this point
dat <- horizons(object)

# inspect
print(aqp_df_class(object))
```
Working with data.table j-index subsetting in an unambiguous way requires that one rewrite:

```r
# 1. allowed for data.frame and tbl_df but NOT data.table
dat[, cn]
```

as

```r
# 2. allowed for all three classes
.data.frame.j(dat, cn, aqp_df_class(object))
```
Anywhere a numeric j index is used should be refactored to use column names (which are much less ambiguous / prone to error). This is generally good practice even for pure data.frames, IMO, and should be easy (index + `names` attribute).
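For reference, here is a minimal sketch of what such a class-aware column selector could look like. This is a hypothetical helper, not the actual `.data.frame.j()` source:

```r
# hypothetical helper illustrating class-aware "j" subsetting
.select_j <- function(dat, column_names, target_class = "data.frame") {
  if (target_class == "data.table") {
    # data.table treats an unquoted j as an expression, so a character
    # vector of names must be passed with `with = FALSE`
    return(dat[, column_names, with = FALSE])
  }
  # data.frame and tbl_df both accept a character j-index directly;
  # drop = FALSE keeps single-column results rectangular
  dat[, column_names, drop = FALSE]
}

# usage mirrors the pattern above:
# .select_j(dat, cn, aqp_df_class(object))
```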
Where appropriate, we should consider defining functions that take an SPC object as input as S4 methods with an SPC method signature. I also find that methods that return SPCs are really handy for `%>%`-based workflows.

I don't think we should overdo it with wrapper methods, but I also think that the point of S4 is, at least in part, to dispatch methods for our special class and make sure things are handled safely (and ideally easily for the user). So if there is a major benefit to getting a particular aqp function's output into an SPC, let's define that method in aqp rather than leaving the user to figure it out.
The gist below shows a basic wrapper around `munsell2rgb`. I want to highlight how handling of data.frames for slots within the SPC works in series with on-the-fly data.table implementations for specific functions.

https://gist.github.com/brownag/e57cb66e32ea0842b7e507a18e228906

The wrapper method could be dispatched via S4 (e.g. `munsell2rgb,SoilProfileCollection-method`) so one could use the 3-vector or the SPC interface to `munsell2rgb` (using the same function name, different method signature).
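To make the idea concrete, here is a hedged sketch of an SPC-returning wrapper. To keep the example simple it uses a separate, hypothetical generic (`addSoilColor`) and assumes horizon columns named "hue", "value", and "chroma"; the gist above holds the actual prototype:

```r
library(aqp)

# hypothetical generic; the real design could instead add an SPC method
# directly to munsell2rgb(), as described above
setGeneric("addSoilColor", function(object, ...) standardGeneric("addSoilColor"))

setMethod("addSoilColor", signature(object = "SoilProfileCollection"),
  function(object, hue = "hue", value = "value", chroma = "chroma", ...) {
    h <- horizons(object)
    # convert the named horizon-level Munsell columns to hex colors
    object$soil_color <- munsell2rgb(h[[hue]], h[[value]], h[[chroma]], ...)
    # return the modified SPC, which keeps %>%-based workflows tidy
    object
  })
```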
However, there is a ton of heavy lifting that goes on inside `munsell2rgb(hue, value, chroma, ...)` that could probably be optimized using data.table. In the current system, where data.table is in Suggests, one would add a `requireNamespace()` call in the relevant scope to trigger (or error out of) the data.table code.
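Something like the following pattern, sketched with a hypothetical internal helper and assuming data.table stays in Suggests:

```r
# hypothetical helper; the fallback branch stands in for the existing
# base R / data.frame implementation
.fastColorLookup <- function(d) {
  if (requireNamespace("data.table", quietly = TRUE)) {
    dt <- data.table::as.data.table(d)
    # keyed join / grouped aggregation against the Munsell lookup table
    # would go here
    return(as.data.frame(dt))
  }
  # data.table not installed: fall back quietly ...
  message("data.table not available; using the data.frame implementation")
  d
  # ... or fail loudly instead:
  # stop("package 'data.table' is required for this function", call. = FALSE)
}
```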
It's not clear how much more effort should be put into retaining historic data.frame implementations where data.table is obviously better or dramatically simplifies the code. I have no good suggestions at this point, but `munsell2rgb` may pose a really great case study for how to convert over a pretty complex internal method that relies heavily on the data.frame object. While I have said we should retain base R functionality (I am mostly referring to the basic methods within the SPC class), that of course has limits, as you have pointed out. I will not lose any sleep if, e.g., we omit the `plyr::rbind.fill` from `union` and make no attempt to rewrite that mess in base R.
The examples I have done so far have made it relatively simple to retain the old data.frame functionality -- but `munsell2rgb` is considerably more complicated. A wholesale rewrite using the data.table (`slice` -> `dice`) approach may be most appropriate -- I need to study how it works more and think on the options that are available.
Even without internal enhancements, there is value in having the wrapper method from the above gist around it to handle merging of the result back into the SPC for the user and adding horizon/site IDs.
For some operations it is plainly evident that it is worth applying keys -- even on the fly. I think there is nothing wrong with setting a key on the fly for operations like `dice` -- where there is tremendous benefit to be gained through the course of that function call -- even if the key is "thrown out" at the end.
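A rough sketch of that on-the-fly pattern (illustrative only, not the actual `dice` internals):

```r
object <- fetchSPCSomehow()   # placeholder SPC, as above

# key the horizon table for the duration of one expensive operation
h <- data.table::as.data.table(horizons(object))
data.table::setkeyv(h, idname(object))   # idname() gives the profile ID column

# ... keyed joins / by-group work over depth slices would happen here ...

res <- as.data.frame(h)   # converting back effectively discards the key
```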
I am cautious about broader handling of keys... i.e. within SPC methods like `[`... mostly because of my limited understanding.

E.g. in `[` I explicitly constrain using `by=`. I think it may be better to set the key in addition, but I've had mixed results testing this... results that led me to believe that the two approaches were performing essentially the same. I wasn't sure whether the key was being ignored or what.
In contrast, I was very surprised to see how badly the data.table code in min/max performed, and at the 8x difference between keyed and unkeyed. I wondered whether I had messed something up in the DT syntax, making it way more costly -- since `tapply` weighed in at only 3x slower, it didn't seem possible it could be that bad (or that `tapply` could be that good, rather). It is important to remember that a lot of the base R stuff is highly optimized!
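For anyone who wants to poke at this, a toy comparison along those lines (made-up data, not the benchmark referenced above):

```r
library(data.table)

# fake horizon-like data: 10,000 profiles with 5 horizons each
h <- data.frame(
  id  = rep(seq_len(1e4), each = 5),
  top = runif(5e4, 0, 200)
)

# base R grouped minimum
m.base <- tapply(h$top, h$id, min)

# data.table grouped minimum, unkeyed then keyed
hd <- as.data.table(h)
m.dt <- hd[, min(top), by = id]

setkey(hd, id)
m.dtk <- hd[, min(top), by = id]

# wrap each variant in system.time() or bench::mark() to compare timings
```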
I suspect the cumulative effect of setting keys, or doing joins where a key is expected but not set (or otherwise "defined", e.g. using `by=`), could be significant across a series of function calls in a typical workflow run on a large SPC. Though this is just a suspicion, and it probably is (at least in part) wrong. This applies to the broader idea of allowing keys to be "re-used" and making sure they persist at some level within the SPC object itself -- not to on-the-fly keys.

I notice that keys sometimes aren't set when I "thought" they would be. I need to read up more on the by-reference and key-setting behavior of data.table, as I think it is the major conceptual gap I have relative to my intuition for base R. While I'm not a stranger to by-reference semantics, it's not how my R brain thinks.

My understanding is that a data.table is generally not copied -- essentially just passing around references -- but some of the error messages I have gotten in weird cases running interactively have made me question my understanding. I am not clear on which non-data.table operations may create shallow copies versus true copies of data.tables and how this interacts with setting keys, other operations, etc. Apparently unintentional copying is something the data.table authors themselves struggle with.
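A tiny illustration of the by-reference behavior in question (toy data):

```r
library(data.table)

d <- data.table(id = c(2, 1), top = c(0, 0))

f <- function(x) {
  # setkey() reorders and keys x in place, through the reference,
  # even though x was "passed into" a function
  setkey(x, id)
  invisible(x)
}

f(d)
key(d)    # "id": the original object was modified

d2 <- as.data.frame(d)   # an explicit conversion copies; d2 carries no key
```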
I still need to experiment a lot more with this -- at this point I was elated to have something that worked, let alone something several-fold faster across the board. Figuring out the right "data.table way" to do things will be needed to unlock the final gains in performance and solve these mysteries. Before I started this I knew next to nothing about data.table. Now I feel like I have a pretty good handle on how it is different -- but I am a total noob when it comes to effectively rewriting common base R workflows. I try many things and get many verbose errors from the data.table class... thankfully the error messages are pretty informative and there are many, many guides and examples out there. It is clearly a very well thought out and powerful platform -- though IMO not for the uninitiated.
Forgot to tag this issue in recent commits adding the data.table import
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-importing.html
EDIT: @brownag moved checkboxes to OP so that the progress tracking shows up in the issue
Probably related to `merge.data.table` downstream side-effects: https://github.com/ncss-tech/soilDB/issues/158

(EDIT: the above statement and the mention below were a false alarm.)
Keeping track of changes / testing related to the data.table replacement of plyr/reshape functionality. Each change should be checked and a new test added if there isn't sufficient coverage. See the related issues in soilDB and sharpshootR.

**plyr**

- `join` → `base::merge(..., incomparables = NA, sort = FALSE)` or `data.table` join
  - SoilProfileCollection-coercion
  - SoilProfileCollection-slice-methods (deprecated, in favor of `dice`)
  - munsell2rgb
- `ddply`
  - SoilProfileCollection-setters
  - aggregateColor
  - aggregateSoilDepth
  - getmlhz
  - missingDataGrid
- `llply`
  - profile_compare
- `ldply`
  - SoilProfileCollection-slice-methods (deprecated, in favor of `dice`)
  - aggregateColor
  - soilColorSignature
- `dlply`
  - aggregateColor
  - profile_compare (deprecated, in favor of `NCSP`)

**reshape**

- `cast` → `data.table::dcast`
  - evalGenHz
  - soilColorSignature
  - Note that the `value` argument becomes `value.var`.
- `melt` → `data.table::melt`
  - plotColorQuantiles
  - evalGenHz
  - slab
    - `slab` examples will need `reshape` package alternatives
  - soilColorSignature
If the object is not a data.table, then perform on-the-fly conversion to/from data.table. This works cleanly in `plotColorQuantiles`. A similar conversion in `evalGenHz` triggers a warning, which is not a problem and can be suppressed for now. I've added a TODO item for converting all affected columns to `numeric`.
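A minimal sketch of that on-the-fly conversion pattern (column names are hypothetical, and this is not the exact `plotColorQuantiles` code):

```r
library(data.table)

# hypothetical wide-format input, standing in for the real object
x <- data.frame(
  id       = c("P1", "P2"),
  m_value  = c(3, 4),
  m_chroma = c(4, 6)
)

# convert to data.table, reshape, then hand a data.frame back to the
# surrounding (data.frame-based) code
x.dt <- as.data.table(x)
x.long <- melt(x.dt, id.vars = "id", measure.vars = c("m_value", "m_chroma"))
x.long <- as.data.frame(x.long)
```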
**Caveats / Notes**

- `data.table::melt` will not preserve factor type / levels unless asked to do so explicitly via the `value.factor` argument. A reasonable solution is to set it after the reshape operation when appropriate. (See the short illustration after this list.)
- `id.vars` should not contain duplicated column names
- `eval` to dereference variables inside of `data.table[ , ... ]`
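A quick way to see the first caveat in action (toy data):

```r
library(data.table)

d <- data.table(
  id = 1:2,
  a  = factor(c("low", "high"), levels = c("low", "high")),
  b  = factor(c("high", "low"), levels = c("low", "high"))
)

m1 <- melt(d, id.vars = "id")                       # default behavior
m2 <- melt(d, id.vars = "id", value.factor = TRUE)  # keep the value column as factor

# compare the resulting value columns; per the caveat above, factor
# levels may need to be re-applied after the default reshape
str(m1$value)
str(m2$value)
```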
I'm fine with the current trajectory as long as we don't (yet) guarantee `class`-in == `class`-out for all functions in aqp.
**A Plan**

- `data.frame`, `tbl_df`, and `data.table` end-to-end support for functions using / returning `SoilProfileCollection` objects
- `data.table` usage (results are still `data.frame`) in remaining functions where it could give a significant performance gain (`slice`, `slab`, `munsell2rgb`, etc.)
- `SoilProfileCollection` objects will respect the input table class

**Related issues**
- #3
- #79
- #114
- #115
- #118
- #129
- #134
- #135