Closed jarioksa closed 6 years ago
That's very interesting Jari.
Stepping back slightly, the main issue is that a fundamental concept of the %>%
is that it involves data frames, which seems somewhat incompatible with ordination results which are naturally handled as a list of data frames. My only concern therefore is that adopting the %>%
in this way breaks that general concept. That said I don't think this is something that Hadley & co get or necessarily want to enforce.
The type = 'n'
bit is a little odd, but unavoidable without making more significant changes. Adding one extra step might alleviate that; what about basing this around scores()
and plot that? For example:
m %>% scores(m, scaling="si", corr=TRUE) %>%
plot() %>%
points("sites", pch=16, col=dune.env$Management) %>%
text("species", col=4)
It would require new wrappers around existing points()
, text()
etc, but they'd be relatively thin wrappers I imagine, and a new plot()
method for scores()
. These would still work by passing around lists of scores, not the fortified versions that I'm using in ggvegan.
Re:
ggplot2 uses a matrix-like data frame where the score type is given by an indicator variable instead of each being in a separate list element.
This is not something that ggplot insists upon; it's only something that I do in ggvegan to bridge between the tidyverse ecosystem and these kinds of multivariate data types. I'm thinking of changing the autoplot()
methods in ggvegan to not use fortify()
internally as you can use different data frames for layers in ggplot. Currently I'm fortify()
-ing the scores to put them together in a single data frame then select()
-ing to subset out particular score types to add layers.
Jari and Gavin,
Really interesting. Another feature of ggplot2 and other packages which tend use %>%
pipes is that creating the plots or other views (html in case of interactive maps and plots) as side effect is delayed until the user calls the print
method. This would lead to a workflow like this:
p <- makeOrdiplot(m, scaling="si", corr=TRUE) %>%
addPoints("sites", pch=16, col=dune.env$Management) %>%
addText("species", col=4)
p # plotting happens
The ordiplot object p
would store the settings and data required for the graphics, and could be further manipulated by ggvegan functions. I am particularly thinking of leaflet and mapview packages as models.
Peter,
This basically sounds like developing our own wrappers around ggplot2 which seems suboptimal. Simple modifications to make %>%
work with some base graphics calls is probably reasonable. Adding fortify()
methods is what I envisage ggvegan providing — otherwise vegan would need to depend on ggplot2, which brings with it another pile of dependencies.
One option might be to provide modified versions of the geom_
functions (geom_ordipoints()
for example) which were thin wrappers around the equivalentggplot2 geoms but which allowed one to specify which set of scores to work with in that layer.
I'm not sold on having a hybrid system of base graphics and delayed plotting which your workflow implies?
@gavinsimpson and @jarioksa : I started to implement what I described and thought will be easy and eventually it turned into a rabbit hole. Lazy evaluation is not a good idea for plotting ordination because we mutate the state of the ordiplot
object and not the graphical device (i.e. the 1st argument does not know about previous graphical parameters). This would require substantial changes, and as Jari demonstrated, minimal changes are sufficient with the type = "n"
argument.
My intention was not to enforce a new idiom. The %>%
pipe was just an alternative for this current idiom:
# works in vegan release
pl <- plot(m, type="n", scaling="si", corr=TRUE)
points(pl, "sites", pch=16, col=dune.env$Management)
text(pl, "species", col=4)
Making points.ordiplot
and text.ordiplot
pass through the invisible "ordiplot"
object allows pipeline as an alternative to saving and reusing the invisible(ordiplot.object)
. Something easier is needed in particular for constrained ordination plots which have several alternatives of scores and scaling. I think that the alternative above is not much used (even I tend to forget it although I invented it sometimes in 2003), and instead people (I included) often go these tedious lengths:
# assume 'm' is an rda() result
plot(m, type="n", scaling="si", corr=TRUE)
points(m, "sites", pch=16, col=dune.env$Management, scaling="si")
text(m, "species", col=4, scaling="si", corr=TRUE)
and if you don't get confused and forget some scaling
at this stage, you're bound to do that at the next step. The advantage of ordiplot
objects is that they keep the scaling
, correlation
and hill
arguments and save your from typing errors and complicated commands. The change I suggested only adds an option to pipe the commands instead of catching and re-using the invisible result of the first plot()
.
There are some obvious problems in either method:
"wa"
, "sp"
and "cn"
scores. If you don't plot all these items, your axis may be poorly scaled, like @psolymos noted"ordiplot"
object, you can only plot elements it contains. The default, for instance, does not have "lc"
scores and does not allow using them. points
and text
methods, and the "ordiplot"
methods do not handle biplot arrows and you cannot add them as a layer. The "cca"
method can add biplot arrows, but then you need to use the original ordination object and remember to scale it correctly. This could be and should be added whether we use %>%
or save & reuse the plotting object.points
, text
) we can easily provide a pipable alternative, but this is trickier for other functions that add items such as envfit
, ordihull
, ordiellipse
etc, because these already return something else. Users may expect that they are pipable if points
and text
are, and they report the problems. These could be changed to be pipable, i.e., to pass on the plotting object, but this is more drastic change because they already return their own result objects.Her is a pipeline alternative that takes care of correct scaling. Here I combine the ideas of @psolymos and @gavinsimpson:
# conceptual: does not work
scores(m, display="all", scaling="si", corr=TRUE) %>%
addPoints("sites", pch=16, col=dune.env$Management) %>%
addText("species", col=4) %>%
plot()
Here scores()
function would extract all possible scores from the result object with the defined scaling, and add*
functions would pass on these scores unmodified and only add information that they want certain type of scores with given graphical properties, and then plot()
function would only display what was requested and take no parameters (except, perhaps something like xlab
, ylab
, main
). It is easy to see that this re-invents ggplot2 framework in the traditional graphics, and similar idiom would be:
# conceptual, does not work
ordigg(m) + geom_points(...) + ... + geom_text(...)
where ordigg
takes the place of scores
, and geom_*
the place of add*
. I think it would be better to develop ggvegan than a traditional graphics version of the same.
Something should be done for making constrained ordination objects more easily configurable. The dumb plotting is just as simple as plot()
, but as soon as you want to change anything, the task gets exponentially harder.
We discussed plot.cca
configuration when @gavinsimpson implemented biplot.rda
function. There he ended up allowing graphical parameters of length two, but in PCA things are much simpler. Then I rejected the idea of allowing similar configurability in plot.cca
, and I still do. I don't want to implement monsters like this:
# Doesn't work, and I'm happy it doesn't
plot(m, scaling="si", corr=TRUE,
display = list("si" = list(type="p", pch=16, col=dune.env$Management),
"sp" = list(type="t", col=4))
)
And this is a simple example for just two kind of scores.
At the second (or third or something) thought, I think we could implement flexible plot.cca
configuration with display = list('score' = list(graphic-options))
. Nobody needs to use it if it is cumbersome, but using it looks simpler than any pipeline I can imagine -- we should still keep the current behaviour as an alternative. When I rejected the idea, I was thinking about the control = list()
style of many R functions (see, e.g., optim
), which can be painful (though not as painful as configuring lattice graphics).
I also think that we should add arrow drawing to points/text.ordiplot
functions to draw biplot arrows etc. Also, I think we should have these functions to pass through the original input object. This allows pipelines, but does not enforce them.
Like my friend Dave Roberts said: give them so much rope that they can hang themselves, and then two feet more. That's the good old unix tradition. Some call it freedom.
I have now merged new pass-thru code in points
and text.ordiplot
. I also added one new keyword arrows
to draw arrows from the origin to the scores, plus made the function to use this always with biplot
and regression
scores of "cca"
objects. This means that now constrained ordination graphics can be drawn completely with ordiplot
support functions, and it is also possible to draw the biplot()
graph with these commands.
The new approach enables magrittr pipes, but does not enforce (or encourage) their use. With the new code, the following functions give the same graph, and all draw a biplot with species as arrows:
library(vegan)
library(magrittr)
data(dune, dune.env)
mod <- rda(dune)
## method 1: re-use invisible object
pl <- plot(mod, type="n", scaling="site")
points(pl, "si", pch=21, bg=dune.env$Management)
text(pl, "sp", arrows=TRUE, col=4, cex=0.7, xpd=TRUE, len=0.05)
## method 2: pipe
plot(mod, type="n", scaling="site") %>%
points("si", pch=21, bg=dune.env$Management) %>%
text("sp", arrows=TRUE, col=4, cex=0.7, xpd=TRUE, len=0.05)
## méthode 3: ceci n'est pas une pipe
text(points(plot(mod, type="n", scaling="site"), "si", pch=21, bg=dune.env$Management),
"sp", arrows=TRUE, col=4, cex=0.7, xpd=TRUE, len=0.05)
The last one is not pretty, but it works. @gavinsimpson may be disturbed by warning messages that we get by setting the "length" (actually, width) of arrow heads. However, biplot.rda
could be written with ordiplot
functions. The first method worked with the old functions (except that they did not yet have the arrows
option), but the two others need points
and text
pass through the plotting object.
I'm not sure if this is within the scope of this issue, but what's vegan's stance on the style of Hadley Wickham / The Tidyverse? I ask because I just wrote this:
phyloseq_object %>% otu_table %>% as("matrix") %>% t %>%
betadiver(method = "j", triangular = F) %>%
as.matrix %>% reshape2::melt(varnames = c("S1", "S2")) %>%
head
This is super ugly, mostly because I'm converting my data into a vegan friendly matrix, than getting it back into a Tidyverse friendly data.frame.
When the majority of a project is stylistically Tidyverse, Vegan's old-school R feels clunky.
Is there a better way to handle this? Should I get better at old-school R? 😉
@colinbrislawn : I am not sure I should be offended by calling things that don't break other things very often and don't accumulate dependencies exponentially old-school, but I won't 😉 ... instead here is a bit of rationale why community data cannot be considered tidy IMO.
A tidy representation has fields for space, time, taxonomy, etc. When we make a pivot table, e.g. sites x species, it can be used to cluster by rows or by columns, and we often make ordination biplots. There is no priority given to any of the dimensions. Somewhat similar argument applies to the output objects, which are rarely data frames (these methods are called multivariate for this reason).
If you consider the observations to be a single species in a plot, a tidy representation makes perfect sense. If you consider the observation to be the whole quadrat for example (with variable number of species from one site to the other), then a matrix is the right representation. The long (tidy) format is usually convenient from a storage perspective (why store lots of 0s).
BTW, vegan is usually satisfied with data frame inputs calling as.matrix
internally.
Understood! The Tidyverse development model is manic, and with 10+ years of solid development and 12,000 citations, vegan is clearly doing something right! I definitely appreciate that vegan is fast and fully featured in base R.
I also appreciate your description of the challenges of representing community data in a way that satisfies the Tidy format. Packages like broom take the results of stat tests and presents them as data.frames, which is super powerful. I was wondering if vegan had something similar.
Thank you for your time!
I'm obviously late to the party (Jari just brought this to my attention today), but there are many issues associated with tidyverse that are problematic for community ecology data sets and analysis. labdsv has had wide and long format for taxa data for well more than a decade, but that only facilitates entering and storing sparse matrices; a full rectangular matrix is essential for the analyses we do. More importantly, we (almost always) have two separate data.frames (one for community data and one for site data), and we guarantee the association between the two with identical row.names. Tibble's war on row.names seems extremely counter-productive if you come from a database background.
You can argue about ggplot2's aesthetics (pardon the pun), but until it supports interaction with a mouse I'm not tempted to implement it in labdsv. Less important, but significant to me, I object to ggplot2 breaking the syntactical model long employed by formula-based functions in R. The ggformula package makes clear that the change was unnecessary, and simple examples like
library(ggformula) gf_point(mpg ~ hp, data = mtcars)
point to the continued use of R syntax in new graphic devices.
magrittr goes further, and converts algorithms to recipes. I'm sure that's the way some people think , but the necessity to write new functions like multiply_by() just makes it seem silly.
As the old joke goes, I can write FORTRAN in any language, and I have struggled with some of R's pre-tidy conventions, but the new ones don't wok for me.
Thanks for replying Dave. What a post! I think you got right to the heart of the issue:
As the old joke goes, I can write FORTRAN in any language,
Well sure you can, but that's like trying to write a loop in Haskell. This is a functional language, what are you doing!? Have mercy!
Tidy Data and data.frames %>% in %>% pipes
is a new linguistic paradigm. We are arguing about which paradigm is better, and which one to support.
I'm thoroughly addicted to the Tidy Data programming model, even its crazy use-cases.
More importantly, we (almost always) have two separate data.frames
If not more! 1) feature abundance table, 2) sample annotations / environmental metadata 3) feature annotations / taxonomy names 4) feature annotations / genetic sequence ATCG 5) phylogenetic tree of features (which isn't a data.frame... unless)
And sometimes I psmelt()
the first three into a single data.frame like a madman!
I have struggled with some of R's pre-tidy conventions, but the new ones don't work for me.
I have the exact opposite experience; I could feel my work speed up every time I added a tool like mutate
to my tool belt, and packages like checkpoint
let me manage a gratuitous number of dependencies.
I'm not asking the Vegan development team to change paradigm. I just want a way to get Tidy Data in and out of Vegan. Is this within the scope of the Vegan package?
Since I'm not a contributor to vegan I'll let the developers respond.
Getting tidy data out of vegan is the aim of the ggvegan package. Getting tidy data in is unlikely to happen with vegan; the user needs to provide data in an appropriate format just like you would with a regression model.
Getting tidy data out of vegan is the aim of the ggvegan package.
Thanks for writing that package. I'll check it out!
Getting tidy data in is unlikely to happen with vegan; the user needs to provide data in an appropriate format just like you would with a regression model.
That's fair. And tidy data is easy to reformat.
So it sounds like Tidy Data is outside of the scope of Vegan: One package for the Tinyverse, one package for the Tidyverse. Sounds like a good plan to me!
Thanks for walking me through the design patterns of Vegan, and putting up with a biologist from the Tidyverse. ❤️
We want to have clean data in vegan. Clean means that we do not want to mix non-data items with data, but data and its presentation are different things.
Vegan is intended for simple manipulation of clean data. It is true that data often are complex, but such complicated cases are better handled with S4-style data objects. For instance, phyloseq package has such S4 data objects which combine clean data with phylogeny of taxa, and it has S4 methods to make these objects clean for analysis with other functions, including vegan. Some other packages may have similar S4 classes for data combined with spatial structure, hierarchical sampling, repeated sampling or time series, species traits etc. Usually these work nicely as long as you have data in these idioms, but fall short if you have something different or more complicated (say spatially structured data with phylogenies and species traits). Such cases often are more easily handled with dedicated methods with simple clean data matrices.
I have no intention to have S4 classes in vegan: the more I look at S4 objects the less I like them. Neither I wish to change clean vegan structures into tidy ones. However, I wish to keep vegan compatible with these structures with the help of interface methods and packages. (I have checked that vegan passes its tests if all its data frames changed to tibbles.) I also think that it makes no harm to allow magrittr style chaining of functions, but I do not want force people to that idiom (btw, this plotting example is not a pipe but a chain on unchanged input objects). Further, I do not want to lock people in vegan, but I have tried to be a good R citizen and allow the use of vegan functions in pipelines with functions from other packages and allow the re-use of vegan functions within other functions.
@gavinsimpson is working to implement ggplot2 plotting for vegan objects in his ggvegan package. This is a welcome addition, and really best kept in its separate package. However, I've been wondering if we should add some support for ggplot2 and other new (and divisive) techniques within vegan as well.
One simple thing that I just inspected was to add magrittr pipe support in traditional plotting functions. These seem to help in one of tedious and common vegan tasks: changing style of basic vegan ordination
plot
functions. The usual ordinationplot
does not allow changing its parameters, but you must first callplot(..., type="n")
and then usetext
andpoints
with the result. If you used non-defaultscaling
(orhill
orcorrelation
), you must remember to use the same setting in these later functions as well. This is boring, tedious and error-prone. However, with a very trivial and small change in vegan (commit 14747b7e5426e4d2a587b429cad5b7a7580dc8f1), you can actually use magrittr pipe and have:This works, because standard vegan
plot
functions return invisibly anordiplot
object with all necessary scores withtext
andpoints
methods. The trivial change was that I made thesepoints
andtext
to pass on invisibly this very sameordiplot
object so that it can be used by the next command in the pipeline. This looks like a much nicer idiom than the old one. However, this works only half-way: this does not allow adding biplot arrows in constrained ordination, nor any other extra layers (envfit
results, elements like convex hull, ellipses etc.), and will only usepoints.ordiplot
andtext.ordiplot
even when we now have dedicatedtext
andpoints
functions to separate ordination function. If the new idiom looks like an improvement, we could see opportunities to replace these dedicated functions with genericordiplot
methods and add missing pieces for adding arrows (which would work both for fitted vectors, vector constraints or biplot arrows for species) and perhaps also consider adding extra layers like convex hulls etc.The new idiom was easily added because of two design decisions that I made in 2002 and 2003 and that look good in retrospect: Accessing ordination objects with
scores
function, and havingplot
functions to return an genericordiplot
object. This means also that we could change plotting structures internally by simultaneous update ofscores
(and only breaking scripts and dependent packages that bypassscores
and access results directly). Theordiplot
result object is a true blue R list. Now it seems that the major difference between ggplot2 and R is that ggplot2 uses a matrix-like data frame where the score type is given by an indicator variable instead of each being in a separate list element. We could quite well adopt this kind of structure either forscores
or just forordiplot
and implement it inscores.ordiplot
if this would help in interfacing vegan with ggvegan. I don't know the ggplot2 internals at all, but I'm willing to adapt vegan when it helps in producing alternative graphics in flexible way.