New plotting paradigms in vegan?

vegandevs / vegan

R package for community ecologists: popular ordination methods, ecological null models & diversity analysis

https://vegandevs.github.io/vegan/

GNU General Public License v2.0

442 stars 96 forks source link

New plotting paradigms in vegan? #257

Closed jarioksa closed 6 years ago

jarioksa commented 6 years ago

@gavinsimpson is working to implement ggplot2 plotting for vegan objects in his ggvegan package. This is a welcome addition, and really best kept in its separate package. However, I've been wondering if we should add some support for ggplot2 and other new (and divisive) techniques within vegan as well.

One simple thing that I just inspected was to add magrittr pipe support in traditional plotting functions. These seem to help in one of tedious and common vegan tasks: changing style of basic vegan ordination plot functions. The usual ordination plot does not allow changing its parameters, but you must first call plot(..., type="n") and then use text and points with the result. If you used non-default scaling (or hill or correlation), you must remember to use the same setting in these later functions as well. This is boring, tedious and error-prone. However, with a very trivial and small change in vegan (commit 14747b7e5426e4d2a587b429cad5b7a7580dc8f1), you can actually use magrittr pipe and have:

library(vegan)
library(magrittr)
data(dune, dune.env)
m <- rda(dune) # I resisted temptation to write dune %>% cca() %>% ...
## requires commit 14747b7e5426e4d2a587b429cad5b7a7580dc8f1
m %>% plot(type="n", scaling="si", corr=TRUE) %>% 
   points("sites", pch=16, col=dune.env$Management) %>% 
   text("species", col=4)

This works, because standard vegan plot functions return invisibly an ordiplot object with all necessary scores with text and points methods. The trivial change was that I made these points and text to pass on invisibly this very same ordiplot object so that it can be used by the next command in the pipeline. This looks like a much nicer idiom than the old one. However, this works only half-way: this does not allow adding biplot arrows in constrained ordination, nor any other extra layers (envfit results, elements like convex hull, ellipses etc.), and will only use points.ordiplot and text.ordiplot even when we now have dedicated text and points functions to separate ordination function. If the new idiom looks like an improvement, we could see opportunities to replace these dedicated functions with generic ordiplot methods and add missing pieces for adding arrows (which would work both for fitted vectors, vector constraints or biplot arrows for species) and perhaps also consider adding extra layers like convex hulls etc.

The new idiom was easily added because of two design decisions that I made in 2002 and 2003 and that look good in retrospect: Accessing ordination objects with scores function, and having plot functions to return an generic ordiplot object. This means also that we could change plotting structures internally by simultaneous update of scores (and only breaking scripts and dependent packages that bypass scores and access results directly). The ordiplot result object is a true blue R list. Now it seems that the major difference between ggplot2 and R is that ggplot2 uses a matrix-like data frame where the score type is given by an indicator variable instead of each being in a separate list element. We could quite well adopt this kind of structure either for scores or just for ordiplot and implement it in scores.ordiplot if this would help in interfacing vegan with ggvegan. I don't know the ggplot2 internals at all, but I'm willing to adapt vegan when it helps in producing alternative graphics in flexible way.

gavinsimpson commented 6 years ago

That's very interesting Jari.

Stepping back slightly, the main issue is that a fundamental concept of the %>% is that it involves data frames, which seems somewhat incompatible with ordination results which are naturally handled as a list of data frames. My only concern therefore is that adopting the %>% in this way breaks that general concept. That said I don't think this is something that Hadley & co get or necessarily want to enforce.

The type = 'n' bit is a little odd, but unavoidable without making more significant changes. Adding one extra step might alleviate that; what about basing this around scores() and plot that? For example:

m %>% scores(m, scaling="si", corr=TRUE) %>%
   plot() %>% 
   points("sites", pch=16, col=dune.env$Management) %>% 
   text("species", col=4)

It would require new wrappers around existing points(), text() etc, but they'd be relatively thin wrappers I imagine, and a new plot() method for scores(). These would still work by passing around lists of scores, not the fortified versions that I'm using in ggvegan.

Re:

ggplot2 uses a matrix-like data frame where the score type is given by an indicator variable instead of each being in a separate list element.

This is not something that ggplot insists upon; it's only something that I do in ggvegan to bridge between the tidyverse ecosystem and these kinds of multivariate data types. I'm thinking of changing the autoplot() methods in ggvegan to not use fortify() internally as you can use different data frames for layers in ggplot. Currently I'm fortify()-ing the scores to put them together in a single data frame then select()-ing to subset out particular score types to add layers.

psolymos commented 6 years ago

Jari and Gavin,

Really interesting. Another feature of ggplot2 and other packages which tend use %>% pipes is that creating the plots or other views (html in case of interactive maps and plots) as side effect is delayed until the user calls the print method. This would lead to a workflow like this:

p <- makeOrdiplot(m, scaling="si", corr=TRUE) %>% 
    addPoints("sites", pch=16, col=dune.env$Management) %>%
    addText("species", col=4)
p # plotting happens

The ordiplot object p would store the settings and data required for the graphics, and could be further manipulated by ggvegan functions. I am particularly thinking of leaflet and mapview packages as models.

gavinsimpson commented 6 years ago

Peter,

This basically sounds like developing our own wrappers around ggplot2 which seems suboptimal. Simple modifications to make %>% work with some base graphics calls is probably reasonable. Adding fortify() methods is what I envisage ggvegan providing — otherwise vegan would need to depend on ggplot2, which brings with it another pile of dependencies.

One option might be to provide modified versions of the geom_ functions (geom_ordipoints() for example) which were thin wrappers around the equivalentggplot2 geoms but which allowed one to specify which set of scores to work with in that layer.

I'm not sold on having a hybrid system of base graphics and delayed plotting which your workflow implies?

psolymos commented 6 years ago

@gavinsimpson and @jarioksa : I started to implement what I described and thought will be easy and eventually it turned into a rabbit hole. Lazy evaluation is not a good idea for plotting ordination because we mutate the state of the ordiplot object and not the graphical device (i.e. the 1st argument does not know about previous graphical parameters). This would require substantial changes, and as Jari demonstrated, minimal changes are sufficient with the type = "n" argument.

jarioksa commented 6 years ago

My intention was not to enforce a new idiom. The %>% pipe was just an alternative for this current idiom:

# works in vegan release
pl <- plot(m, type="n", scaling="si", corr=TRUE) 
points(pl, "sites", pch=16, col=dune.env$Management)
text(pl, "species", col=4)

Making points.ordiplot and text.ordiplot pass through the invisible "ordiplot" object allows pipeline as an alternative to saving and reusing the invisible(ordiplot.object). Something easier is needed in particular for constrained ordination plots which have several alternatives of scores and scaling. I think that the alternative above is not much used (even I tend to forget it although I invented it sometimes in 2003), and instead people (I included) often go these tedious lengths:

# assume 'm' is an rda() result
plot(m, type="n", scaling="si", corr=TRUE) 
points(m, "sites", pch=16, col=dune.env$Management, scaling="si")
text(m, "species", col=4, scaling="si", corr=TRUE)

and if you don't get confused and forget some scaling at this stage, you're bound to do that at the next step. The advantage of ordiplot objects is that they keep the scaling, correlation and hill arguments and save your from typing errors and complicated commands. The change I suggested only adds an option to pipe the commands instead of catching and re-using the invisible result of the first plot().

There are some obvious problems in either method:

Currently the first command sets the framework (axis ranges) and it defaults to use "wa" , "sp" and "cn" scores. If you don't plot all these items, your axis may be poorly scaled, like @psolymos noted
If you reuse the "ordiplot" object, you can only plot elements it contains. The default, for instance, does not have "lc" scores and does not allow using them.
There are currently only points and text methods, and the "ordiplot" methods do not handle biplot arrows and you cannot add them as a layer. The "cca" method can add biplot arrows, but then you need to use the original ordination object and remember to scale it correctly. This could be and should be added whether we use %>% or save & reuse the plotting object.
For basic graphics functions (points, text) we can easily provide a pipable alternative, but this is trickier for other functions that add items such as envfit, ordihull, ordiellipse etc, because these already return something else. Users may expect that they are pipable if points and text are, and they report the problems. These could be changed to be pipable, i.e., to pass on the plotting object, but this is more drastic change because they already return their own result objects.

Her is a pipeline alternative that takes care of correct scaling. Here I combine the ideas of @psolymos and @gavinsimpson:

# conceptual: does not work
scores(m, display="all", scaling="si", corr=TRUE) %>% 
addPoints("sites", pch=16, col=dune.env$Management) %>%
addText("species", col=4) %>%
plot()

Here scores() function would extract all possible scores from the result object with the defined scaling, and add* functions would pass on these scores unmodified and only add information that they want certain type of scores with given graphical properties, and then plot() function would only display what was requested and take no parameters (except, perhaps something like xlab, ylab, main). It is easy to see that this re-invents ggplot2 framework in the traditional graphics, and similar idiom would be:

# conceptual, does not work
ordigg(m) + geom_points(...) + ... + geom_text(...)

where ordigg takes the place of scores, and geom_* the place of add*. I think it would be better to develop ggvegan than a traditional graphics version of the same.

Something should be done for making constrained ordination objects more easily configurable. The dumb plotting is just as simple as plot(), but as soon as you want to change anything, the task gets exponentially harder.

We discussed plot.cca configuration when @gavinsimpson implemented biplot.rda function. There he ended up allowing graphical parameters of length two, but in PCA things are much simpler. Then I rejected the idea of allowing similar configurability in plot.cca, and I still do. I don't want to implement monsters like this:

# Doesn't work, and I'm happy it doesn't
plot(m, scaling="si", corr=TRUE, 
  display = list("si" = list(type="p", pch=16, col=dune.env$Management),
                "sp" = list(type="t", col=4))
)

And this is a simple example for just two kind of scores.

jarioksa commented 6 years ago

At the second (or third or something) thought, I think we could implement flexible plot.cca configuration with display = list('score' = list(graphic-options)). Nobody needs to use it if it is cumbersome, but using it looks simpler than any pipeline I can imagine -- we should still keep the current behaviour as an alternative. When I rejected the idea, I was thinking about the control = list() style of many R functions (see, e.g., optim), which can be painful (though not as painful as configuring lattice graphics).

I also think that we should add arrow drawing to points/text.ordiplot functions to draw biplot arrows etc. Also, I think we should have these functions to pass through the original input object. This allows pipelines, but does not enforce them.

Like my friend Dave Roberts said: give them so much rope that they can hang themselves, and then two feet more. That's the good old unix tradition. Some call it freedom.

jarioksa commented 6 years ago

I have now merged new pass-thru code in points and text.ordiplot. I also added one new keyword arrows to draw arrows from the origin to the scores, plus made the function to use this always with biplot and regression scores of "cca" objects. This means that now constrained ordination graphics can be drawn completely with ordiplot support functions, and it is also possible to draw the biplot() graph with these commands.

The new approach enables magrittr pipes, but does not enforce (or encourage) their use. With the new code, the following functions give the same graph, and all draw a biplot with species as arrows:

library(vegan)
library(magrittr)
data(dune, dune.env)
mod <- rda(dune)

## method 1: re-use invisible object
pl <- plot(mod, type="n", scaling="site")
points(pl, "si", pch=21, bg=dune.env$Management)
text(pl, "sp", arrows=TRUE, col=4, cex=0.7, xpd=TRUE, len=0.05)

## method 2: pipe
plot(mod, type="n", scaling="site") %>%
points("si", pch=21, bg=dune.env$Management) %>%
text("sp", arrows=TRUE, col=4, cex=0.7, xpd=TRUE, len=0.05)

## méthode 3: ceci n'est pas une pipe
text(points(plot(mod, type="n", scaling="site"), "si", pch=21, bg=dune.env$Management), 
    "sp", arrows=TRUE, col=4, cex=0.7, xpd=TRUE, len=0.05)

The last one is not pretty, but it works. @gavinsimpson may be disturbed by warning messages that we get by setting the "length" (actually, width) of arrow heads. However, biplot.rda could be written with ordiplot functions. The first method worked with the old functions (except that they did not yet have the arrows option), but the two others need points and textpass through the plotting object.

colinbrislawn commented 6 years ago

I'm not sure if this is within the scope of this issue, but what's vegan's stance on the style of Hadley Wickham / The Tidyverse? I ask because I just wrote this:

phyloseq_object %>% otu_table %>% as("matrix") %>% t %>%
betadiver(method = "j", triangular = F) %>%
as.matrix %>% reshape2::melt(varnames = c("S1", "S2")) %>%
head

This is super ugly, mostly because I'm converting my data into a vegan friendly matrix, than getting it back into a Tidyverse friendly data.frame.

When the majority of a project is stylistically Tidyverse, Vegan's old-school R feels clunky.

Is there a better way to handle this? Should I get better at old-school R? 😉

psolymos commented 6 years ago

@colinbrislawn : I am not sure I should be offended by calling things that don't break other things very often and don't accumulate dependencies exponentially old-school, but I won't 😉 ... instead here is a bit of rationale why community data cannot be considered tidy IMO.

A tidy representation has fields for space, time, taxonomy, etc. When we make a pivot table, e.g. sites x species, it can be used to cluster by rows or by columns, and we often make ordination biplots. There is no priority given to any of the dimensions. Somewhat similar argument applies to the output objects, which are rarely data frames (these methods are called multivariate for this reason).

If you consider the observations to be a single species in a plot, a tidy representation makes perfect sense. If you consider the observation to be the whole quadrat for example (with variable number of species from one site to the other), then a matrix is the right representation. The long (tidy) format is usually convenient from a storage perspective (why store lots of 0s).

BTW, vegan is usually satisfied with data frame inputs calling as.matrix internally.

colinbrislawn commented 6 years ago

Understood! The Tidyverse development model is manic, and with 10+ years of solid development and 12,000 citations, vegan is clearly doing something right! I definitely appreciate that vegan is fast and fully featured in base R.

I also appreciate your description of the challenges of representing community data in a way that satisfies the Tidy format. Packages like broom take the results of stat tests and presents them as data.frames, which is super powerful. I was wondering if vegan had something similar.

Thank you for your time!

dvrbts commented 6 years ago

I'm obviously late to the party (Jari just brought this to my attention today), but there are many issues associated with tidyverse that are problematic for community ecology data sets and analysis. labdsv has had wide and long format for taxa data for well more than a decade, but that only facilitates entering and storing sparse matrices; a full rectangular matrix is essential for the analyses we do. More importantly, we (almost always) have two separate data.frames (one for community data and one for site data), and we guarantee the association between the two with identical row.names. Tibble's war on row.names seems extremely counter-productive if you come from a database background.

You can argue about ggplot2's aesthetics (pardon the pun), but until it supports interaction with a mouse I'm not tempted to implement it in labdsv. Less important, but significant to me, I object to ggplot2 breaking the syntactical model long employed by formula-based functions in R. The ggformula package makes clear that the change was unnecessary, and simple examples like

library(ggformula) gf_point(mpg ~ hp, data = mtcars)

point to the continued use of R syntax in new graphic devices.

magrittr goes further, and converts algorithms to recipes. I'm sure that's the way some people think , but the necessity to write new functions like multiply_by() just makes it seem silly.

As the old joke goes, I can write FORTRAN in any language, and I have struggled with some of R's pre-tidy conventions, but the new ones don't wok for me.

colinbrislawn commented 6 years ago

Thanks for replying Dave. What a post! I think you got right to the heart of the issue:

As the old joke goes, I can write FORTRAN in any language,

Well sure you can, but that's like trying to write a loop in Haskell. This is a functional language, what are you doing!? Have mercy!

Tidy Data and data.frames %>% in %>% pipes is a new linguistic paradigm. We are arguing about which paradigm is better, and which one to support.

I'm thoroughly addicted to the Tidy Data programming model, even its crazy use-cases.

More importantly, we (almost always) have two separate data.frames

If not more! 1) feature abundance table, 2) sample annotations / environmental metadata 3) feature annotations / taxonomy names 4) feature annotations / genetic sequence ATCG 5) phylogenetic tree of features (which isn't a data.frame... unless)

And sometimes I psmelt() the first three into a single data.frame like a madman!

I have struggled with some of R's pre-tidy conventions, but the new ones don't work for me.

I have the exact opposite experience; I could feel my work speed up every time I added a tool like mutate to my tool belt, and packages like checkpoint let me manage a gratuitous number of dependencies.

I'm not asking the Vegan development team to change paradigm. I just want a way to get Tidy Data in and out of Vegan. Is this within the scope of the Vegan package?

dvrbts commented 6 years ago

Since I'm not a contributor to vegan I'll let the developers respond.

gavinsimpson commented 6 years ago

Getting tidy data out of vegan is the aim of the ggvegan package. Getting tidy data in is unlikely to happen with vegan; the user needs to provide data in an appropriate format just like you would with a regression model.

colinbrislawn commented 6 years ago

Getting tidy data out of vegan is the aim of the ggvegan package.

Thanks for writing that package. I'll check it out!

Getting tidy data in is unlikely to happen with vegan; the user needs to provide data in an appropriate format just like you would with a regression model.

That's fair. And tidy data is easy to reformat.

So it sounds like Tidy Data is outside of the scope of Vegan: One package for the Tinyverse, one package for the Tidyverse. Sounds like a good plan to me!

Thanks for walking me through the design patterns of Vegan, and putting up with a biologist from the Tidyverse. ❤️

jarioksa commented 6 years ago

We want to have clean data in vegan. Clean means that we do not want to mix non-data items with data, but data and its presentation are different things.

Vegan is intended for simple manipulation of clean data. It is true that data often are complex, but such complicated cases are better handled with S4-style data objects. For instance, phyloseq package has such S4 data objects which combine clean data with phylogeny of taxa, and it has S4 methods to make these objects clean for analysis with other functions, including vegan. Some other packages may have similar S4 classes for data combined with spatial structure, hierarchical sampling, repeated sampling or time series, species traits etc. Usually these work nicely as long as you have data in these idioms, but fall short if you have something different or more complicated (say spatially structured data with phylogenies and species traits). Such cases often are more easily handled with dedicated methods with simple clean data matrices.

I have no intention to have S4 classes in vegan: the more I look at S4 objects the less I like them. Neither I wish to change clean vegan structures into tidy ones. However, I wish to keep vegan compatible with these structures with the help of interface methods and packages. (I have checked that vegan passes its tests if all its data frames changed to tibbles.) I also think that it makes no harm to allow magrittr style chaining of functions, but I do not want force people to that idiom (btw, this plotting example is not a pipe but a chain on unchanged input objects). Further, I do not want to lock people in vegan, but I have tried to be a good R citizen and allow the use of vegan functions in pipelines with functions from other packages and allow the re-use of vegan functions within other functions.