qiime2 / q2-vizard

The first choice of wizard lizards for interactive, generalized microbiome data visualization!
BSD 3-Clause "New" or "Revised" License

NEW: adds lineplot #17

Closed lizgehret closed 1 month ago

lizgehret commented 3 months ago

okay some philosophical thoughts on how functionality should work here - re-thinking this as being similar to scatterplot (because it's not as similar as i originally thought).

my instinct here is to make this less of an exploratory plot than w/scatter (where you can utilize drop downs to view all different groupings of x/y data) since this feels like there's more intention required to view something meaningful. also open to ideas about why the drop downs should be retained (i could totally be missing why that would make sense).

nbokulich commented 3 months ago

Hi @lizgehret I see what you mean that this is not as flexible as a scatterplot. Similar to the scatterplot:

  1. x and y must both be numeric
  2. a categorical variable (or numeric variable) can be used for grouping (coloring and curves here; in scatterplot, coloring only for now, but possibly other options eventually)

It seems like the main constraint is that the grouping variable should have multiple values of x to create a line/curve. So what if the following is done:

  1. all numeric metadata columns can be used for y and are added to the drop-down menu for y.
  2. all numeric metadata columns passed as input are screened to see which columns have at least 2 unique values. This forms the drop-down options for x. (but I have an edge case to think about*)
  3. all categorical and numeric columns that have at least 2 unique values form the options for "grouping"
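
A minimal sketch of the three screening rules above, written in plain Python rather than taken from the plugin. The function name `dropdown_options` and the dict-based inputs are assumptions made for illustration only; the real visualizer works on QIIME 2 metadata and may apply these rules differently.

```python
# Hypothetical sketch: plain dicts stand in for a metadata table.
def dropdown_options(numeric, categorical):
    """Return candidate columns for the y, x, and grouping drop-downs.

    numeric / categorical map column name -> list of values (None = missing).
    """
    def n_unique(values):
        # Count distinct non-missing values in a column.
        return len({v for v in values if v is not None})

    # Rule 1: every numeric column is a candidate for y.
    y_options = sorted(numeric)
    # Rule 2: numeric columns with at least 2 unique values are candidates for x.
    x_options = sorted(c for c, v in numeric.items() if n_unique(v) >= 2)
    # Rule 3: any column with at least 2 unique values can be used for grouping.
    all_columns = {**numeric, **categorical}
    group_options = sorted(c for c, v in all_columns.items() if n_unique(v) >= 2)
    return {"y": y_options, "x": x_options, "group": group_options}


numeric = {
    "shannon": [2.1, 3.4, 2.9, 3.0],
    "day": [1, 1, 7, 7],
    "batch": [1, 1, 1, 1],
}
categorical = {
    "genotype": ["wt", "wt", "ko", "ko"],
    "site": ["gut", "gut", "gut", "gut"],
}
options = dropdown_options(numeric, categorical)
```

Here the constant columns `batch` and `site` are dropped from the x and grouping options, while `batch` still remains available for y.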

if the 'grouped' param is False, all values in x measure will be ordered (and maybe error if any identical values are found?) and then plotted against the y measure.

There should not be an error if there are identical values of x — whether grouping is True/False we would expect that there could be replicate values (e.g., biological or technical replicates), this would be typical in many experiments. In fact, these could/should be used (when available) to calculate a confidence interval that could be toggled on/off.
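
As a rough illustration of how replicate y values at a single x could feed a toggleable confidence interval, here is a sketch using only the Python standard library. The function name `replicate_interval` and the choice of a normal approximation are assumptions, not something decided in this thread.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def replicate_interval(ys, level=0.95):
    """Return (mean, lower, upper) for the replicate y values at one x.

    Uses a normal approximation (an assumption of this sketch); the bounds
    are None when fewer than two replicates are available.
    """
    centre = mean(ys)
    if len(ys) < 2:
        return centre, None, None
    # Two-sided critical value, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(ys) / sqrt(len(ys))
    return centre, centre - half_width, centre + half_width


centre, lower, upper = replicate_interval([2.0, 4.0, 6.0])
```

With few replicates per x value, a t-based interval would likely be more appropriate than the normal approximation used here.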

my instinct here is to make this less of an exploratory plot than w/scatter

I agree. Scatter is very exploratory. Line/curveplots less so, but still to some degree. Here are some concrete cases where we would want to use this in an exploratory fashion and give flexibility with drop-downs:

  1. passing multiple alpha vectors alongside numeric metadata; users may want to click through the different metrics to display on the y-axis for visualizing correlations with some numeric variable x (basically, a replacement for the basic alpha correlation plots we have now). Other variables (e.g., PC coordinates) could also be passed for y, so this could be extended to allow alpha or beta correlation.
  2. same for alpha rarefaction (though currently this is a box + line plot, so maybe needs a different visualization, but I think that a curveplot + confidence interval would be a fine representation)
  3. In other plugins like q2-quality-control, RESCRIPt, etc., there are some actions that create multiple different scores that could be displayed on a curveplot. Right now q2-qc has static line plots showing accuracy at multiple depths; RESCRIPt uses q2-longitudinal's volatility plots to also display accuracy at different taxonomic ranks.

To contrast: we have q2-longitudinal's volatility plots for less exploratory, more structured cases where a user wants to display different y variables against a fixed numeric x variable; categorical grouping variables can be selected from a drop-down, and optionally a fourth (fixed) variable can be used to define individuals within groups that had repeated measures taken. q2-vizard's curveplot should be more flexible and exploratory than this. Users who have, e.g., temporal data should already be using q2-longitudinal anyway, so I see q2-vizard's curveplot as filling the niche for a more general-use line/curveplot.


* Edge case: do we always want to enforce that x is numeric? y must always be numeric. But there are cases when a user may want to pass a categorical variable as x, which is then ordered and used for plotting. One example: showing taxonomic classification accuracy (or other metrics) at multiple taxonomic ranks. Or if a user wants to show alpha diversity (or some other metric) at, say, multiple ordered sites that have categorical variables (e.g., named sites that follow a transect). Obviously that user could just make a numeric encoding, plot, and then relabel. But we should consider either:

  1. allowing categorical values to be used for x, which are then ordered alphabetically (or perhaps based on some instructions that the user can pass)
  2. allowing relabeling (messy!)

Either way, this will take more thought and is an ENH issue for follow up, not something to worry about in the initial version. In the initial version I think curveplot should require numeric values for x.
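
To make the first option in the edge case above concrete, here is a small sketch of turning categorical x labels into plotting positions, either alphabetically or in a user-supplied order. The function `encode_ordered` and its `order` parameter are hypothetical and only illustrate the idea; nothing like this is part of the initial version.

```python
def encode_ordered(labels, order=None):
    """Map categorical labels to integer positions along the x-axis.

    If `order` is given it defines the axis order; otherwise the labels
    are ordered alphabetically. Returns (positions, levels).
    """
    levels = list(order) if order is not None else sorted(set(labels))
    unknown = set(labels) - set(levels)
    if unknown:
        raise ValueError(f"labels missing from order: {sorted(unknown)}")
    position = {level: i for i, level in enumerate(levels)}
    return [position[label] for label in labels], levels


ranks = ["class", "phylum", "genus"]

# Alphabetical ordering puts 'genus' before 'phylum', which is not the
# taxonomic order a user would expect.
alpha_positions, alpha_levels = encode_ordered(ranks)

# A user-supplied order restores the intended sequence.
user_positions, user_levels = encode_ordered(
    ranks, order=["phylum", "class", "genus"]
)
```

The taxonomic-rank example shows why alphabetical ordering alone may not be enough: the meaningful order often has to come from the user.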

lizgehret commented 3 months ago

Hey @nbokulich,

Thanks for such a detailed follow-up! I discussed all of this with @ebolyen this morning, and here's what we came up with (for both V1 and V2 versions):

V1.

V2.

Here's a sketch I made to help with the visual representation of what V1 will look like: [image: IMG_1163]

Let me know if you have any thoughts/concerns about anything above!

nbokulich commented 3 months ago

Hi @lizgehret sounds like a good plan!

Though replicate handling will be important to add soon, as I expect most use cases to have replicates so erroring out will block use until that is added.

lizgehret commented 2 months ago

Notes for myself:

lizgehret commented 2 months ago

Okay this should be ready for a first round of review - @gregcaporaso @nbokulich feel free to play around with this with a few different datasets. I've been using a modified version of PD mice, so any changes that are needed may arise from trying to break things with bigger/different datasets.

lizgehret commented 2 months ago

Something to note is that I refactored the existing tests into a few that test the helper method for measure validation (used heavily for both scatterplot and curveplot) which removes a lot of duplicated tests. This isn't technically within the scope of this new visualizer, but it was helpful to do this here to speed up testing and to clean up tests within curveplot. Apologies in advance for adding to the diff with those changes 🙇🏼‍♀️

nbokulich commented 2 months ago

Hi @lizgehret looks good to me for V1!

I was only able to test with the PD mouse dataset — other/larger datasets I have basically all have X replicates, which adds to my opinion that V2 will be essential before this is ready for action.

I like that line/step options are given! From the illustration above and the name change to curveplot I also assumed that a curve option would be exposed — is this in the planning already? Could be a neat option to add.

The error message about X replicates could be improved if this is a permanent feature, but I think this is only temporary anyway so that's fine.

Note: I only did user testing, I did not review the code.

lizgehret commented 2 months ago

Thanks for the review @nbokulich! A few updates below:

I was only able to test with the PD mouse dataset — other/larger datasets I have basically all have X replicates, which adds to my opinion that V2 will be essential before this is ready for action.

Okay so I spent quite a bit of time discussing this with @gregcaporaso and @cherman2 offline, and here's what we've decided is the best approach. It's important that we are very clear in any extrapolation we are making on a user's data (i.e. creating lines that aren't a direct connection between two existing data points, but instead some average of that particular set of data points) while also providing enough flexibility in the allowable inputs so that users aren't tempted to modify their metadata to work with this visualization (thus inherently masking the true nature of their data).

X and Y will both be fixed numerical measures in this update, and a user must specify explicitly whether or not the measure they're choosing for X will contain replicates. If replicates are enabled, they also must specify the type of extrapolation that will be made on their data (mean, median, mode). They can also select a discrete measure for 'facetBy' (this name replaces 'group' in the current draft). Y is a fixed measure in this update because there is now a requisite pre-processing step that occurs prior to the data being passed into the Vega spec: all Y values associated with the replicate X value(s) are averaged and then passed into the Vega spec as a separate table. That table will be used to draw an 'average' line showing where the average at each set of points falls with respect to the actual data (which will also be plotted as circular marks). I'm also going to add a tooltip to hover over each 'point' (both actual XY points and averaged 'point' locations) that will display the coordinates along with the extrapolation method (if replicates are enabled).
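
The pre-processing step described above can be sketched roughly as follows. This is a standard-library illustration, not the code in this PR: the function name `summarize_replicates`, the list-of-tuples input, and the restriction to mean and median are all assumptions made to keep the example short.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_replicates(points, average):
    """Collapse replicate (x, y) points into one summary y per x value.

    points: iterable of (x, y) pairs; average: 'mean' or 'median'.
    Returns a list of (x, summary_y) sorted by x, i.e. the separate table
    that would back the 'average' line.
    """
    methods = {"mean": mean, "median": median}
    if average not in methods:
        raise ValueError(f"unsupported average method: {average!r}")
    ys_by_x = defaultdict(list)
    for x, y in points:
        ys_by_x[x].append(y)
    return [(x, methods[average](ys)) for x, ys in sorted(ys_by_x.items())]


# Three replicates at x=1 and two at x=7.
points = [(1, 2.0), (1, 4.0), (1, 9.0), (7, 5.0), (7, 7.0)]
mean_line = summarize_replicates(points, "mean")
median_line = summarize_replicates(points, "median")
```

The raw `points` would still be drawn as circular marks, with the summary table drawn as the line on top. With faceting, the same grouping would presumably be keyed on the facet value together with x.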

I like that line/step options are given! From the illustration above and the name change to curveplot I also assumed that a curve option would be exposed — is this in the planning already? Could be a neat option to add.

Good question - the name change was something @ebolyen had suggested since we're technically creating mathematical curves to connect data points. But I think it's going to cause more confusion for folks since this won't include a modeled 'best fit' curve - which I think is what most users associate a curve with. I'm going to revert the name back to lineplot!

gregcaporaso commented 2 months ago

Thanks @lizgehret!

If replicates are enabled, they also must specify the type of extrapolation that will be made on their data (mean, median, mode)

I suggest making the default median in this case, since that's generally the most sensible, but some users might choose mean by default if not provided with a default value.

I also don't think mode is needed and suggest not providing it as an option. I generally find that to not be very useful for continuous values.

lizgehret commented 2 months ago

Okay, updated version should be ready for another round of review (cc @gregcaporaso). I created a little dummy dataset with columns that I used for examples with/without replicates and faceting. Here's what each case looks like:

[images: replicates-faceting-median, replicates-faceting-mean]

Note that I took these two screenshots (no faceting mean/median) prior to adding the subtitle that contains the average method.

[images: replicate-handling-no-facet-mean, replicate-handling-no-facet-median]

[images: no-replicates-faceting, no-replicates-no-faceting]

There is also a tooltip hover for the actual data points that contains all of the metadata. Let me know if there's any other functionality that should be added for Oct release (tests still need to be added).

lizgehret commented 2 months ago

Another note @gregcaporaso is that the way the inputs are configured, there won't be a default for 'average' since it has to be optional (for cases where replicates will be set to False). So users just have to explicitly set either 'median' or 'mean' for their average if they've enabled replicates.
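
A sketch of the input check this implies, under the assumption that the two inputs are a boolean `replicates` flag and an optional `average` string. The function name `validate_average` is hypothetical, and rejecting `average` when replicates are disabled is a guess about the intended behavior, not something stated here.

```python
def validate_average(replicates, average=None):
    """Check the replicates/average combination and return the method to use.

    Returns 'mean' or 'median' when replicates are enabled, otherwise None.
    """
    allowed = ("mean", "median")
    if replicates:
        if average is None:
            raise ValueError(
                "replicates are enabled, so 'average' must be set to "
                "'mean' or 'median'"
            )
        if average not in allowed:
            raise ValueError(f"'average' must be one of {allowed}")
        return average
    # Assumption: an average without replicates is treated as a user error.
    if average is not None:
        raise ValueError("'average' can only be set when replicates are enabled")
    return None


chosen = validate_average(True, "median")
not_needed = validate_average(False)
```

Because the parameter has no default, the choice between median and mean stays an explicit decision by the user whenever replicates are enabled.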

ebolyen commented 1 month ago

@lizgehret there's 3 smallish merge conflicts it looks like