NEW: adds boxplot

lizgehret commented 2 months ago

Copying over my proposed design from basecamp for visibility:

Here's my proposed design for boxplot in q2-vizard:

This visualizer will take a numeric measure (distribution) and a categorical measure (facet_by) for constructing the box plots. Users will input the averaging method they'd like to use (mean or median, with median as the default) as well as the whisker range (choices here: percentile or IQR - let me know what you think makes sense for this). Any data points that fall outside the selected whisker range will be plotted as individual points to represent outliers. Within the actual visualization, there will be a transpose signal that lets users swap the orientation of the box plots (either horizontal or vertical, with horizontal being the default).
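As a rough illustration of the two inputs described above, here is a minimal Python sketch that groups a numeric column by a categorical one and takes the median of each group. The function name `group_distribution`, the record layout, and the column names `shannon` and `body_site` are all hypothetical; this is not q2-vizard code, only a sketch of the grouping idea.

```python
from collections import defaultdict
from statistics import median


def group_distribution(records, distribution, facet_by):
    """Collect the numeric `distribution` values under each `facet_by` category.

    Hypothetical helper for illustration; not part of q2-vizard.
    """
    groups = defaultdict(list)
    for row in records:
        groups[row[facet_by]].append(float(row[distribution]))
    return dict(groups)


# Made-up example records; column names are assumptions.
records = [
    {"sample": "s1", "body_site": "gut", "shannon": 3.1},
    {"sample": "s2", "body_site": "gut", "shannon": 2.7},
    {"sample": "s3", "body_site": "gut", "shannon": 3.5},
    {"sample": "s4", "body_site": "tongue", "shannon": 4.0},
    {"sample": "s5", "body_site": "tongue", "shannon": 4.4},
]

groups = group_distribution(records, "shannon", "body_site")
# One center line per box: the median of each group.
centers = {name: median(vals) for name, vals in groups.items()}
```

Each key in `groups` corresponds to one box in the plot, which is why changing `facet_by` interactively would mean recomputing every group's statistics.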

My initial thought is that the inputs will be fixed rather than offering a drop-down to change how the box plots are grouped, because a drop-down adds a lot of extra overhead in what's pre-computed (prior to the vega spec being rendered) - but it would still be possible if you think that's something that will be really helpful/commonly used by folks.

Here's a rough sketch of what this proposed design would look like:

nbokulich commented 2 months ago

@lizgehret this is great! I will be interested to test this once it is ready! (just tried but could not get it to work with real data; happy to share an error log if this is unexpected but I assume I am just jumping the gun 😁 )

I think that a drop-down for facet_by would be useful. Often when plotting data, users will want to look at different groupings. To give some concrete examples: with environmental/soil data like the EMP data, users might want to look at distributions at different EMPO levels (i.e., different types and subtypes of ecosystems); in human data (e.g., HMP), maybe different body sites and subtypes, or patient categories; in the PD mouse dataset or similarly structured data, they might want to look at multiple categories in the metadata like "host", "donor", and "treatment". Having all of this in a single plot would be convenient; though alternatively there could be multiple plots displayed instead of a drop-down, and the user could input a list of categorical column names to facet_by.

I think that a drop-down for the numeric measure (distribution) would be useful. E.g., if plotting alpha diversity per group, a user might want to toggle between multiple metrics (also for beta diversity, e.g., distributions of pairwise distances). Alternatively, there could be multiple plots displayed in the viz, one per measure selected (distribution could accept a list), but I like the dropdown.

I suggest making percentile the default for whiskers, but it is always a matter of taste and both are common.

lizgehret commented 2 months ago

Thanks for the feedback @nbokulich! I will definitely let you know once this is ready for a test drive - I've still yet to fill in the vega spec 😅 Here are some design updates after a discussion with @ebolyen this morning:

- Drop downs will be left out for now, but vega transforms will be used for handling box/whisker ranges so this could be added in later once our more generalized data inputs (Dist 1D, etc) are used. This ensures that users are being mindful of what data they're putting into this kind of visualization that's a bit more specialized than a basic scatter plot (since we're providing visual representations of data distribution).
- Box extent is always Q1 - Q3 (clarification for myself).
- Multiple average methods won't be supported for the center line - it will always represent the median.
- Whisker options will be:
  - 1.5 IQR (Tukey's IQR)
  - min/max
  - percentile (9/91 - this mirrors demux summarize box plot)
- Any data points that fall out of the chosen whisker range will be plotted as outliers - and a 'suppress outliers' param will be added & that info will be reflected on the plot at the top description of the boxes.
- Transpose signal supports either horizontal or vertical representations for each group.
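
The box and whisker rules listed above can be sketched in plain Python as follows. This is a sketch under stated assumptions, not the vega transforms the plugin uses: the helper names `percentile` and `box_stats`, the option string `"percentile"`, the `suppress_outliers` argument name, and the linear-interpolation percentile method are all illustrative choices. For the Tukey option it follows the common convention of drawing each whisker to the most extreme data point inside the 1.5 IQR fence; the actual spec may differ.

```python
import math


def percentile(sorted_vals, p):
    """Linearly interpolated percentile (0 <= p <= 100) of a sorted, non-empty list."""
    rank = (len(sorted_vals) - 1) * p / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (rank - lo)


def box_stats(values, whisker_range, suppress_outliers=False):
    """Box (Q1-Q3), median center line, whiskers, and outliers for one group.

    Hypothetical helper; option names are assumptions for illustration.
    """
    vals = sorted(values)
    if not vals:
        raise ValueError("need at least one value")
    q1, med, q3 = (percentile(vals, p) for p in (25, 50, 75))

    if whisker_range == "minmax":
        lower, upper = vals[0], vals[-1]
    elif whisker_range == "percentile":
        # 9th/91st percentiles, as in the list above.
        lower, upper = percentile(vals, 9), percentile(vals, 91)
    elif whisker_range == "tukeys_iqr":
        iqr = q3 - q1
        low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        # Whiskers stop at the most extreme data points inside the fences.
        lower = min(v for v in vals if v >= low_fence)
        upper = max(v for v in vals if v <= high_fence)
    else:
        raise ValueError(f"unknown whisker_range: {whisker_range!r}")

    # Anything beyond the whiskers is an outlier unless suppressed.
    outliers = [] if suppress_outliers else [v for v in vals if v < lower or v > upper]
    return {"q1": q1, "median": med, "q3": q3,
            "lower": lower, "upper": upper, "outliers": outliers}


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
tukey = box_stats(values, "tukeys_iqr")
minmax = box_stats(values, "minmax")
pct = box_stats(values, "percentile")
```

With this data the box and center line are the same for all three options; only the whisker ends and the set of points flagged as outliers change.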

Now that things are a bit more fleshed out, I'm going to start working on the actual spec. Should be in a working state sometime next week!

lizgehret commented 2 months ago

The transpose signal is most likely getting punted to v2 because vega doesn't like swapping axes of differing types: https://github.com/vega/vega/issues/1176

lizgehret commented 2 months ago

note to myself: also need to add the same legend[data] hack if group_by field is none

lizgehret commented 2 months ago

This is not quite finished but is now ready for some test driving! 🚘 cc @ebolyen wanna take a look and lmk if there's anything that could be improved/changed? I'll follow up on any requested changes when I'm back next week 🙂

outstanding to-do's:

[x] debug tukeys_iqr (results currently look identical to minmax)
[x] tooltip hover w/summary data for each group
[x] add box_alignment param for either vertical or horizontal box alignment
[x] add additional subtitle text to include 'whiskers drawn using the whisker_range method'
[x] selenium test suite
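
A note that may help when testing the whisker methods: under the usual Tukey convention, whiskers coincide with min/max whenever no value lies beyond the 1.5 IQR fences, so test data needs at least one such value to tell the two methods apart. The sketch below illustrates this with made-up numbers; the function names are hypothetical and it is not a diagnosis of the tukeys_iqr item above.

```python
from statistics import quantiles


def tukey_whiskers(values):
    """Whisker ends at the most extreme points within 1.5 IQR of the box."""
    vals = sorted(values)
    q1, _, q3 = quantiles(vals, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in vals if low_fence <= v <= high_fence]
    return inside[0], inside[-1]


def minmax_whiskers(values):
    """Whisker ends at the smallest and largest values."""
    return min(values), max(values)


no_extremes = [2, 4, 4, 5, 6, 7, 8]
with_extreme = no_extremes + [30]

# No value beyond the fences: both methods give the same whiskers.
same = tukey_whiskers(no_extremes) == minmax_whiskers(no_extremes)
# One extreme value: the Tukey whisker stops short of it, min/max does not.
differ = tukey_whiskers(with_extreme) != minmax_whiskers(with_extreme)
```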

lizgehret commented 1 month ago

Okay, this is finally ready for review @ebolyen!

I may have gone a little overboard with the test suite... but I wanted to make sure all of the visual elements were being tested, as well as the actual stats calculations. I think I've been staring at this for too long, so if anything doesn't look reasonable or there's a better way to organize things, let me know!

qiime2 / q2-vizard

NEW: adds boxplot #29