matsengrp / sumrep

Summary statistics for repertoires

Summary plotting #11

Closed: BrandenOlson closed this issue 5 years ago

BrandenOlson commented 6 years ago

A next step for sumrep is a full plotting feature which takes a dataset and displays plots of each possible summary distribution. The current idea is to restrict to univariate distributions, but it might be possible to include bivariate summaries in a nice way later on.

Since there are at least a dozen summaries to plot, and each summary has its own considerations (discrete vs continuous, range, etc.), I propose to create a separate plotting function for each statistic under consideration. This is similar to how there is a comparison function for each statistic. So, for example, the pairwise distance summary statistic will have three corresponding sumrep functions: getPairwiseDistanceDistribution, comparePairwiseDistanceDistributions, and plotPairwiseDistanceDistribution. This will allow custom x and y labels, custom ranges for support, histograms vs densities, specific legends, etc.
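As a rough sketch of that three-function pattern (not sumrep's actual code): the function names come from the proposal above, but the signatures, the Hamming distance, and the total variation comparison are all assumptions chosen to keep the example small and in base R, with ggplot2 used only for the plot.

```r
# Illustrative sketch only: signatures and bodies are assumptions, not sumrep's code.

# get...: all pairwise Hamming distances between equal-length sequences
# (Hamming distance is a stand-in chosen to keep the example in base R).
getPairwiseDistanceDistribution <- function(sequences) {
  chars <- strsplit(sequences, "")
  pairs <- utils::combn(length(chars), 2)
  apply(pairs, 2, function(idx) sum(chars[[idx[1]]] != chars[[idx[2]]]))
}

# compare...: total variation distance between the two empirical distributions
# (a stand-in divergence for illustration).
comparePairwiseDistanceDistributions <- function(sequences_a, sequences_b) {
  d_a <- getPairwiseDistanceDistribution(sequences_a)
  d_b <- getPairwiseDistanceDistribution(sequences_b)
  n_bins <- max(d_a, d_b) + 1  # distances 0..max map to bins 1..max+1
  p_a <- tabulate(d_a + 1, nbins = n_bins) / length(d_a)
  p_b <- tabulate(d_b + 1, nbins = n_bins) / length(d_b)
  sum(abs(p_a - p_b)) / 2
}

# plot...: histogram with labels and bin width suited to an integer-valued summary.
plotPairwiseDistanceDistribution <- function(sequences) {
  distances <- getPairwiseDistanceDistribution(sequences)
  ggplot2::ggplot(data.frame(distance = distances),
                  ggplot2::aes(x = distance)) +
    ggplot2::geom_histogram(binwidth = 1) +
    ggplot2::labs(x = "Pairwise Hamming distance", y = "Count")
}
```

With ggplot2 installed, `plotPairwiseDistanceDistribution(c("AAAA", "AAAT", "TTTT"))` returns a ggplot object. Keeping the labels and bin width inside the per-statistic function is what lets each summary have its own presentation.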

This will also pave the way to a straightforward "master plotting" function which just iterates over each of these plot... functions, adds the plot to a list, and displays them all in a grid.
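A minimal sketch of what such a master function could look like. The names `collectSummaryPlots` and `plotAllSummaries` are hypothetical, and using `gridExtra::grid.arrange` for the grid is an assumption; any grid layout package would do.

```r
# Hypothetical sketch: function names and the use of gridExtra are assumptions.

# Call every per-statistic plot function on the dataset and collect the results.
# Names of plot_functions are preserved in the returned list.
collectSummaryPlots <- function(dataset, plot_functions) {
  lapply(plot_functions, function(plot_function) plot_function(dataset))
}

# Collect the plots (assumed to be ggplot objects) and display them in a grid.
plotAllSummaries <- function(dataset, plot_functions, ncol = 2) {
  plots <- collectSummaryPlots(dataset, plot_functions)
  gridExtra::grid.arrange(grobs = plots, ncol = ncol)
}
```

Passing the plot functions in as a named list keeps the master function unaware of any individual statistic, so adding a new summary plot only means adding one entry to the list.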

It would be nice to allow for multiple datasets to be plotted within each function as a future addition. This framework should make that relatively painless.

@matsen - let me know your thoughts when you can. I was hoping to implement this by the next software WG meeting, or at least finish a basic proof of concept.

matsen commented 6 years ago

I think that this is great, of course!

I remind you that we'd like to be able to leave the door open to displaying these sorts of summary plots as part of olmsted. Although that's not on the near-term roadmap, could you check in with @eharkins to think about intermediate data exports that could be consumed by him? Don't let this slow you down though.

BrandenOlson commented 5 years ago

The next step will be to add support for ECDFs (empirical cumulative distribution functions) and frequency polygons, both via ggplot2. I might stop there, since the possibilities for plotting are endless, unless anyone has specific requests.
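For reference, ggplot2 provides both as layers: `stat_ecdf()` and `geom_freqpoly()`. A minimal sketch of switching between plot types follows; the wrapper `plotDistribution` and its `type` argument are hypothetical, not part of sumrep.

```r
# Hypothetical wrapper: the name and arguments are assumptions for illustration.
plotDistribution <- function(values,
                             type = c("histogram", "ecdf", "freqpoly"),
                             binwidth = 1) {
  type <- match.arg(type)  # rejects unsupported plot types up front
  base_plot <- ggplot2::ggplot(data.frame(value = values),
                               ggplot2::aes(x = value))
  # Only the selected branch of switch() is evaluated.
  layer <- switch(type,
    histogram = ggplot2::geom_histogram(binwidth = binwidth),
    ecdf      = ggplot2::stat_ecdf(),
    freqpoly  = ggplot2::geom_freqpoly(binwidth = binwidth)
  )
  base_plot + layer
}
```

For example, `plotDistribution(rnorm(100), type = "ecdf")` would return a ggplot object showing the empirical CDF, assuming ggplot2 is installed.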