Add clonal family viz attributes

metasoarous commented 6 years ago

Want to be able to make some additional clonal family statistics available for visualization encodings. To be clear, a lot of the work here is in cft, but the corresponding output data will also have to be piped in here. (:postal_horn: from Horn paper)

[x] clonal family size
[x] CDR3 length
[x] mean mut_freq
[ ] mut freq quantiles
[ ] skewed site frequency spectrum :postal_horn:
[ ] Fay-Wu (measure of SFS/positive selection)
[ ] shm variance
[ ] Phylogenetic Diversity (PD) of minadcl asr trees
- [ ] also of FastTrees?
- [ ] under rarefaction?
[x] Gene usage - we already have this, but could maybe make it possible to color or shape or facet by?
[ ] pattern of hydrophobicity
[ ] LONR score (imbalanced tree structure + amino acid change)
[ ] persistence of clonal families through time :postal_horn:
[x] local branching rate :postal_horn:
[ ] Other?

eharkins commented 6 years ago

Just to clarify, these would result in more options for the x and y axis dropdowns in the scatterplot/ initial visualization, correct?

metasoarous commented 6 years ago

Yes, as well as size, color and potentially even shape.
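
For illustration only, here is a minimal sketch of how per-family fields could be mapped onto those channels. It assumes a Vega-Lite-style spec; Olmsted's actual spec may look different. The function `scatter_spec` and the field names `n_seqs` and `cdr3_length` are hypothetical placeholders, while `mean_mut_freq` and `v_gene` are the names used in this thread.

```python
import json

# Hypothetical per-family records. Apart from mean_mut_freq and v_gene,
# the field names are placeholders, not Olmsted's actual schema.
families = [
    {"n_seqs": 120, "cdr3_length": 45, "mean_mut_freq": 0.08, "v_gene": "IGHV1-2"},
    {"n_seqs": 35, "cdr3_length": 51, "mean_mut_freq": 0.03, "v_gene": "IGHV3-23"},
]


def scatter_spec(data, x, y, size=None, color=None, shape=None):
    """Build a Vega-Lite-style point spec; optional channels are added only when set."""
    encoding = {
        "x": {"field": x, "type": "quantitative"},
        "y": {"field": y, "type": "quantitative"},
    }
    if size is not None:
        encoding["size"] = {"field": size, "type": "quantitative"}
    if color is not None:
        encoding["color"] = {"field": color, "type": "nominal"}
    if shape is not None:
        encoding["shape"] = {"field": shape, "type": "nominal"}
    return {"mark": "point", "data": {"values": data}, "encoding": encoding}


spec = scatter_spec(families, x="n_seqs", y="mean_mut_freq", color="v_gene")
print(json.dumps(spec["encoding"], indent=2))
```

Each new statistic then only needs to show up as a field on the family records to become selectable for any channel.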

eharkins commented 6 years ago

There is a list of statistics that are candidates for this. We should consult Duncan and then determine which are v1 priorities.

metasoarous commented 6 years ago

I believe the lists are from matsengrp/cft#188, specifically:

@psathyrella Anything else we should add to this list?

metasoarous commented 6 years ago

I updated the issue text with the items from the above lists.

eharkins commented 5 years ago

Would love to get started on this. @metasoarous, are you available to chat at some point about which of these statistics are already available to cft from earlier steps in the pipeline, and how to start outputting them (if we are not already) so that we can query for them in the Olmsted data script? That is, as opposed to the ones Duncan may still be working on, or that we are not calculating at all yet.

metasoarous commented 5 years ago

We already have all the data we would need for gene usage (v_gene, j_gene, etc). We could use that for any of x/y, color, shape or facet.
One of us could pretty easily write a script to compute PD from a tree, and have that plugged into cft.
If you look at where mean_mut_freq is computed, we could easily add the other mut freq quantiles and variance computations there.
I know Duncan is working on local branching rate and maybe also LONR. Anything else, @psathyrella? Please let Eli know if he can help with anything.

That's what I know right now. Everything else would have to be looked at a little more deeply and implemented into CFT.
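
As a rough sketch of the quantile and variance additions mentioned above, using only the standard library: the function name `mut_freq_summary` and the output keys other than `mean_mut_freq` are hypothetical, and this is not the existing cft code.

```python
import statistics


def mut_freq_summary(mut_freqs):
    """Summarize per-sequence mutation frequencies for one clonal family."""
    if len(mut_freqs) < 2:
        # statistics.quantiles may reject fewer than two points,
        # so a single-sequence family reports its one value everywhere.
        q1 = median = q3 = mut_freqs[0]
    else:
        q1, median, q3 = statistics.quantiles(mut_freqs, n=4, method="inclusive")
    return {
        "mean_mut_freq": statistics.mean(mut_freqs),
        "mut_freq_q1": q1,
        "mut_freq_median": median,
        "mut_freq_q3": q3,
        # Population variance: the family's sequences are treated as the whole set.
        "mut_freq_variance": statistics.pvariance(mut_freqs),
    }


print(mut_freq_summary([0.0, 0.1, 0.2, 0.3, 0.4]))
```

Whether population or sample variance is the right choice here is a judgment call; swapping `pvariance` for `variance` changes only that one line.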

psathyrella commented 5 years ago

oops, sorry, not good with github notifications

My priority would be [edited on oct 23]:

shm quantiles
local branching index
local branching ratio
affinity (yeah, on data we will generally have this for only a small number of sequences, but it provides the entire reason for thinking lbi/lbr are useful, so we need it for validation)

Also, generally to be useful, lbi + affinity and lbr + affinity kind of need to be on the same tree, so we can see how well each is correlating with affinity.

@lauradoepker 's opinion probably matters as much or more, though.
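
One way to make the "how well does lbi or lbr correlate with affinity on the same tree" check concrete is a rank correlation over the nodes that have a measured affinity. This is only a sketch under assumptions: the node dictionaries, the `affinity` and `lbi` keys, and the choice of Spearman correlation are all illustrative, not something partis or cft is known to do this way.

```python
def ranks(values):
    """Return 1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; returns None when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5


def spearman_with_affinity(nodes, metric):
    """Spearman correlation of `metric` vs. affinity over ONE tree's measured nodes."""
    measured = [node for node in nodes if node.get("affinity") is not None]
    if len(measured) < 3:
        return None  # too few measured nodes to say anything
    metric_ranks = ranks([node[metric] for node in measured])
    affinity_ranks = ranks([node["affinity"] for node in measured])
    return pearson(metric_ranks, affinity_ranks)


# Made-up nodes from a single tree; the last one has no affinity measurement.
tree_nodes = [
    {"lbi": 0.1, "affinity": 1.0},
    {"lbi": 0.5, "affinity": 5.0},
    {"lbi": 0.9, "affinity": 9.0},
    {"lbi": 0.3, "affinity": None},
]
print(spearman_with_affinity(tree_nodes, "lbi"))
```

Nodes from different trees are deliberately never pooled, since the metrics are only comparable within a tree.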

metasoarous commented 5 years ago

@psathyrella I know you've been working on the local branching index, but which of these other metrics here have existing (& open) implementations we might be able to use? Also, I wasn't able to immediately locate the Horn paper. Would you please point me to that?

Regarding PD specifically, it looks like scikit-bio can do this, though the API looks a little weird, and it is py3 only (http://scikit-bio.org/docs/0.4.1/diversity.html). Some Googling also brought up this implementation (http://www.cibiv.at/software/pda/), but I thought I'd check to see if you had any suggestions here. CC @matsen?

psathyrella commented 5 years ago

To directly answer your question, nothing really has existing usable implementations, but that's ok since what we want now we have implemented in partis, and what we might later want we'd be making from scratch.

I'm editing my previous comment so it's not confusing, but the update is: 1) lonr has enough problems that it's not really usable as-is, but I figured out a way to combine the lonr and lbi ideas that does what we want, so 2) we really just want lbi, lbr (ratio), and affinity now. The only other things I think could be called metrics, in that they're not just trivial numbers like shm and persistence through time, are things like fay-wu and SFS (which, ok, are both implemented in partis). More to the point, I think they're not high priorities at the moment since they're hard to interpret: they need a baseline that everyone seems to disagree on.

I forget offhand which horns paper, but it's one of these two.

Everything I know about PD is from just reading the wikipedia page, but it sounds awfully similar to lbi and lbr, and I'm reluctant to add a metric that correlates strongly with other metrics we have, and for which [i'm guessing here] we don't have a strong biological interpretation? My feelings on this are colored by having spent so much time in the last year looking at fay-wu and SFS plots, and in the end never really knowing whether they meant anything in our context or not.

metasoarous commented 5 years ago

This is very helpful; Thank you.

I didn't realize you'd implemented a bunch of these in partis already, so that's great. Where can I find this functionality?

My understanding is that lbi and lbr are per node metrics within the tree, is that right? Are you also computing aggregates thereof?

I guess it's been a while since I thought about PD... Normally it doesn't make sense to think about PD of a tree by itself, because it's basically just the sum of the branch lengths in the tree, and this metric is obviously sensitive to sampling depth. What you can do though, when you have samples with different sizes, is compute PD under rarefaction, that is, compute it for a bunch of random downsamplings in order to compare on more equal footing. In our case though, where most clusters we care about are rather big and we've already downsampled to 100 sequences, we don't really have to worry about this so much, and I guess I can just add a simple sum of branch lengths as a measure of how much evolutionary history is represented in the tree. I forgot that the metric was literally that simple, because of the complexity typically associated with computing it under rarefaction. Anyway...
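
To make the two variants concrete, here is a small sketch of PD as a plain sum of branch lengths, plus a rarefied version that averages PD over random tip subsamples. Everything here is illustrative: the toy tree, the child-to-parent representation, and the function names are made up, and the rarefied version assumes the path back to the root is counted for each sampled tip.

```python
import random

# Toy rooted tree as child -> (parent, branch_length); "root" has no entry.
# Node names and lengths are invented for illustration.
TREE = {
    "n1": ("root", 0.10),
    "n2": ("root", 0.20),
    "a": ("n1", 0.05),
    "b": ("n1", 0.07),
    "c": ("n2", 0.02),
    "d": ("n2", 0.03),
}


def leaves(tree):
    """Nodes that never appear as anyone's parent."""
    parents = {parent for parent, _ in tree.values()}
    return sorted(node for node in tree if node not in parents)


def phylogenetic_diversity(tree, tips=None):
    """Sum branch lengths on the paths from the chosen tips up to the root."""
    tips = leaves(tree) if tips is None else tips
    used = set()
    for tip in tips:
        node = tip
        # Stop early once we reach a branch that is already counted:
        # all of its ancestors were counted on an earlier walk.
        while node in tree and node not in used:
            used.add(node)
            node = tree[node][0]
    return sum(tree[node][1] for node in used)


def rarefied_pd(tree, depth, n_draws=100, seed=0):
    """Mean PD over random subsamples of `depth` tips."""
    rng = random.Random(seed)
    tips = leaves(tree)
    draws = [phylogenetic_diversity(tree, rng.sample(tips, depth)) for _ in range(n_draws)]
    return sum(draws) / n_draws


print(phylogenetic_diversity(TREE))  # all tips: the plain sum of branch lengths
print(rarefied_pd(TREE, depth=2))    # average over random pairs of tips
```

With every tip included, the first function reduces to summing all branch lengths, which is the simple version described above.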

psathyrella commented 5 years ago

Yeah these are what I was adding to the output with my cft fork. Checked in now, there's this and this, the former of which is just a reminder to me of where to put it to run on fasttrees, the latter is how things are done now -- you run another partis process, passing in a tree, and it adds the tree metrics to the output file. I was imagining I'd pull just the tree metrics out of the output file and add those to cft's output with ingest=True, but that's a trivial change.

No, they only really make sense for comparing nodes within a tree. If you're short on lineage-wide metrics for testing or something, the shm stuff or fay-wu/SFS could be used as well.

Yeah, I could see PD being useful. I don't mean to sound like we should absolutely minimize our metrics, just that if they're a lot of work (can't tell if it is) we should probably have a more concrete idea of what we'd use it for, and that they should be easily removable/addable, i.e. the code for one doesn't depend on another.

edit: it's quite likely that I'm just ignorant of what we'd use PD for, rather than that it isn't useful.

metasoarous commented 5 years ago

Great! I'll merge that work in, and sort out the ingestion.

We are sort of short on lineage-wide metrics, so throwing in fay-wu/SFS would be great. I'm already doing mean aggregation on SHM somewhere, so should be able to add quartiles easily enough.

Definitely not a lot of work to throw in PD if we're not worrying about rarefaction, and it's plausible enough that it could be a useful signal in sussing out interesting clusters that I think it's worth adding. If PD ends up being useful, we can look at adding rarefaction, but that would be an issue for another day.

eharkins commented 5 years ago

We are now tracking the addition of per-sequence metrics, including LBI, LBR, and affinity, in #106.

matsengrp / olmsted

Add clonal family viz attributes #31