Add indel tree analysis to pipeline?

eharkins commented 5 years ago

In cases like #279 and #285 we are:

determining the set of sequences in the cluster that have the indel identified in the seed
seeing where those sequences cluster among the broader cluster tree (FastTree and ML tree)
using this information to decide whether we think the indel inference is legit.

My way of doing this is semi-automated in that I am using shell scripts to run bin/annotate_fasttree.py using various ingredients from different runs (indel matching vs not)

As @lauradoepker pointed out today, it seems possible to fully automate this analysis when running cft using --match-indels-in-uid. Is this something we want to do?

Maybe some questions that would help answer the above are:

Do we anticipate doing this every time an indel is discovered in a seed (sounds like probably)?

Maybe more importantly, what are the possible courses of action we would take based on the results of an analysis like this? e.g. continue analysis ignoring the indel (using indel reversed sequences) vs some alternative?

In the case that we don't add this to the pipeline, I will at least want to commit my script for such an analysis to this repo, so that we can track that as a recognized way to analyze cft output.

psathyrella commented 5 years ago

Correct me if this is included in one of those already, but another important piece is if the potentially-indeld sequences align well to different genes if we don't allow them to have indels.

eharkins commented 5 years ago

Nope, I forgot that, thank you @psathyrella! Though this is probably not as likely to be part of cft pipeline, it is still of great importance

eharkins commented 5 years ago

In the case of analyzing a particular indel, it sounds like @lauradoepker mostly cares about having the contextual analysis of seeing which sequences in the lineage have a particular indel (where do we think the indel was introduced).

In #284 we added the ability to --match-indels-in-uid which actually dynamically modifies the annotation and just proceeds with the CFT analyses with only those seqs matching the indel for a given sequence. We also added the ability to not check for indels in a seed (--ignore-seed-indels) with the idea being that we run once on the same seed with each of --ignore-seed-indels, and --match-indels-in-uid and use the two ouputs to create graphics showing the contextual analysis I describe above which currently looks like this (where red are matches from --match-indels-in-uid)

If we don't gain anything from having the two runs described above as separate outputs and we just want to see the see the indel of interest in the context of other clonal sequences for each of

FastTrees
ML trees

We can

extend existing FastTree options
show this on Olmsted by adding some metadata to sequences that have the indel. (or output another png if that is preferred, but it seems like we will want the option to look more closely at this lineage, and so Olmsted seems like an appropriate place to do so.

matsen commented 5 years ago

I'm a little confused about splitting the two sequence sets apart and then putting them back together. Why not just run prune.py on the entire set and then use the indel matching to color things?

(We can chat IRL tomorrow...)

eharkins commented 5 years ago

I think that is basically what I propose and what Laura probably wanted in the first place, but I misinterpreted it as needing to run cft separately for the indel containing subset vs the entire cluster.

eharkins commented 5 years ago

@lauradoepker said:

Both your confusion and our new plan are “right” in that, yes, we agreed to run CFT on both the indel filtered and unfiltered sequences because, ultimately, that’s how linearham would be run, most likely. However, in this context, since we are already 90% done with wet lab work that was run based on unfiltered sequences going through linearham, all I needed to know what where/when did the insertion happen and the red-node-labeled ML tree helped me do that pretty quickly: it happened quite early. So - this issue can just simmer until I’m back to answer questions, if that’s easiest.

One of the only curious problems we’re seeing right now is that the indel filtered sequence set accidentally pruned our seed out, which can’t happen.

I said:

Sounds good. So do we want to

preserve the ability to run on just indel containing sequences for the sake of running linearham that way and then also

add in the automated fasttree + ml tree analysis I was talking about in the issue?

I will prioritize other things for now but this is definitely still important so that next time this happens I don't write one-off scripts again

@lauradoepker said:

Yes and yes. Great

I said:

Regarding "indel filtered sequence set accidentally pruned our seed out, which can’t happen.", this is a direct effect of adding _indel_filtered to sequence IDs. There will probably be other weird things like this if we continue to do it this way, so we need to think about whether it's worth making the changes to remove that tag in various places when comparing, or whether there is a better way to handle it in the code and still allow you to have that tag added when you download a fasta (esp if you are only getting these from Olmsted, we could just tack it on there)

TL;DR I am prioritizing developing other cft issues, but it sounds like we don't need to run two separate CFT runs. For the purpose of analyzing indels, CFT's should be used to highlight the indel-ed seqs in the tree(s) and to generate any clonal family fasta we may need to run via linearham (see #276 for how we plan to do this in the future).

metasoarous commented 5 years ago

Ideally, CFT should be handling this by tracking any changes in seed name due to indel filtering logic. That way, when minadcl and/or seed lineage pruning take place, they are operating under the correct assumptions about which sequences they should try to preserve in the output (namely, the seed sequences). Obviously, if there are other priorities at the moment, this can wait, but if this functionality stays in the pipeline, we should fix it, at least eventually.

eharkins commented 5 years ago

@metasoarous I think we have agreed to not change sequence IDs for sanity and to still satisfy Laura's wish to know about indel containing / reversed sequences by attaching this information as metadata to a cluster so that files downloaded from Olmsted will be able to have names reflecting their indel status.

eharkins commented 4 years ago

Closed in #309

matsengrp / cft

Add indel tree analysis to pipeline? #286