Identify leaf sequences that have highest sequence identity to each internal node (i.e. lineage member)

lauradoepker commented 7 years ago

Vladimir Vigdorovich suggested the following in a conversation with @psathyrella, @metasoarous, and me.

To validate that our reconstructed ancestral lineages are accurate, identify leaf sequences that are highly similar in sequence to each node within the lineage. Therefore, we can say we are fairly confident in our lineage reconstruction because we actually sequenced some of these ancestors. Or not 😣 .

I expect this would look something like: 1) Pull out each node sequence, in turn 2) Compare each node sequence to every other sequence in this seed's partis partition (NOT the pruned/trimmed sequence set) 3) Pull out the top three leaf sequences that have the closest sequence identity 4) Repeat for every node 5) (??) Display these findings by grouping them by internal node - i.e. I could click on a node within the lineage on CFT and it would show me the top three leaf sequences even if they aren't part of the tree because they were pruned out.

matsen commented 7 years ago

If we are using a pruning strategy we should be getting all of the sequences that are close to the seed lineage. If we are using trimming that would be another story, but if we are curious about sequences close to the seed lineage why wouldn't we be using pruning?

lauradoepker commented 7 years ago

Good point @matsen . We can just identify the leaf sequences in the tree that are closest to each node. This could be done manually, I suppose, by just looking at the tree. We could either mark them with some color/symbol that identifies them as node-similar leaves or generate a list like I originally suggested.

metasoarous commented 7 years ago

I think with #187, this should become more visually obvious as well.

I could possible highlight or annotate the node visually somehow. But this may be easier to solve when we are on auspice ( #99 ), and it's something we'll ultimately have to "re-solve" there anyway. So if you're satisfied with visual identification for now (post #187) @lauranoges, perhaps we can wait for #99?

lauradoepker commented 7 years ago

Sure, yes, that'll do for now @metasoarous !

metasoarous commented 5 years ago

Issue moved to matsengrp/olmsted #135 via ZenHub

metasoarous commented 5 years ago

Actually... realizing that there may still be some work to do on the CFT side here, since you would probably want to look through sequences that had gotten pruned out, and so that would have to be processed in CFT.

matsen commented 5 years ago

Planning to just use sequence identity. Can just use http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc87 or BLAST.

lauradoepker commented 5 years ago

My current process that was used for the QA255 project was: Alignment of inferred intermediates + custom BLAST search of clonal family for NGS seqs close to all inferred intermediates, naive, and seed sequence (i.e. the seed lineage). Criteria for an NGS sequence being “close enough” to seed lineage to be included in alignment = closest sequence to each individual inferred lineage member as long as it’s >90% identical

eharkins commented 5 years ago

As Chris pointed out, we would want to consider any sampled sequence (perhaps even beyond the partis cluster, though ideally sampled sequences closely matching inferred ancestors would have clustered with those sequences on which the ancestor was inferred).

This seems like something we can do in CFT, but is a non-trivial additional step. @lauradoepker should decide priority.

eharkins commented 5 years ago

Regarding my last comment, @lauradoepker just cares about checking for similarity of sequences within cluster.

In terms of doing this, we discussed taking the top N similar sequences in the cluster for a given inferred ancestor, since this experiments only allow for testing so many different options and a % similarity cutoff would potentially include vastly different numbers of sequences for different inferred ancestors.

@lauradoepker Should we report the % similarity anyway, so you still have an idea for how similar the top N are or will you eventually end up loading them up in Geneious?

lauradoepker commented 5 years ago

Thanks @eharkins. Reporting percent would be great and would allow me to determine if they're even worth pulling to begin with. Though, we will load them up in Geneious anyway.

eharkins commented 4 years ago

Today we talked about what to call these sampled sequences that are close to inferred ancestors, we decided on something like near_ancestors. However, I might just go with sampled_ancestors. While it's true that more often than not, we will probably end up not having sampled the exact sequence of an inferred ancestor and so will be reporting sampled sequences close to an inferred ancestor, I think the term "sampled" gets more at the driving question behind looking for such sequences: we are hoping to find that we have sampled an inferred ancestor (or something close to it). cc @lauradoepker @matsen

metasoarous commented 4 years ago

I think you want descendants. I'd also argue for including nearest even if you also include sampled, since you want to clarify that you're not just interested in all sampled descendants, but specifically the nearest. So maybe nearest_sampled_descendants?

eharkins commented 4 years ago

Thanks for the feeback @metasoarous! I think descendants is less specific than ancestors since what we are looking to see is whether we have sampled the ancestors (based on our inferred definition of each of these ancestors) of a given set of sampled clonal family members. Though maybe I am misinterpreting what you are getting at with descendants (of who/what)?

I'm also confused by your use of the word nearest here

you're not just interested in all sampled descendants, but specifically the nearest.

My understanding was that near meant close in sequence identity to an inferred ancestor. So if we did sample any inferred ancestors exactly, none of them is more near than any other to their corresponding inferred sequence. However, it seems like often times

we did sample any inferred ancestors exactly

wont be the case, so we care about whether we sampled something near (in sequence identity) to an inferred ancestor. But I think this is a nuance of our goal vs our reality/implementation, hence my reason for omitting this word altogether:

While it's true that more often than not, we will probably end up not having sampled the exact sequence of an inferred ancestor and so will be reporting sampled sequences close to an inferred ancestor, I think the term "sampled" gets more at the driving question behind looking for such sequences: we are hoping to find that we have sampled an inferred ancestor (or something close to it)

metasoarous commented 4 years ago

I think descendants is less specific than ancestors

It's not less specific; In fact, it's a direct inversion of the relation: Descendant(A, B) <=> Ancestor(B, A). But that's exactly to my point. If this is a property of the putative (inferred) ancestor, then you want to be talking about what descends from that ancestor.

Hopefully this also clarifies why I think including sampled is more precise. An internal node might have other internal nodes as descendants, but we're not really as interested in them as we are in those descendants which have actually been sampled (i.e., the leaves).

Speaking of which, sticking with leaves would perhaps be fine in lieu of sampled_descendants, since it implies more or less the same thing (though perhaps not as directly). And maybe even better if we want to allow that one of our nearest_leaves be a "close cousin" (if you follow my meaning).

I don't care too much about near vs nearest, but again, it clarifies that a deep internal node could have lots of descendant leaves, so we'd want to be clear that we're only interested in the ones that are nearby in sequence similarity.

eharkins commented 4 years ago

^ Prioritizing implementation over naming for now.

Right now I have a script that takes two fastas as input:

Sequences from which to create a local BLAST database (sampled cluster sequences in our case)
Sequences with which to query said database (inferred ancestors in our case)

It outputs a TSV of all the matches for each query sequence, including percent identity, evalue, length, etc.

Two use cases I am aware of are: Querying among sampled cluster sequences to search for close matches to:

CFT (dnaml / soon to be raxml-ng) inferred ancestors
Linearham inferred ancestors

@lauradoepker and I just talked about how what she really cares about is the latter. In that context, she would like a fasta output for each inferred ancestor, aligning it with all of its BLAST matches (can call this "blast-matches-alignment" for reference).

So I can add this as an option to the script (e.g. --write-blast-alignments) and right now the plan is for Laura and Mackenzie and anyone else working with Linearham output to be able to run this script and generate one "blast-matches-alignment" for each inferred ancestor from Linearham output; they would need to curate the query sequences (inferred ancestors) since in Linearham's context, the set of inferred ancestors depends on interpreting the lineage uncertainty.

In the CFT context, we have up to ~100 inferred ancestors for each clonal family so by default we should probably not run this blast script in CFT with --write-blast-alignments so we don't end up with up to 100 "blast-matches-alignment" files for each cluster. For now I will just output the BLAST results TSV for each clonal family in CFT, and we can build in download to Olmsted later if we end up wanting "blast-matches-alignment" files for CFT inferred ancestors often enough.

lauradoepker commented 4 years ago

@eharkins , this output looks great when I check manually. That being said, it's already really tedious and time-consuming to be checking manually because my lineages have so many members and there are so many similar NGS "hits" for them (yay!).

Thus, I'd like to request that we finish this issue by creating a single output csv that pulls the following into columns: 1) query name 2) query nt sequence 3) highest identity hit name 4) highest identity hit sequence 5) highest identity %ID (nt) 6) all other columns in your current tsv (alignment length | mismatches | gap opens | q. start | q. end | s. start | s. end | evalue | bit score) 7-10) repeat (3-6) for second highest identity hit

Can you do this both per-lineage (i.e. all queries per seed and output it into the per-seed output directories) and also concatenate this for the whole project (i.e. all seeds together in one document and output this into the main output directory)?

eharkins commented 4 years ago

@lauradoepker Would you accept having just the top two hits in two separate rows for each query instead of sticking them both into a single row? You could still see which one was top hit by checking %id and it would make more sense in terms of reading the file programmatically (which we do later to create an alignment of the query against its hits) where each row describes a single hit.

lauradoepker commented 4 years ago

Absolutely @eharkins. That sounds great.

eharkins commented 4 years ago

https://github.com/matsengrp/cft/pull/296 achieves this and can be run outside of CFT via https://gist.github.com/eharkins/d2695a858425a474235298aac853a97e

matsengrp / cft

Identify leaf sequences that have highest sequence identity to each internal node (i.e. lineage member) #186