matsengrp / ecgtheow

Ancestral lineage reconstruction using BEAST or RevBayes
2 stars 2 forks source link

naive0 missing from graphic output #12

Open lauradoepker opened 6 years ago

lauradoepker commented 6 years ago

From slack: @dunleavy005 I generated a tree (graphic) that had no naive sequence at the top, so I’m suspicious that the prune.py script is not working properly to keep the naive0 sequence. @csmall just mentioned this as a potential problem in our CFT work, so maybe it’s a problem here? My question for you is: does the prune.py script get updated regularly and/or have there been any changes?

image

this was from laura-mb dataset

I think the problem we’re experiencing with the current output is that not all the lineage pathways are beginning from naive0. I reduced the filter to 50 instead of 100 and got this graphic:

bf520 1-igh family_0 healthy seedpruned 100 aa_lineage_graph

@dunleavy005 where do we specify that all the lineages of interest should start from --naive and progress to our --seed?

dunleavy005 commented 6 years ago

"where do we specify that all the lineages of interest shouls start from --naive and progress to our --seed?"

We don't actually, looking at https://github.com/matsengrp/ecgtheow/blob/master/python/trees_to_counted_ancestors.py#L149, we only plot individual edges that have counts >= filter threshold. Since our analysis tabulates edges individually, there isn't any "requirement" that the naive should be the root node in this plot. Many times, it's happened, but it isn't required to. Did you see what it looks like when filter=0? We should have the number of edges emanating from naive0 to equal num_trees.

dunleavy005 commented 6 years ago

WRT to prune.py, I'm haven't touched it ever. https://github.com/matsengrp/cft/blob/master/bin/prune.py#L164 seems to suggest the naive/seed are always included in the pruning, so I'm not sure if Chris has found a bug related to this.

lauradoepker commented 6 years ago

Yes, after doing some ground work, prune.py isn’t the issue but it's probably what you said: that naive0 can escape our final graphic if it doesn’t meet the filter threshold.

lauradoepker commented 6 years ago

Interesting: This graphic ran with filter = 10 ./python/trees_to_counted_ancestors.py --seed BF520.1-igh --burnin 1000 --filter 10 ./runs/BF520.1-igh.family_0.healthy.seedpruned.100.trees data/BF520.1-igh.family_0.healthy.seedpruned.100.fasta shows the naive0 in the little white circle on the upper right while other lineages don't involve the naive0 sequence. @dunleavy005 @matsen . My gut reaction is to question whether or not we have inferred the correct naive0 sequence in this particular partis run. What are you two thinking?

bf520 1-igh family_0 healthy seedpruned 100 aa_lineage_graph

matsen commented 6 years ago

If you stare at the sequence alignment does it look reasonable?

These graphs are highly processed results-- there are many other crazy things that could be happening eg the rogue taxon stuff.

lauradoepker commented 6 years ago

Yeah it looks like the naive could be an inferred allele with a D (in red)... but since it wasn't inferred that way maybe it got left out.

screen shot 2017-10-12 at 4 32 25 pm

lauradoepker commented 6 years ago

@dunleavy005 Shouldn't all our 10,000 BEAST trees be forced to be rooted on naive0 (and ultimately end with our given seed)? Is BEAST currently constructing unrooted trees from the pruned sequence list?

dunleavy005 commented 6 years ago

No, I'm pretty sure BEAST is estimating rooted trees (with an outgroup of naive0). But yes, there is always a path from naive to seed, so in theory the naive0 node should have a total of 10000 transitions accounted for in the graph when filter=0. I can't really see anything in the picture, can you see this?

lauradoepker commented 6 years ago

By my logic, I think naive0 should be transitioning to inferred_5_3135 here but it's apparently not.

bf520 1-igh family_0 healthy seedpruned 100 aa_lineage_graph

image

dunleavy005 commented 6 years ago

Just want to summarize my findings here for @matsen:

cnt = 0
naive_edge_c = Counter()
for ((a,b), count) in edge_c.most_common(None):
    if a != b and (a in special_seqs and special_seqs[a] == "naive0"):
        naive_edge_c.update([count])
        cnt = cnt + 1

This snippet looks across all the (parent, child) edges in the Counter object, and computes the number of unique naive0--->X transitions (cnt) and determines the discrete distribution of the edge counts for each unique transition (naive_edge_c).

In total, there were 7924 unique naive transitions. The associated transition counts are distributed as follows:

Counter({1: 7726, 2: 78, 3: 19, 23: 8, 4: 7, 15: 5, 16: 5, 19: 5, 20: 5, 5: 4, 11: 4, 13: 4, 17: 4, 18: 4, 21: 4, 25: 4, 31: 4, 22: 3, 28: 3, 6: 2, 8: 2, 14: 2, 24: 2, 26: 2, 36: 2, 9: 1, 12: 1, 27: 1, 30: 1, 32: 1, 34: 1, 37: 1, 40: 1, 41: 1, 45: 1, 52: 1, 54: 1, 60: 1, 61: 1, 73: 1, 100: 1})

Clearly, most of these transitions appeared just once, which is why Laura's images don't have any/many naive0 transitions; out of 10,000 trees, the most confident naive0 transition was observed 100 times. I was thinking this uncertainty was due to (relatively) long root branch lengths and this may be the case.

"Good" trees (from other runs): image image

Randomly sampled tree (this run): image

Definitely, some weird stuff going on here. I feel the problem may be model/data related and not an issue with ecgtheow post-processing.

matsen commented 6 years ago

Thank you, @dunleavy005 ! Is this the most recent 125 set, https://matsengrp.slack.com/files/U34B9Q3HV/F7HR6SLSJ/bf520.1-igh.family_0.healthy.seedpruned.125.fasta ? If so, I can share my RevBayes results (perhaps after running a little longer).

lauradoepker commented 6 years ago

^I believe @dunleavy005 was using recent 100 sets -- I've used our default nprune value lately.

matsen commented 6 years ago

So, to be clear, this is /home/matsengrp/working/matsen/ecgtheow/BF520.1-igh.family_0.healthy.seedpruned.100.fasta, @dunleavy005 ?

dunleavy005 commented 6 years ago

Definitely, BF520.1-igh.family_0.healthy.seedpruned.100.fasta from laura_mb data, but don't know much else. @lauranoges care to comment?

lauradoepker commented 6 years ago

This is from my own home folder, so you don't have access. The difference is that it's the latest runs (20170929) on the latest partis versions - which is potentially different from what you're running on (20170822). I've made a copy of this in your folder, with 20170929 appended to the filename @matsen

matsen commented 6 years ago

OK, thanks! Can we have a new rule, though? Anytime we post an important analysis, we post a path to where the sequence data lives. That way we know what we're talking about.

This is already sort of a mess, in that you are copying over a file at a time. Can we have the sequence data live in some shared location, such as nicely organized subdirectories of /home/matsengrp/working/lnoges?

lauradoepker commented 6 years ago

Sure, I'll create new runs there now as I have to redo them anyway on Duncan's non-downsampled datasets.

matsen commented 6 years ago

Thanks, I now see /home/matsengrp/working/lnoges/ecgtheow/data/laura-mb-2_v4_IgG_merged_20171021/BF520.1-igh.family_0.healthy.seedpruned.100.fasta which I will assume is the latest fasta file.

matsen commented 6 years ago

Er, it's actually different than the one you put in my directory, so I'll use the one you put in until I hear otherwise.

matsen commented 6 years ago

In a RevBayes consensus tree, dg-321335-1 comes out with a huge long branch. I think it's our rogue.

screenshot 2017-10-22 at 7 41 05 am

So I'm running without it.