Cause of huge number of low-probability nodes

matsengrp / linearham

A Bayesian Phylo-HMM for B cell receptor sequence analysis

http://matsengrp.github.io/linearham


Cause of huge number of low-probability nodes #75

Closed by psathyrella 2 years ago

psathyrella commented 4 years ago

@dunleavy005

@mmshipley and I are wondering whether it's expected that reducing the size of a family, e.g. by tightening the clustering definitions, would result in a dramatically larger number of very low probability nodes in linearham's output (and if not, what would cause it). I think the numbers are going from roughly 50-100 seqs to 10-20 seqs, and the 10-20 are all going to be in the same sublineage that we care about, which is why I wouldn't expect the results to change all that much.

This is an example of what things look like with the smaller cluster (whereas they look fine with the larger cluster):

(This is only with one of the filtering settings, but none of the other settings really improve things -- the tighter ones give nodes that aren't connected to the rest of the ASR.)

dunleavy005 commented 4 years ago

My guess (without knowing anything about the family) is that the ASR is more uncertain due to the data loss and is just confused about substitution order. Looking at the image, I do see a somewhat certain path. Does that path exist in the larger family graph too?
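One rough way to check this could be to pull out the highest-probability path from each graph and compare them. The sketch below is not linearham code: it assumes each graph is available as a dict mapping `(parent, child)` pairs to edge probabilities, that node labels (e.g. sequences) are comparable between the two runs, and the names `most_probable_path`, `small` and `large` are made up for illustration. It follows the best outgoing edge greedily, which is not guaranteed to find the globally most probable path.

```python
def most_probable_path(edges, root):
    """Greedily follow the highest-probability outgoing edge from root.

    edges: dict mapping (parent, child) -> probability (hypothetical format).
    """
    # For every parent, remember its single most probable child.
    best_child = {}
    for (parent, child), prob in edges.items():
        if parent not in best_child or prob > best_child[parent][1]:
            best_child[parent] = (child, prob)

    path = [root]
    seen = {root}
    while path[-1] in best_child:
        nxt = best_child[path[-1]][0]
        if nxt in seen:  # guard against cycles in malformed input
            break
        path.append(nxt)
        seen.add(nxt)
    return path


# Toy graphs: the small family has extra low-probability edges.
small = {("naive", "A"): 0.6, ("naive", "X"): 0.05, ("A", "B"): 0.5, ("A", "Y"): 0.04}
large = {("naive", "A"): 0.9, ("A", "B"): 0.8}

same_main_path = most_probable_path(small, "naive") == most_probable_path(large, "naive")
```

If `same_main_path` is true for the real graphs, that would support the idea that the main result is unchanged and only the low-probability cruft differs.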

psathyrella commented 4 years ago

Hmm, that sounds promising. @mmshipley does it seem possible that the main results are pretty much the same and there's just more low probability cruft? Maybe we can treat this as more of a visualization problem, e.g. by not displaying nodes below a certain probability (well, I'm assuming that's what the filtering settings do, in which case I guess I'm suggesting we make the filtering settings smarter).

psathyrella commented 4 years ago

She says yes, it seems likely that the main results end up the same. But I think it's still a real problem if it's easy to get diagrams like that ^ out, since most people will probably have the same response and ignore it and/or assume something's wrong (as happened here), so it's probably worth working on the filtering settings.

dunleavy005 commented 4 years ago

Yes, that makes sense; the filtering settings aren't perfect. FYI, I basically filter on edge probabilities (i.e. keep every edge with prob >= CUTOFF). Maybe adding some node logic would improve it.
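As a minimal sketch of what "edge cutoff plus some node logic" might look like (this is an illustration, not linearham's actual filtering code): the first step keeps every edge with probability >= the cutoff, as described above, and the second step, which is an assumption about what useful node logic could be, drops anything no longer reachable from the root so that no disconnected nodes get drawn. The function name, the `(parent, child) -> probability` dict format and the toy values are all hypothetical.

```python
from collections import defaultdict, deque


def filter_lineage_graph(edges, root, cutoff):
    """Keep edges with prob >= cutoff, then drop parts unreachable from root.

    edges: dict mapping (parent, child) -> probability (hypothetical format).
    """
    # Step 1: edge filter (keep all edges with prob >= cutoff).
    kept = {edge: prob for edge, prob in edges.items() if prob >= cutoff}

    # Step 2 (assumed node logic): breadth-first search from the root
    # over the surviving edges to find the connected part of the graph.
    children = defaultdict(list)
    for parent, child in kept:
        children[parent].append(child)

    reachable = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in reachable:
                reachable.add(child)
                queue.append(child)

    # An edge survives only if its parent is still connected to the root.
    return {(u, v): p for (u, v), p in kept.items() if u in reachable}


# Toy example: ("D", "E") passes the cutoff but is cut off from the root.
edges = {("naive", "A"): 0.9, ("A", "B"): 0.8, ("naive", "C"): 0.02, ("D", "E"): 0.5}
filtered = filter_lineage_graph(edges, root="naive", cutoff=0.1)
```

With the toy values, `filtered` contains only the `naive -> A -> B` edges: the 0.02 edge fails the cutoff, and `D -> E` is removed because nothing above the cutoff links it to the root.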

psathyrella commented 4 years ago

Great. And thanks again for responding promptly, this is really helpful. It looks like Jared will work on this in a month or so.

psathyrella commented 2 years ago

It seems like a better title for this at this point would be "fiddle with threshold settings for lineage path plotting", which is maybe worth doing, but without a working example that does something we don't like, I don't think there's any point in keeping this issue open.