rrwick / Trycycler

A tool for generating consensus long-read assemblies for bacterial genomes
GNU General Public License v3.0

question: 0.250 values from mash when building distance matrix #5

Closed spock closed 4 years ago

spock commented 4 years ago

Thanks for an interesting tool, Ryan!

This is my first attempt at running Trycycler on a sample/genome which (for some yet-unknown reason) ends up too fragmented even with decent (~50x) PacBio coverage. (A few other nearly identical samples get assembled very well even at ~30x.)

I've used mash a few times before for small-group and pairwise whole-genome comparisons, so I am surprised to see a particular output.
At the stage of building a distance matrix with mash, I am seeing a peculiar pattern of repeated 0.250 values (wrapped for somewhat better readability):

A_sample_098: 0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  
0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  
0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  
0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  
0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.116  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  
0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  
0.250  0.250  0.000  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  
0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  
0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  
0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  
0.250  0.250  0.244  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  0.250  
0.250  0.250

This goes on and on, pages and pages of scrollback buffer :) , with occasional different values.

The actual question is: is this normal/expected behavior, or a bug in my local environment?
Resulting dendrograms look fine, with variable branch lengths and realistic-looking clustering.

rrwick commented 4 years ago

My apologies - that wasn't very well explained in Trycycler's output! I have found Mash distances get somewhat unreliable with higher values, so I capped them at 0.25. I.e. any Mash distance over 0.25 essentially means 'not closely related'. Capping the distances helped make the trees a bit more manageable.

So the short answer is yes, this is normal/expected behaviour.
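
To illustrate the capping described above, here is a minimal Python sketch. It is not Trycycler's actual code: the function name `cap_distances`, the contig names, and the raw distance values are all hypothetical, and only the 0.25 cap comes from this thread.

```python
from typing import Dict, Tuple

# Cap value mentioned in this thread; everything else here is illustrative.
MAX_DISTANCE = 0.25

Pair = Tuple[str, str]


def cap_distances(distances: Dict[Pair, float],
                  cap: float = MAX_DISTANCE) -> Dict[Pair, float]:
    """Return a copy in which every distance above `cap` is replaced by `cap`."""
    return {pair: min(d, cap) for pair, d in distances.items()}


# Hypothetical pairwise distances between contigs (made-up values).
raw = {
    ("A_contig_1", "A_contig_1"): 0.0,
    ("A_contig_1", "B_contig_1"): 0.116,
    ("A_contig_1", "B_contig_2"): 0.412,
    ("A_contig_1", "C_contig_3"): 1.0,
}

capped = cap_distances(raw)
for (a, b), d in capped.items():
    print(f"{a} vs {b}: {d:.3f}")
```

With a cap like this, every pair of unrelated contigs collapses to the same value, which is why a matrix built from many unrelated contigs shows long runs of 0.250.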

I'm a bit more concerned about the part where you said 'pages and pages of scrollback buffer'. In most cases, a nice input assembly for Trycycler will only have a few contigs. This is because Trycycler is really intended to work on completed genomes, and most bacterial genomes only have a few replicons (I think the most I've seen is ~10). If you have some input assemblies with lots of contigs, I would worry that they are fragmented and not really suitable for use as Trycycler input. So you might get cleaner results by doing a bit of manual curation on your input assemblies (e.g. tossing out assemblies that look fragmented) before running Trycycler cluster.
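
One rough way to start the manual curation suggested above is to count the contigs in each input assembly and set aside the ones with unusually many. The sketch below assumes FASTA input and uses a hypothetical threshold of 10 contigs (loosely based on the replicon count mentioned above); the function names and example data are made up for illustration, and a contig count alone is not a substitute for actually inspecting the assemblies.

```python
import io
from typing import Dict, Iterable, List


def count_contigs(fasta_lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in fasta_lines if line.startswith(">"))


def flag_fragmented(contig_counts: Dict[str, int],
                    max_contigs: int = 10) -> List[str]:
    """Return the names of assemblies with more than `max_contigs` contigs.

    The default threshold is an assumption, not a Trycycler setting.
    """
    return sorted(name for name, n in contig_counts.items() if n > max_contigs)


# Hypothetical in-memory assemblies; with real data, pass open file handles.
tidy = io.StringIO(">chromosome\nACGT\n>plasmid\nACGT\n")
fragmented = io.StringIO("".join(f">contig_{i}\nACGT\n" for i in range(40)))

counts = {
    "assembly_A": count_contigs(tidy),
    "assembly_B": count_contigs(fragmented),
}
print(counts)
print("Possibly fragmented:", flag_fragmented(counts))
```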

And I will definitely put some thought into making the Trycycler cluster output less confusing - thanks!

spock commented 4 years ago

Hi Ryan, thanks for the explanation! Somehow I didn't have the intuition to grep the code for that value :)

You are correct, this particular assembly is troublesome. Moreover, it's not even bacterial, it's a small fungus.

(I do realize that I should exercise the same level of caution with Trycycler as with Unicycler when applying to non-bacterial species, primarily because of circularization.)