rrwick / Trycycler

A tool for generating consensus long-read assemblies for bacterial genomes
GNU General Public License v3.0

Large differences in pairwise identities #61

Closed by ashleyp1 1 year ago

ashleyp1 commented 1 year ago

I'm trying to reconcile my contigs and ran into a problem at the pairwise identity check: a large proportion of my contig pairs have identities of about 59%. I'm hesitant to fix this by removing contigs, since I would have to remove at least 5 of them, including all of my Raven assemblies, but I'm also unsure whether just lowering the min_identity threshold is right either.

Pairwise identities:
  A_Utg620:     100.00% 59.03%   59.02%   59.03%   92.11%   59.03%   59.03%   98.09%   91.30%   98.07%   59.03% 
  B_contig_4:   59.03%  100.00%  99.68%   99.69%   59.03%   99.98%   99.98%   59.03%   59.08%   59.03%   99.69%
  C_utg000001l: 59.02%  99.68%   100.00%  99.96%   59.03%   99.67%   99.67%   59.04%   59.08%   59.03%   99.38%
  D_utg000001l: 59.03%  99.69%   99.96%   100.00%  59.03%   99.69%   99.68%   59.04%   59.08%   59.03%   99.39%
  E_Utg606:     92.11%  59.03%   59.03%   59.03%   100.00%  59.03%   59.02%   90.39%   84.05%   90.37%   59.04% 
  F_contig_2:   59.03%  99.98%   99.67%   99.69%   59.03%   100.00%  99.99%   59.03%   59.08%   59.03%   99.69%
  G_contig_2:   59.03%  99.98%   99.67%   99.68%   59.02%   99.99%   100.00%  59.03%   59.08%   59.03%   99.69%
  H_Utg612:     98.09%  59.03%   59.04%   59.04%   90.39%   59.03%   59.03%   100.00%  93.06%   99.94%   59.03% 
  I_utg000001c: 91.30%  59.08%   59.08%   59.08%   84.05%   59.08%   59.08%   93.06%   100.00%  93.07%   59.08% 
  J_Utg602:     98.07%  59.03%   59.03%   59.03%   90.37%   59.03%   59.03%   99.94%   93.07%   100.00%  59.03% 
  K_contig_1:   59.03%  99.69%   99.38%   99.39%   59.04%   99.69%   99.69%   59.03%   59.08%   59.03%   100.00% 

I mapped the contigs against each other using nucmer (making plots similar to those from the dotplot function, but it runs a bit faster), and there don't appear to be any big differences. Do you have any recommendations on how to move forward with this?

rrwick commented 1 year ago

It looks like you have two groups of contigs:

  group 1: A, E, H, I, J
  group 2: B, C, D, F, G, K

I suspect that you have some sort of structural rearrangement going on here, e.g. a large inversion of some sequence. Perhaps there is heterogeneity in your sample, i.e. a mix of two different large-scale structures, and assemblers are settling on either one or the other. Since Trycycler does a global alignment, a big structural difference can lead to very low identities.
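
The grouping can be read straight off the identity matrix in the question. Here is a minimal Python sketch that does so by linking any two contigs whose pairwise identity meets a cutoff. The 80% cutoff and the function names are assumptions chosen for illustration, and none of this is part of Trycycler itself:

```python
# Pairwise identities (%) copied from the matrix in the question above.
NAMES = [
    "A_Utg620", "B_contig_4", "C_utg000001l", "D_utg000001l", "E_Utg606",
    "F_contig_2", "G_contig_2", "H_Utg612", "I_utg000001c", "J_Utg602",
    "K_contig_1",
]
IDENTITIES = [
    [100.00, 59.03, 59.02, 59.03, 92.11, 59.03, 59.03, 98.09, 91.30, 98.07, 59.03],
    [59.03, 100.00, 99.68, 99.69, 59.03, 99.98, 99.98, 59.03, 59.08, 59.03, 99.69],
    [59.02, 99.68, 100.00, 99.96, 59.03, 99.67, 99.67, 59.04, 59.08, 59.03, 99.38],
    [59.03, 99.69, 99.96, 100.00, 59.03, 99.69, 99.68, 59.04, 59.08, 59.03, 99.39],
    [92.11, 59.03, 59.03, 59.03, 100.00, 59.03, 59.02, 90.39, 84.05, 90.37, 59.04],
    [59.03, 99.98, 99.67, 99.69, 59.03, 100.00, 99.99, 59.03, 59.08, 59.03, 99.69],
    [59.03, 99.98, 99.67, 99.68, 59.02, 99.99, 100.00, 59.03, 59.08, 59.03, 99.69],
    [98.09, 59.03, 59.04, 59.04, 90.39, 59.03, 59.03, 100.00, 93.06, 99.94, 59.03],
    [91.30, 59.08, 59.08, 59.08, 84.05, 59.08, 59.08, 93.06, 100.00, 93.07, 59.08],
    [98.07, 59.03, 59.03, 59.03, 90.37, 59.03, 59.03, 99.94, 93.07, 100.00, 59.03],
    [59.03, 99.69, 99.38, 99.39, 59.04, 99.69, 99.69, 59.03, 59.08, 59.03, 100.00],
]


def group_contigs(names, identities, threshold):
    """Single-linkage grouping: two contigs share a group if they are
    connected by a chain of pairs with identity >= threshold."""
    unassigned = list(range(len(names)))
    groups = []
    while unassigned:
        stack = [unassigned.pop(0)]
        members = []
        while stack:
            i = stack.pop()
            members.append(i)
            linked = [j for j in unassigned if identities[i][j] >= threshold]
            for j in linked:
                unassigned.remove(j)
            stack.extend(linked)
        groups.append(sorted(names[i] for i in members))
    return groups


def min_identity_within(group, names, identities):
    """Lowest pairwise identity between any two members of a group."""
    idx = [names.index(n) for n in group]
    return min(identities[i][j] for i in idx for j in idx)


# 80% is an arbitrary illustrative cutoff, not a Trycycler default.
groups = group_contigs(NAMES, IDENTITIES, threshold=80.0)
for g in groups:
    print(f"{min_identity_within(g, NAMES, IDENTITIES):.2f}%", ", ".join(g))
```

With these numbers the contigs split into exactly two groups, and the group containing B_contig_4 is much tighter internally (every pair at 99.38% or better) than the group containing A_Utg620 (as low as 84.05%).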

You could confirm this by looking at the dotplots, nucmer alignments, or Mauve alignments. If it does look like a structural rearrangement, I would pick one structure (perhaps arbitrarily) and then delete the contigs that have the other one. For example, just use the group 2 contigs.
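
If you go the nucmer route, one rough way to spot an inversion is to total up how much of the query contig aligns in each orientation. The Python sketch below assumes tab-separated coordinate lines whose first four fields are reference start, reference end, query start, and query end (the layout I would expect from show-coords with its -T and -H options), with reverse-strand matches reported as query start greater than query end. The function name and the three example lines are invented for illustration and are not taken from this dataset:

```python
def orientation_summary(coords_lines):
    """Return (forward_bases, reverse_bases) of aligned query sequence.

    Assumes each line is tab-separated with S1, E1, S2, E2 as the first
    four fields, and that a reverse-strand match has S2 > E2.
    """
    forward = 0
    reverse = 0
    for line in coords_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # blank or malformed line
        try:
            _s1, _e1, s2, e2 = (int(x) for x in fields[:4])
        except ValueError:
            continue  # header or other non-numeric line
        length = abs(e2 - s2) + 1
        if e2 >= s2:
            forward += length
        else:
            reverse += length
    return forward, reverse


# Made-up example: a 5 Mbp contig pair with a 1 Mbp inverted segment.
example = [
    "1\t2000000\t1\t2000000",
    "2000001\t3000000\t3000000\t2000001",
    "3000001\t5000000\t3000001\t5000000",
]
fwd, rev = orientation_summary(example)
print(f"forward: {fwd} bp, reverse: {rev} bp")
```

A contig that aligns entirely in reverse has probably just been assembled on the other strand. A substantial amount in both orientations between the same two contigs is what would point towards an inversion.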

Cases with heterogeneity are some of the trickier scenarios when doing a Trycycler assembly!

Ryan