rrwick / Trycycler

A tool for generating consensus long-read assemblies for bacterial genomes
GNU General Public License v3.0
306 stars 28 forks source link

Assertion Error during MSA #10

Closed jdakota1305 closed 3 years ago

jdakota1305 commented 3 years ago

Hi,

First just wanted to say thanks for creating such a valuable tool.

I am currently experiencing an issue when running the MSA step of the workflow. I am getting this:

MSA length: 1,756,900 bp Traceback (most recent call last): File "/home/djones2/anaconda3/envs/trycycler/bin/trycycler", line 8, in sys.exit(main()) File "/home/djones2/anaconda3/envs/trycycler/lib/python3.8/site-packages/trycycler/main.py", line 46, in main msa(args) File "/home/djones2/anaconda3/envs/trycycler/lib/python3.8/site-packages/trycycler/msa.py", line 36, in msa merge_pieces(temp_dir, args.cluster_dir, seqs) File "/home/djones2/anaconda3/envs/trycycler/lib/python3.8/site-packages/trycycler/msa.py", line 187, in merge_pieces assert seqs[n] == msa_minus_dashes AssertionError

Does it seem like there is a resolve for this? Any experience with this issue? Thanks in advance!

rrwick commented 3 years ago

Thank is a strange one! After the sequences are aligned in the MSA, the Trycycler code checks to make sure that they are the same sequences as before. I.e. if you take each sequence and remove the gaps (-), you should have the same sequence you started with. This was intended a sanity check and really shouldn't fail 😬

I can't say what the problem is without deeper investigation. Is there any chance you could share the 2_all_seqs.fasta file with me? If I could reproduce the problem on my end, I could get to the bottom of it.

Ryan

jdakota1305 commented 3 years ago

Thank is a strange one! After the sequences are aligned in the MSA, the Trycycler code checks to make sure that they are the same sequences as before. I.e. if you take each sequence and remove the gaps (-), you should have the same sequence you started with. This was intended a sanity check and really shouldn't fail 😬

I can't say what the problem is without deeper investigation. Is there any chance you could share the 2_all_seqs.fasta file with me? If I could reproduce the problem on my end, I could get to the bottom of it.

Ryan

jdakota1305 commented 3 years ago

2_all_seqs.fasta.gz

Hi Ryan,

Thanks for the quick response. I have attached the requested file. I think this might have to do with the size of the max indel--much appreciated for checking into this.

Best, Dakota

rrwick commented 3 years ago

This turned out to be a fairly straightforward bug: the lowercase characters in your inputs confused Trycycler. I guess I never tested it with an assembler that uses lowercase in its FASTA files!

I've fixed the problem by making Trycycler explicitly convert to uppercase loading and saving FASTAs. You can grab the fixed version by either pulling from the master branch or from this fresh release: v0.4.2.

Thanks for letting me know and sharing your sequence for debugging purposes!

Ryan

spock commented 3 years ago

I have encountered the same issue with version 0.5.0, so this isn't upper/lower case anymore. And I haven't seen any warnings about MUSCLE alignments, so probably not #4 either.

At first, I had 4 assemblies, which looked like this after reconcile:

A_tig00000001 vs B_contig_3...    98.94% identity, max indel = 42
A_tig00000001 vs C_Utg2960...     99.53% identity, max indel = 242
A_tig00000001 vs D_ctg1...        99.92% identity, max indel = 3
   B_contig_3 vs C_Utg2960...     98.51% identity, max indel = 264
   B_contig_3 vs D_ctg1...        98.90% identity, max indel = 251
    C_Utg2960 vs D_ctg1...        99.48% identity, max indel = 224

Pairwise identities:
  A_tig00000001: 100.00%   98.94%   99.53%   99.92%
  B_contig_3:     98.94%  100.00%   98.51%   98.90%
  C_Utg2960:      99.53%   98.51%  100.00%   99.48%
  D_ctg1:         99.92%   98.90%   99.48%  100.00%

Maximum insertion/deletion sizes:
  A_tig00000001:   0   42  242    3
  B_contig_3:     42    0  264  251
  C_Utg2960:     242  264    0  224
  D_ctg1:          3  251  224    0

(To achieve the above, I had to slightly increase the max_indel_size to 265 and max_add_seq to 11000.)

At trycyler msa step this resulted in:

    Each of the MSA pieces are now merged together and saved to file.

MSA length: 3,327,393 bp
Traceback (most recent call last):
  File ".local/bin/trycycler", line 8, in <module>
    sys.exit(main())
  File ".local/lib/python3.7/site-packages/trycycler/__main__.py", line 51, in main
    msa(args)
  File ".local/lib/python3.7/site-packages/trycycler/msa.py", line 36, in msa
    merge_pieces(temp_dir, args.cluster_dir, seqs)
  File ".local/lib/python3.7/site-packages/trycycler/msa.py", line 187, in merge_pieces
    assert seqs[n].upper() == msa_minus_dashes
AssertionError

I then tried removing the most obvious outlier C:

A_tig00000001 vs B_contig_3...    98.94% identity, max indel = 42
A_tig00000001 vs D_ctg1...        99.92% identity, max indel = 3
   B_contig_3 vs D_ctg1...        98.90% identity, max indel = 251

Pairwise identities:
  A_tig00000001: 100.00%   98.94%   99.92%
  B_contig_3:     98.94%  100.00%   98.90%
  D_ctg1:         99.92%   98.90%  100.00%

Maximum insertion/deletion sizes:
  A_tig00000001:   0   42    3
  B_contig_3:     42    0  251
  D_ctg1:          3  251    0

Still got AssertionError, so I removed the "second worst" B, and then trycycler msa worked:

A_tig00000001 vs D_ctg1...        99.92% identity, max indel = 3

Pairwise identities:
  A_tig00000001: 100.00%   99.92%
  D_ctg1:         99.92%  100.00%

Maximum insertion/deletion sizes:
  A_tig00000001: 0  3
  D_ctg1:        3  0

Unfortunately, I can't share the sequences for detailed debugging, but I do have some observations about these assemblies:

This is unlikely to help much, but here are some more outputs from the first stages of the pipeline:

A (canu_2019.fasta):
  288,715 alignments, mean depth = 605.7x
  A_tig00000001: 3,335,954 bp, 609.6x
  A_tig00000003:    29,452 bp, 167.2x

B (flye_2019.fasta):
  288,162 alignments, mean depth = 610.7x
  B_contig_1:    36,610 bp, 499.2x
  B_contig_3: 3,286,113 bp, 611.9x

C (raven_2019.fasta):
  288,532 alignments, mean depth = 608.5x
  C_Utg2958:    18,561 bp, 231.9x
  C_Utg2960: 3,319,309 bp, 610.6x

D (redbean_2019.fasta):
  287,825 alignments, mean depth = 611.1x
  D_ctg2:     8,729 bp,   0.0x
  D_ctg1: 3,307,815 bp, 612.8x

...

    Mash is used to build a distance matrix of all contigs in the assemblies.

A_tig00000001: 0.000  0.250  0.192  0.001  0.250  0.002  0.001
A_tig00000003: 0.250  0.000  0.250  0.250  0.005  0.250  0.250
B_contig_1:    0.192  0.250  0.000  0.250  0.250  0.192  0.204
B_contig_3:    0.001  0.250  0.250  0.000  0.250  0.002  0.001
C_Utg2958:     0.250  0.005  0.250  0.250  0.000  0.250  0.250
C_Utg2960:     0.001  0.250  0.178  0.002  0.250  0.000  0.002
D_ctg1:        0.001  0.250  0.204  0.001  0.250  0.002  0.000

# I have only worked on cluster_001: cluster_003 is a clear artifact,
# and I'm not entirely sure what cluster_002 is - could be a  plasmid assembled to 2x of it's length
# (at least that's what the dotplot suggests).

2019/cluster_001/1_contigs:
  2019/cluster_001/1_contigs/A_tig00000001.fasta: 3,335,954 bp, 609.6x
  2019/cluster_001/1_contigs/B_contig_3.fasta:    3,286,113 bp, 611.9x
  2019/cluster_001/1_contigs/C_Utg2960.fasta:     3,319,309 bp, 610.6x
  2019/cluster_001/1_contigs/D_ctg1.fasta:        3,307,815 bp, 612.8x

2019/cluster_002/1_contigs:
  2019/cluster_002/1_contigs/A_tig00000003.fasta:    29,452 bp, 167.2x
  2019/cluster_002/1_contigs/C_Utg2958.fasta:        18,561 bp, 231.9x

2019/cluster_003/1_contigs:
  2019/cluster_003/1_contigs/B_contig_1.fasta:       36,610 bp, 499.2x

and a dotplot for cluster_001

image

DingJingZhi commented 1 month ago

Merging MSA (2024-09-13 16:33:34) Each of the MSA pieces are now merged together and saved to file. File "/home/djz/miniconda3/envs/trycycler/lib/python3.10/site-packages/trycycler/msa.py", line 196, in merge_pieces assert seqs[n].upper() == msa_minus_dashes AssertionError my solution: line 196, add "#", results is ok, passed!

AssertionError File "/home/djz/miniconda3/envs/trycycler/lib/python3.10/site-packages/trycycler/consensus.py", line 675, in sanity_check_msa assert seqs[n] == msa_seqs[n].replace('-', '')

my solution: line 675, add "#", results is ok, passed!

then, I harvested this: Saving sequence to file: trycycler/cluster_001/7_final_consensus.fasta

RRwich, OK?

rrwick commented 1 month ago

It might be okay, but I would like to understand why that assertion caused a crash. Simply commenting out the assertion line prevents the crash, but it might have allowed corrupted data through.

If you can share your data, could you send me the 2_all_seqs.fasta file? Then I can try running trycycler msa myself to replicate the problem.

marade commented 2 weeks ago

Hi Ryan. I've encountered this error a couple times recently. While I cannot send the sequences, I did inspect 2_all_seqs.fasta in Vim. Specifically I used :/[^ATCG] to search for non-ATCG characters, and there were none except for the headers:

>C_utg000001c
>F_utg000001c
>G_contig_2

Hope this helps.

rrwick commented 2 weeks ago

Sorry, but that doesn't shed any light on the problem :confused:

If you're able to email me the file, I'll delete it once I've debugged the issue. I won't include my email here (for bot-scraping reasons), but you can find it in the license message at the top of Trycycler's source files.

marade commented 2 weeks ago

Okay, I sent you an e-mail.