klebgenomics / Kleborate

GNU General Public License v3.0

Very long headers in the FASTA are not parsed correctly #87

Open CorinYeatsCGPS opened 6 days ago

CorinYeatsCGPS commented 6 days ago

I'm not sure what the length limit is, but I have a few FASTAs with headers longer than 100 characters, which seems to cause Kleborate to fall over during the MLST stage. When I replaced the original headers with shortened versions, the FASTA was processed correctly. Simply putting a long run of digits in the header was enough to trigger the issue. It may also be worth noting that, in the FASTA that triggered this issue, the first 300 characters of every record's header were identical, so the headers couldn't simply be truncated.
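
For anyone hitting the same error, the header-shortening workaround described above could be sketched roughly as follows. This is only an illustration, not part of Kleborate: the function name `shorten_fasta_headers` and the `contig_N` naming scheme are hypothetical. Each header is replaced with a short sequential ID rather than truncated, since truncation would not work when the headers share a long common prefix.

```python
def shorten_fasta_headers(lines, prefix="contig"):
    """Replace every FASTA header with a short, unique ID.

    Hypothetical helper, not part of Kleborate. Returns the renamed
    lines plus a mapping from each short ID back to the original header.
    """
    renamed = []
    mapping = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Sequential IDs stay unique even when the original headers
            # share a long common prefix.
            short_id = f"{prefix}_{len(mapping) + 1}"
            mapping[short_id] = line[1:]
            renamed.append(">" + short_id)
        else:
            renamed.append(line)
    return renamed, mapping


# Two records whose headers differ only after a long shared run of digits.
example = [
    ">" + "2" * 120 + " first record",
    "ACGT",
    ">" + "2" * 120 + " second record",
    "TTGA",
]
new_lines, mapping = shorten_fasta_headers(example)
print(new_lines)  # ['>contig_1', 'ACGT', '>contig_2', 'TTGA']
```

Keeping the mapping makes it possible to restore the original names in the results afterwards.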

strain  species N50     ST      virulence_score resistance_score        num_resistance_classes  num_resistance_genes
Traceback (most recent call last):
  File "/usr/local/bin/kleborate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/__main__.py", line 154, in main
    module_results = modules[module].get_results(unzipped_assembly, minimap2_index, args, results)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/modules/klebsiella_pneumo_complex__mlst/klebsiella_pneumo_complex__mlst.py", line 73, in get_results
    st, _, alleles = mlst(assembly, minimap2_index, profiles, alleles, genes, None,
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/mlst.py", line 44, in mlst
    hits_per_gene = {g: align_query_to_ref(allele_paths[g], assembly_path,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/mlst.py", line 44, in <dictcomp>
    hits_per_gene = {g: align_query_to_ref(allele_paths[g], assembly_path,
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/alignment.py", line 134, in align_query_to_ref
    alignments = [Alignment(x, query_seqs=query_seqs, ref_seqs=ref_seqs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/alignment.py", line 134, in <listcomp>
    alignments = [Alignment(x, query_seqs=query_seqs, ref_seqs=ref_seqs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/alignment.py", line 51, in __init__
    self.set_sequences(query_seqs, ref_seqs)
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/alignment.py", line 88, in set_sequences
    self.ref_seq = ref_seqs[self.ref_name][self.ref_start:self.ref_end]
                   ~~~~~~~~^^^^^^^^^^^^^^^
KeyError: '22222222222222222222222222222222222222222222222222222222222'

Marysteph commented 6 days ago

Thanks @CorinYeatsCGPS. I will address this.

CorinYeatsCGPS commented 5 days ago

After final review I found only one instance of this in almost 300,000 FASTA files, so it's not a big problem! Thanks.
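
A check like the one mentioned above, looking for FASTA files with very long headers, might look something like this minimal sketch. The helper name `find_long_headers` is hypothetical, and the 100-character threshold is only an assumption taken from the original report; the real limit that triggers the error is not known.

```python
def find_long_headers(lines, max_len=100):
    """Return (line_number, header_length) for each over-long FASTA header.

    Hypothetical helper. max_len=100 is an assumed threshold based on the
    report in this thread, not a documented Kleborate limit.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            # Measure the header without the leading '>' or the line ending.
            header = line[1:].rstrip("\r\n")
            if len(header) > max_len:
                hits.append((number, len(header)))
    return hits


sample = [">short_header", "ACGT", ">" + "2" * 150, "GGCC"]
print(find_long_headers(sample))  # [(3, 150)]
```

To scan a real file, an open file handle can be passed directly, for example `find_long_headers(open("assembly.fasta"))`, where the file name is a placeholder.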