labgem / PPanGGOLiN

Build a partitioned pangenome graph from microbial genomes
https://ppanggolin.readthedocs.io

Annotation error: gene coordinates exceeding contig length #254

Closed. MilanAdd closed this issue 2 months ago

MilanAdd commented 3 months ago

Hello PPanGGOLiN team,

I am trying your tool for the first time, on Python 3.9, to build a cyanobacteria pangenome at several different identity thresholds, including 28% here. I ran this command:

ppanggolin all --fasta GENOMES_FASTA_LIST.tsv --output pangenome_28 --identity 0.28 --cpu 12 -f

It first ignores the negative coordinates generated by Aragorn, which makes sense, but then it spits out this error related to the gene coordinates exceeding the contig length:

2024-07-24 10:58:21 utils.py:l169 INFO  Command: /home/milu/anaconda3/envs/ppanggolin/bin/ppanggolin all --fasta GENOMES_FASTA_LIST.tsv --output pangenome_28 --identity 0.28 --cpu 12 -f
2024-07-24 10:58:21 utils.py:l170 INFO  PPanGGOLiN version: 2.1.0
2024-07-24 10:58:21 utils.py:l767 INFO  12 parameters have a non-default value.
2024-07-24 10:58:21 annotate.py:l1178 INFO  Reading GENOMES_FASTA_LIST.tsv the list of genome files
2024-07-24 10:58:21 annotate.py:l1195 INFO  Annotating 2121 genomes using 12 cpus...
  0%|                                                | 0/2121 [00:00<?, ?file/s]
2024-07-24 10:58:22 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Thr', 'c[-4,68]', '33', '(ggt)']  This RNA is ignored.
  4%|█▌                                     | 86/2121 [00:54<21:34,  1.57file/s]
2024-07-24 10:59:42 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Asn', '[-2,70]', '33', '(gtt)']  This RNA is ignored.
2024-07-24 11:00:48 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Lys', 'c[-1,72]', '34', '(ttt)']  This RNA is ignored.
2024-07-24 11:00:48 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Thr', 'c[-1,71]', '33', '(tgt)']  This RNA is ignored.
2024-07-24 11:00:48 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Phe', 'c[-2,71]', '34', '(gaa)']  This RNA is ignored.
2024-07-24 11:01:34 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Met', 'c[-1,73]', '34', '(cat)']  This RNA is ignored.
2024-07-24 11:01:54 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Lys', 'c[-1,72]', '34', '(ttt)']  This RNA is ignored.
2024-07-24 11:02:36 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Leu', '[-2,80]', '34', '(tag)']  This RNA is ignored.
2024-07-24 11:02:56 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Lys', 'c[-1,72]', '34', '(ttt)']  This RNA is ignored.
2024-07-24 11:02:56 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Thr', 'c[-1,71]', '33', '(tgt)']  This RNA is ignored.
2024-07-24 11:03:19 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Met', 'c[-2,72]', '35', '(cat)']  This RNA is ignored.
2024-07-24 11:04:42 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Gly', '[-2,70]', '33', '(ccc)']  This RNA is ignored.
2024-07-24 11:05:45 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Asn', 'c[-2,70]', '33', '(gtt)']  This RNA is ignored.
2024-07-24 11:06:04 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Pro', 'c[-3,71]', '35', '(tgg)']  This RNA is ignored.
2024-07-24 11:07:02 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Ser', 'c[-1,88]', '36', '(gga)']  This RNA is ignored.
2024-07-24 11:07:28 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Ile', 'c[-1,75]', '36', '(gat)']  This RNA is ignored.
2024-07-24 11:08:15 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Cys', 'c[-2,70]', '33', '(gca)']  This RNA is ignored.
2024-07-24 11:08:29 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Met', 'c[-2,72]', '35', '(cat)']  This RNA is ignored.
2024-07-24 11:08:29 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Ala', 'c[-4,69]', '34', '(tgc)']  This RNA is ignored.
2024-07-24 11:08:39 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Ser', 'c[-1,86]', '36', '(tga)']  This RNA is ignored.
2024-07-24 11:08:59 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Trp', '[-2,71]', '34', '(cca)']  This RNA is ignored.
2024-07-24 11:09:01 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Met', 'c[-1,73]', '34', '(cat)']  This RNA is ignored.
2024-07-24 11:09:05 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Pro', 'c[-4,70]', '35', '(tgg)']  This RNA is ignored.
2024-07-24 11:10:25 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Met', 'c[-1,75]', '36', '(cat)']  This RNA is ignored.
2024-07-24 11:10:40 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Ser', 'c[-2,83]', '35', '(tga)']  This RNA is ignored.
2024-07-24 11:11:30 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Arg', '[-2,69]', '32', '(ccg)']  This RNA is ignored.
2024-07-24 11:11:36 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Phe', 'c[-1,72]', '34', '(gaa)']  This RNA is ignored.
2024-07-24 11:13:16 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Asp', 'c[-2,72]', '35', '(gtc)']  This RNA is ignored.
2024-07-24 11:13:41 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Lys', 'c[-1,74]', '35', '(ctt)']  This RNA is ignored.
2024-07-24 11:14:33 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Lys', 'c[-1,74]', '35', '(ttt)']  This RNA is ignored.
2024-07-24 11:15:00 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Leu', '[-2,77]', '34', '(tag)']  This RNA is ignored.
2024-07-24 11:15:56 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Met', '[-3,74]', '35', '(cat)']  This RNA is ignored.
2024-07-24 11:16:43 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Met', 'c[-2,70]', '33', '(cat)']  This RNA is ignored.
2024-07-24 11:16:43 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Met', '[-2,70]', '33', '(cat)']  This RNA is ignored.
2024-07-24 11:16:43 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Pro', 'c[-2,72]', '35', '(ggg)']  This RNA is ignored.
2024-07-24 11:17:50 synta.py:l77 WARNING    Aragorn gives non valide coordiates for a RNA gene: ['1', 'tRNA-Ser', 'c[-1,88]', '36', '(gga)']  This RNA is ignored.
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/milu/anaconda3/envs/ppanggolin/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/milu/anaconda3/envs/ppanggolin/lib/python3.9/site-packages/ppanggolin/annotate/synta.py", line 377, in annotate_organism
    gene.add_sequence(get_dna_sequence(contig_sequences[contig.name], gene))
  File "/home/milu/anaconda3/envs/ppanggolin/lib/python3.9/site-packages/ppanggolin/annotate/synta.py", line 316, in get_dna_sequence
    assert highest_position <= len(
AssertionError: Gene coordinates exceed contig length. gene coordinates [(65755, 65827)] vs contig length 65826
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/milu/anaconda3/envs/ppanggolin/bin/ppanggolin", line 10, in <module>
    sys.exit(main())
  File "/home/milu/anaconda3/envs/ppanggolin/lib/python3.9/site-packages/ppanggolin/main.py", line 222, in main
    ppanggolin.workflow.all.launch(args)
  File "/home/milu/anaconda3/envs/ppanggolin/lib/python3.9/site-packages/ppanggolin/workflow/all.py", line 295, in launch
    launch_workflow(args, panrgp=True, panmodule=True)
  File "/home/milu/anaconda3/envs/ppanggolin/lib/python3.9/site-packages/ppanggolin/workflow/all.py", line 96, in launch_workflow
    annotate_pangenome(pangenome, args.fasta, tmpdir=args.tmpdir, cpu=args.annotate.cpu,
  File "/home/milu/anaconda3/envs/ppanggolin/lib/python3.9/site-packages/ppanggolin/annotate/annotate.py", line 1207, in annotate_pangenome
    pangenome.add_organism(future.result())
  File "/home/milu/anaconda3/envs/ppanggolin/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/milu/anaconda3/envs/ppanggolin/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
AssertionError: Gene coordinates exceed contig length. gene coordinates [(65755, 65827)] vs contig length 65826

This is the genome set I'm using, which I obtained from GTDB (I filtered out around 200 of these genomes for my actual set): https://gtdb.ecogenomic.org/advanced?exp=KDEmMiYzKQ~~&1=MX4yfmN5YW5vYmFjdGVyaW90YQ~~&2=NTJ.MTJ.OTk~&3=NTN.MTB.Mw~~

I'm aware that this assertion error is built into the annotation script for certain edge cases, and mine is probably one of them for some reason. Any ideas or suggestions on how to deal with this?

Thank you!

JeanMainguy commented 3 months ago

Hi, I was able to reproduce the issue on my end. Thanks for the clear report.

It looks like the problem comes from Aragorn giving gene coordinates that go beyond the contig length. We knew it sometimes gives negative coordinates, and we handle those cases by throwing a warning and ignoring the gene, as you noticed in your log.
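To make the negative-coordinate case concrete, here is a minimal sketch of how the location field seen in the warnings above (for example 'c[-4,68]' or '[-2,70]') could be parsed and flagged. The names `parse_aragorn_location`, `Location`, and `starts_before_contig` are hypothetical and are not taken from PPanGGOLiN's code; the sketch also assumes that a leading 'c' marks the complement strand.

```python
import re
from typing import NamedTuple

# Hypothetical helper for illustration only; not PPanGGOLiN's actual code.
# Matches location fields such as "c[-4,68]" or "[-2,70]" from the log above.
_LOCATION = re.compile(r"^(c?)\[(-?\d+),(-?\d+)\]$")


class Location(NamedTuple):
    start: int
    stop: int
    strand: str  # "+" or "-" (assumes a leading "c" means complement strand)


def parse_aragorn_location(field: str) -> Location:
    """Parse an Aragorn-style location field into start, stop, and strand."""
    match = _LOCATION.match(field)
    if match is None:
        raise ValueError(f"Unrecognized Aragorn location: {field!r}")
    complement, start, stop = match.groups()
    return Location(int(start), int(stop), "-" if complement else "+")


def starts_before_contig(location: Location) -> bool:
    """Return True when a 1-based start falls before the first base."""
    return location.start < 1
```

With this sketch, `parse_aragorn_location("c[-4,68]")` yields a start of -4 on the complement strand, which `starts_before_contig` flags as invalid.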

However, we didn't anticipate it could also give coordinates that exceed the contig length. We'll fix this by identifying these cases and throwing a similar warning. I'll work on patching that very soon.
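As a rough illustration of that kind of check (not the actual patch), the sketch below rejects any RNA whose coordinates fall outside the contig on either side and logs a warning instead of failing. The function names, the logger name, and the `(name, start, stop)` tuple layout are assumptions made for this example, and coordinates are assumed to be 1-based and inclusive.

```python
import logging

logger = logging.getLogger("aragorn_check")  # hypothetical logger name


def coordinates_fit_contig(start: int, stop: int, contig_length: int) -> bool:
    """Return True when 1-based inclusive coordinates lie within the contig."""
    return 1 <= start <= stop <= contig_length


def keep_valid_rnas(rnas, contig_length):
    """Keep (name, start, stop) tuples that fit the contig; warn about the rest."""
    kept = []
    for name, start, stop in rnas:
        if coordinates_fit_contig(start, stop, contig_length):
            kept.append((name, start, stop))
        else:
            # Mirrors the existing behaviour for negative coordinates:
            # warn and ignore the gene rather than abort the whole run.
            logger.warning(
                "Invalid coordinates for %s: [%d,%d] vs contig length %d. "
                "This RNA is ignored.",
                name, start, stop, contig_length,
            )
    return kept
```

For the values in the traceback, `coordinates_fit_contig(65755, 65827, 65826)` is False because the stop position is one base past the end of the contig, so that gene would be skipped with a warning.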

Thanks for reporting this issue! Best,

JeanMainguy commented 2 months ago

Hi, this bug has been fixed, and the fix is included in version 2.1.1. Best,