hoelzer-lab / ribap

A comprehensive bacterial core gene-set annotation pipeline based on Roary and pairwise ILPs
GNU General Public License v3.0

non-unique prokka IDs if input sequences are identical #45

Open klamkiew opened 1 year ago

klamkiew commented 1 year ago
Command error:
  Traceback (most recent call last):
    File "/home/co68mol/ribap/bin/combine_roary_ilp.py", line 435, in <module>
      read_roary_table(sys.argv[2])
    File "/home/co68mol/ribap/bin/combine_roary_ilp.py", line 65, in read_roary_table
      strain = formattedArray[column].replace('"','').strip()
  IndexError: list index out of range

Sigh. To quote a famous German musician: "Es koennt alles so einfach sein..." ("It could all be so simple...")

klamkiew commented 1 year ago

It seems to be an issue with our prokkaID to strain mapping:

grep 'JEHEBOFP' strain_ids.txt
JEHEBOFP,Chlamydia_abortus_strain_1H_full_genome_RENAMED
JEHEBOFP,Chlamydia_abortus_strain_AB7_full_genome_RENAMED
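
A quick way to spot such collisions across the whole mapping file is sketched below. It assumes strain_ids.txt holds one `prokkaID,strain` pair per line, as in the grep output above. The function name `find_duplicate_ids` is made up for illustration and is not part of ribap.

```python
from collections import defaultdict


def find_duplicate_ids(lines):
    """Return {prokka_id: [strain, ...]} for every ID mapped to more than one strain.

    Assumes each non-empty line looks like 'PROKKAID,strain_name'.
    """
    strains_by_id = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split only on the first comma so strain names may contain commas.
        prokka_id, strain = line.split(",", 1)
        strains_by_id[prokka_id].append(strain)
    return {pid: names for pid, names in strains_by_id.items() if len(names) > 1}


if __name__ == "__main__":
    # Path is an assumption; adjust to wherever the mapping file lives.
    with open("strain_ids.txt") as handle:
        for pid, names in find_duplicate_ids(handle).items():
            print(f"{pid} is shared by: {', '.join(names)}")
```

Running this before combine_roary_ilp.py would at least turn the IndexError into a readable message.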
klamkiew commented 1 year ago

Aha! Apparently, prokka generates its IDs from the FASTA sequence content, and these two genomes differ only in their header line:

diff Chlamydia_abortus_strain_1H_full_genome.fna Chlamydia_abortus_strain_AB7_full_genome.fna
1c1
< >LN554883.1 Chlamydophila abortus strain 1H genome assembly, chromosome: 1
---
> >LN589721.1 Chlamydophila abortus strain AB7 genome assembly, chromosome: 1

That obviously broke our code, which assumes that each prokka ID belongs to exactly one filename. We need to think about how to handle this in the future!
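
One possible direction, as a rough sketch only: derive the ID from the filename instead of the sequence and hand it to prokka via its `--locustag` option. The helper `locus_tag_from_filename` is hypothetical, and whether this fits into how ribap calls prokka is an assumption. A hash of the filename still does not guarantee uniqueness, so a collision check over all inputs would remain necessary.

```python
import hashlib
import string


def locus_tag_from_filename(filename, length=8):
    """Build an uppercase-letter tag from the MD5 digest of the filename.

    Hypothetical helper: identical sequences in differently named files
    get different tags, because only the name is hashed.
    """
    if not 1 <= length <= 16:
        raise ValueError("length must be between 1 and 16 (MD5 yields 16 bytes)")
    digest = hashlib.md5(filename.encode("utf-8")).digest()
    letters = string.ascii_uppercase
    # Map each digest byte onto A-Z.
    return "".join(letters[byte % 26] for byte in digest[:length])


if __name__ == "__main__":
    for name in (
        "Chlamydia_abortus_strain_1H_full_genome.fna",
        "Chlamydia_abortus_strain_AB7_full_genome.fna",
    ):
        print(name, locus_tag_from_filename(name))
```

The resulting tag could then be passed as `prokka --locustag <TAG> ...` for each input file.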