tseemann / prokka

Rapid prokaryotic genome annotation

CDS labels do not match #624

Open hayleyjaywilson opened 2 years ago

hayleyjaywilson commented 2 years ago

I have carried out the following annotation run on ~1000 isolates: for i in $(more list); do echo ${i}; prokka ${i} --proteins fm204883.genbank --locustag SEQ --outdir ${i}_prokka_results; done

This has annotated the genes fine, but I have an issue with the CDSs. Say a CDS in my reference genome (fm204883) is named SEQ0024. That label does not carry over to the annotated isolates: the SEQ0024 CDS in a different genome is not the same as SEQ0024 in my reference. Have I missed a step? I need to compare various CDSs across lots of genomes, but that isn't possible if the same CDS is labelled differently in each genome. Is there a way to achieve this, please?

0xaf1f commented 2 years ago

Locus tags are numbered incrementally in the genome, so SEQ0024 as you have it will always be the 24th gene in a given sample, which is usually not the same across samples. In any case, you want to be looking at the gene name field instead of the locus tag field when doing your comparison. But locus tag prefixes should also be made unique to each sample to avoid confusion.
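As a rough sketch of both suggestions, something along these lines might work. It assumes `list` holds one isolate filename per line, as in the loop above. The helper names `locus_prefix`, `annotate_all` and `cds_gene_names` are made up for illustration and are not part of Prokka, and `DRY_RUN` is a hypothetical switch that prints the commands instead of running them. The gene-name helper assumes the GFF3 output carries a `gene=` attribute on named CDS features. The prefix is not checked against any locus-tag format rules.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of Prokka): derive a locus-tag prefix from
# an isolate filename, e.g. "isolate_01.fasta" -> "ISOLATE01".
locus_prefix() {
    local base
    base=$(basename "$1")
    base=${base%.*}                      # drop the final extension
    # keep only letters and digits, then upper-case the result
    printf '%s\n' "$base" | tr -cd '[:alnum:]' | tr '[:lower:]' '[:upper:]'
}

# Annotate every isolate named in a list file, giving each one its own
# locus-tag prefix. With DRY_RUN set, the commands are only printed.
annotate_all() {
    local list=$1 i tag
    while IFS= read -r i; do
        [ -n "$i" ] || continue          # skip blank lines
        tag=$(locus_prefix "$i")
        ${DRY_RUN:+echo} prokka "$i" --proteins fm204883.genbank \
            --locustag "$tag" --outdir "${i}_prokka_results" </dev/null
    done < "$list"
}

# Print the sorted, unique gene names of CDS features in a GFF3 file
# (assumes a "gene=" key in the attributes column, column 9).
cds_gene_names() {
    awk -F'\t' '$3 == "CDS" && match($9, /gene=[^;]+/) {
        print substr($9, RSTART + 5, RLENGTH - 5)
    }' "$1" | sort -u
}
```

Running `DRY_RUN=1; annotate_all list` prints the commands for a quick check before committing to ~1000 runs. Gene names shared between two annotations could then be listed with `comm -12 <(cds_gene_names a.gff) <(cds_gene_names b.gff)`, where `a.gff` and `b.gff` are placeholder filenames.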

hayleyjaywilson commented 2 years ago

Thanks, that makes sense now.