tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation

fasta headers different between *.fna and *.faa file #674

Closed bheimbu closed 11 months ago

bheimbu commented 12 months ago

Hi @tseemann,

I'm a bit confused here. The .fna and .faa files should have the same headers, right? But my files look like this:

*.fna

>BEC2_contig1
GCAGGTGTGTGTAGCATGGTGTGTGGTGAGAGGGGTCAGTACGTCCGTTGTCAGGCAGGG
AAGGTTTTATGGGTGTTTGGTTCTTAGTAGGGTGTCAAAGATAGCAAAGATAACCTTAAG
ATTGATTATGGGAAATTTTAATGACGGCGGTTGAGCATTTGCAGTTTTCTTGTTGATCGG
TTTAAACTGCAATTGTACCTGCGGGCTGCTGCATGCGTTTCGGCGTGAATCTTGAGCGGC
ATTCGCGAAATATTTATCAGAATATAAAAGTCGCGGACAGCAGTTGTTGGGAATTAATAA
ACGCACTTTATGTTCAAGTCCTTTTGTTACAGTTATGTTTATGTGTGAGGGAGTTTCAGG
ACAAACTGCAATATAGCGTAGTTGGAGGTAAGGAAAAGAAAATTACAGGAAACAGAGACT
ATTAAATAGTTAGGTGCTTCGTAATTTGTGTTTTTCTCTAAATAATTTTTTTTGGCGCTA
CAGCCACCCCCCCCCCCAGTGGGCCAGGGCCTCATTATCGTGGCTTCACAATCACACTCA
GACACACTACACTCGGTAGAACTCCTCCGTACGAGTGATCAGCCGAACGCAGAGACCTCT
ACCTGTCAACACGCAACACTCACAAGAGACAAAGGTCCGCGCCCTCCGGCGGGATTCGAA
CCCGCAATCCCAGGAAGCGAGCGGCCGCAGAACGACGCCTTAGACCGCGCGGCCACCGGG
ATCGGCCAGAAACATCAGAATGATTAATAGAGCAACGGTGACGACGACACTTACTGCCGG
GAAACAAAAACACGCAACAGCCTGCTGATGCAGTTAATTGTAATTACACAAATATTTACT
TACTTACTTACTCCATGGCGCAGAGTCCTTCTTGAGAAGCTGACTGGTTCTGCAGCTGGT
CATATTCCCCACATTTTATCGAACCCGGAAGTTCATTTCCGCACTCACAAGTGCCCGCCA
CCTGTCCCTATCCTGAGCCAACTCCATCCAGTCCCCACAAACCCCTCCCACTTCCTGCAG
GTCCATCTTAATATTATCCTCCCATCTACGTCTGGGTCTCCCCAATGGTCTCTTTCCCTC
AGGTTTCCTCACCAGAAGCCTGTGCACACCTCTCGCCTCCCCATACGCGCCACATGCCCC
GACCATCTCATTCTTCTCGATTTTATCACC
>BEC2_contig2
TTGACAGTATACCAAAAAGGTGTTTATTACTCAGGAATTAAGATCTACAATTATCTACCA
ACAGTCATTAAAGAATTATCTGGTGATAAGAATAAATTCAAACTAGCTCTAAAAAGATAC
CTCTTACATAATTCCCTTTACAGTCTGGAGGAATATTTTAATCCATAATTAACTATGATA
TTAACATTATTCTTATTATTACTTATACTTACTTTAATTAAGTACCTTTAATTGTTAATG
TAGCTATCCTATGACACTACAGTAAGGCACAACTTGTGCTGAAAGTACACTGGCTTTATT
TATTATGCTAAATGTATACATGACCAGTTCCACATCTGTATAAGATCAATGGAATGTGAA

*.faa

>PJHIIHGH_00001 hypothetical protein
MVGACGAYGEARGVHRLLVRKPEGKRPLGRPRRRWEDNIKMDLQEVGGVCGDWMELAQDR
DRWRALVSAEMNFRVR
>PJHIIHGH_00002 hypothetical protein
MEDKCADREDLSNMHSLLQNIKANSEVAPETPALKAYKLPSLIVRYFQDPSAQHWAFMTS
NVTRAVDVTRVNQTRTDILQMSSYSRSTELHVLHPTSRLKIMEDNAGTRNNKKSHLYRPA
KCSALC

It's possible to work out which contig a given protein (e.g. PJHIIHGH_00001) sits on by using a *.gtf file (https://github.com/EnvGen/metagenomics-workshop/blob/master/in-house/prokkagff2gtf.sh), but this is rather tedious.
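As a possibly less tedious route, the lookup could be done directly on the GFF3 file that Prokka writes next to the .fna and .faa files. The following is only a minimal sketch: it assumes the standard nine tab-separated GFF columns, with the contig name in column 1 and a `locus_tag=` (or `ID=`) entry in the attributes column. The function name `locus_to_contig` is hypothetical, and the coordinates in the example line are made up for illustration.

```python
import io


def locus_to_contig(gff_lines):
    """Map each CDS locus_tag (falling back to ID) to (contig, start, end, strand)."""
    mapping = {}
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # the contig sequences follow this marker; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # only CDS features have a matching .faa record
        # Attributes look like "ID=...;locus_tag=...;product=..."
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        tag = attrs.get("locus_tag") or attrs.get("ID")
        if tag:
            mapping[tag] = (cols[0], int(cols[3]), int(cols[4]), cols[6])
    return mapping


# Illustrative excerpt only: the coordinates 10..240 are invented.
example = io.StringIO(
    "##gff-version 3\n"
    "BEC2_contig1\tProdigal:002006\tCDS\t10\t240\t.\t+\t0\t"
    "ID=PJHIIHGH_00001;locus_tag=PJHIIHGH_00001;product=hypothetical protein\n"
    "##FASTA\n"
    ">BEC2_contig1\n"
)

for tag, (contig, start, end, strand) in locus_to_contig(example).items():
    print(tag, contig, start, end, strand, sep="\t")
```

With the command shown in this post, the real input would presumably be `prokka/$sample/$sample.gff`, opened with `open(path)` and passed to the function in place of the `io.StringIO` example.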

My prokka command looks like this: prokka --force --cpus 10 --metagenome --prefix $sample --outdir prokka/$sample $sample.fna

Any help is highly appreciated,

Bastian