gpertea / gffread

GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction and more
MIT License

Output proteome file has unexpected sequences #102

Closed alexvasilikop closed 1 year ago

alexvasilikop commented 2 years ago

Hello,

I want to extract the translated CDS features (concatenated per gene) from the GFF, but some of the extracted sequences contain a "." character. Is this expected?

For example, this is the command I ran:

$ /mnt/sda1/Alex/software/gffread-0.12.7.Linux_x86_64/gffread -C -g Schmidtea_mediterranea.assembly.fa -y Schmidtea_mediterranea.pep.fa --no-pseudo Schmidtea_mediterranea.no_iso.gff

One sequence in the fasta looks like this:

>SMEST011213001.1
MASLKDERSSAEHIRV.LETEAGEYDKLNEKLTDKGNNVKSPEPEISIQLKTSTTKEMKKKLREKINQEL
PSKNSDETEIYSRKSTMYEITRDEPEMRKQEPIYSSLKRNIQEMHSERKCNEEDLNEKKRNWKFGKENS

You can see there is a dot there in the first line.

Best and thanks for the help

unavailable-2374 commented 1 year ago

Hello,

I have run into the same confusion. Is this a problem with the annotation, or something else? It seems this "." should be deleted or replaced with another character.

Best and thanks for the help

gpertea commented 1 year ago

That period character represents a stop codon encountered in the translation. I know the "standard" is unfortunately to use the star (*) character instead, which seems rather inappropriate and misleading for my regex-biased mind :). A period means "end of sentence", so it seemed natural to use that character to depict the stop codon "translation". Anyway, gffread has a -S option that forces the translated output to use * instead of . for stop codons, if you prefer that.
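
For anyone who already has a protein FASTA written with "." and would rather not rerun gffread with -S, here is a minimal post-processing sketch in Python. It is not part of gffread; the function names (read_fasta, internal_stops, normalize_stops) and the toy record are hypothetical and only serve as illustration. It assumes a plain FASTA file in which "." appears in sequence lines only as the stop symbol. Header lines are left untouched, since IDs such as SMEST011213001.1 contain dots of their own.

```python
def read_fasta(text):
    """Yield (record_id, sequence) pairs from FASTA text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Record ID = first whitespace-delimited token after ">"
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def internal_stops(text, stop="."):
    """Map record_id -> 0-based positions of stop symbols that are
    not the final residue of the sequence."""
    hits = {}
    for name, seq in read_fasta(text):
        positions = [i for i, c in enumerate(seq)
                     if c == stop and i != len(seq) - 1]
        if positions:
            hits[name] = positions
    return hits


def normalize_stops(text, old=".", new="*"):
    """Replace the stop symbol in sequence lines only; headers are kept as-is."""
    out = []
    for line in text.splitlines():
        out.append(line if line.startswith(">") else line.replace(old, new))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Toy record, made up for illustration (not taken from the real proteome).
    demo = ">toy.1 example\nMASLK.DE\nRS.\n"
    print(internal_stops(demo))
    print(normalize_stops(demo), end="")
```

The internal_stops helper is there because a stop symbol inside a sequence, rather than at its very end, may be worth a second look at the corresponding CDS annotation; whether that matters for a given record is a judgment call and not something the sketch decides.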