CDS and protein sequences missing first base when exon line not present

gpertea / gffread

GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction and more

MIT License


I believe this is related to issue #21.

I have a GFF3 file that contains only gene and CDS lines (no corresponding exon lines). When I run gffread with any combination of -w / -x / -y, the transcript sequence is correct and begins with "ATG", but each header includes "CDS=2-<#>" and the corresponding CDS sequence begins with "TG", so the protein translations are all incorrect. I am using the latest version, gffread-0.11.6.Linux_x86_64.
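To check how many records are affected, something like the following Python sketch could scan the -w output. It assumes the FASTA headers carry a field of the form "CDS=<start>-<end>", as in the headers quoted above; the function names are my own and not part of gffread. It flags records whose sequence starts with ATG while the header claims the CDS starts somewhere other than base 1. That is only a meaningful signal for an annotation like this one, where there are no UTRs and the transcript is expected to equal the CDS.

```python
import re

# Assumed header field format, e.g. ">t1 CDS=2-9"
CDS_FIELD = re.compile(r"\bCDS=(\d+)-(\d+)")

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def suspicious_records(lines):
    """Return IDs whose sequence starts with ATG but whose CDS= start is not 1."""
    flagged = []
    for header, seq in read_fasta(lines):
        match = CDS_FIELD.search(header)
        if match and seq.upper().startswith("ATG") and int(match.group(1)) != 1:
            flagged.append(header.split()[0])
    return flagged
```

With a file on disk, this could be called as suspicious_records(open("transcripts.fasta")), where transcripts.fasta is a placeholder name for the saved -w output.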

Using the information in #21, I was able to get the correct translations with this command: gffread input.gff3 -g genome.fasta -w - | seqmanip.pl -T > proteins.fasta

The FASTA headers in proteins.fasta still include the (incorrect) "CDS=2-<#>", but the translations now start with M and I don't see any unexpected stop codons. So the workaround works, but gffread should be fixed.
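For anyone without seqmanip.pl, the workaround amounts to translating each transcript sequence from its first base and ignoring the "CDS=" field in the header. Here is a minimal Python sketch of that idea, assuming the standard genetic code and a transcript that is exactly the CDS; the translate function is my own illustration, not anything gffread provides.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(seq):
    """Translate a nucleotide sequence from its first base (frame 0).

    Unknown or ambiguous codons become 'X'; trailing bases that do not
    fill a complete codon are ignored. Stop codons are written as '*'.
    """
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )
```

Translating the full transcript ATGAAATAA this way gives a protein starting with M, while the same sequence with the first base dropped (TGAAATAA, which is what a CDS starting at base 2 corresponds to) is read in the wrong frame.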

CDS and protein sequences missing first base when exon line not present #47