gpertea / gffread

GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction and more
MIT License
362 stars 39 forks source link

CDS and protein sequences missing first base when exon line not present #47

Open monicabritton opened 4 years ago

monicabritton commented 4 years ago

I believe this is related to issue #21

I have a gff3 that only contains gene and CDS lines (no corresponding exon lines). When I run gffread with any combination of -w / -x / -y the transcript sequence is correct, and begins with "ATG", but each header includes "CDS=2-<#>" , and the corresponding CDS begins with "TG" so the protein translations are all incorrect. I am using the latest version, gffread-0.11.6.Linux_x86_64.

Using the info in #21 I was able to get the correct translations with this command: gffread input.gff3 -g genome.fasta -w - | seqmanip.pl -T > proteins.fasta

The fasta headers in proteins.fasta still include the (incorrect) "CDS=2-<#>" but the translations now start with M and I don't see any incorrect STOPs. So, the workaround works, but gffread should be revised.

gpertea commented 4 years ago

That is strange, may I see a few example records from that file? Not sure what you meant with this:

When I run gffread with any combination of -w / -x / -y the transcript sequence is correct, and begins with "ATG"

Obviously -y cannot do that (since the output should be the translated sequence), but it looks like the "exons" are read correctly while CDS have a +1 offset at the beginning. Is it possible that all CDS features somehow have 1 instead of 0 in the "phase" column (8th column) in your file?