Open monicabritton opened 4 years ago
That is strange, may I see a few example records from that file? Not sure what you meant with this:
When I run gffread with any combination of -w / -x / -y the transcript sequence is correct, and begins with "ATG"
Obviously -y cannot do that (since the output should be the translated sequence), but it looks like the "exons" are read correctly while CDS have a +1 offset at the beginning. Is it possible that all CDS features somehow have 1 instead of 0 in the "phase" column (8th column) in your file?
I believe this is related to issue #21
I have a gff3 that only contains gene and CDS lines (no corresponding exon lines). When I run gffread with any combination of -w / -x / -y the transcript sequence is correct, and begins with "ATG", but each header includes "CDS=2-<#>" , and the corresponding CDS begins with "TG" so the protein translations are all incorrect. I am using the latest version, gffread-0.11.6.Linux_x86_64.
Using the info in #21 I was able to get the correct translations with this command: gffread input.gff3 -g genome.fasta -w - | seqmanip.pl -T > proteins.fasta
The fasta headers in proteins.fasta still include the (incorrect) "CDS=2-<#>" but the translations now start with M and I don't see any incorrect STOPs. So, the workaround works, but gffread should be revised.