Off-by-one error - Githubissues

CDieterich commented 6 years ago

I have tried to use gffread-0.9.9.Linux_x86_64 to merge EnsEMBL and RefSeq GTF annotations with adjusted chromosome naming.

Program call:

cat GRCm38.RefSeqwoChrSlimmed.gtf GRCm38.87_slimmed.gtf |/home/cdieterich/software/gffread-0.9.9.Linux_x86_64/gffread - -g /biodb/genomes/mus_musculus/GRCm38_87/GRCm38_87.fa -o testGTFslim -M -F -T -Q -K

However, I noticed that the stop codon (CDS records) is off by one in the results file if the transcript is on the "+" plus (fine on "-").

Could you please provide a fix.

crip3_plus_igv_snapshot

CDieterich commented 6 years ago

Please note that the "T" from TGA is included in the CDS

gpertea commented 6 years ago

Strange, there should be no reason for that to happen.. I guess you already checked the CDS stop coordinates for all your input transcripts in that region, and none of them has this problem? Because the only possible explanation I see now is that the clustering is somehow causing the extension of all the overlapping CDS to the largest overlapping CDS in the cluster. Anyway, I've just had some changes to the code, could you please try the newest version to check if this is still happening? Use this link for direct download of the pre-compiled Linux x86_64 version: http://ccb.jhu.edu/software/stringtie/dl/gffread-0.9.10.Linux_x86_64.tar.gz

If you still get this CDS stop coordinate shift, could you please isolate a few transcripts from the input GTF you are using and that can reproduce the issue ,and send those to me so I can debug it properly on my side?

eboileau commented 6 years ago

Thanks for your reply. I am also following up on this issue, on behalf of @CDieterich. Input annotations exclude the stop codon in the coding region coordinates. I have tested the latest version using the link, however the problem persist. On the positive strand, the first base of the stop codon is included in the CDS. I am using options -T -F -M -K -Q.

I include a subset of the 2 input annotations file (chrY). issue19.tar.gz

eboileau commented 6 years ago

GRCm38_87Ya.fa.gz The fasta file is split in 2.

eboileau commented 6 years ago

GRCm38_87Yb.fa.gz

CDieterich commented 6 years ago

Dear @gpertea , thank you very much for looking into this. We hope that this problem can be fixed with the input from @eboileau

gpertea commented 6 years ago

Got the files (the genome fasta file wasn't really necessary, was it? it's not needed for the options given), it seems to have only alignments on chrY. Where (what chromosome location) have you noticed this problem on chrY ? I ran gffread with -T -F -M -K -Q on these data (with or without the genome the output GTF is the same), using a command like this:

cat GRCm38Y-RefSeq.gtf GRCm38Y-EnsEMBL.gtf | ../gffread - -o outg.gtf -M -F -T -Q -K

I took a quick look at CDS on the positive strand in the out.gtf but I can't quite see any [systematic] off-by-one at the CDS end coordinates in the output file. Do you have a specific CDS coordinate on these chrY data you sent where I should look to observe this off-by-one anomaly ?

eboileau commented 6 years ago

This is the command I used, with or without -g fasta, with or without option Q for M: cat GRCm38Y-RefSeq.gtf GRCm38Y-EnsEMBL.gtf | gffread - -o merged.gtf -T -F -M -K As far as I can tell, all CDS coordinates on the positive strand include the first base of the stop codon. For example: tga taa

gpertea commented 6 years ago

Thanks, that helped me narrow it down to the cases where a "stop_codon" feature is explicitly given, there is a bug in the code which somehow doesn't add those 3 bases correctly to the CDS feature, and only 1 base is added instead (?!) -- very silly bug, sorry for that mistake, I'm going to fix it swiftly.

By the way, it seems there is no clear consensus out there if the stop codon should be part of the CDS. I've seen some opinions that in GFF3 the CDS features should include it, but the GTF2 convention initially suggested to exclude it.. However, gffread was not devised to output separate separate start/stop codon features, so the output CDS is going to include the stop codon (especially because gffread is GFF3-centric). I hope that's OK with you -- the output CDS is going to include the stop codon entirely (not just the first base of it, as I admitted that is just a bug). However during protein translation (when using the -y option), gffread will simply drop the stop codon translation from the output. Of course that codon will still be there in the spliced CDS sequence (-w option).

gpertea commented 6 years ago

I fixed the code to correctly include the whole stop codon in the CDS, please re-download the fixed package (didn't change the version as there was only a minor fix), the pre-compiled Linux x86_64 version is here (same link as before): http://ccb.jhu.edu/software/stringtie/dl/gffread-0.9.10.Linux_x86_64.tar.gz

Please let me know if you find any other problems with this version.

eboileau commented 6 years ago

Thanks for providing a fix promptly. Indeed, GFF3 specs advocate for including it in the coding region. That seem to resolve the problem.

gpertea / gffread

Off-by-one error #19