gpertea / gffread

GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction and more
MIT License
373 stars 39 forks source link

GFF to GTF transformation to use then RSEM #33

Open EidrianGM opened 5 years ago

EidrianGM commented 5 years ago

I wanted to transform the gff (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.38_GRCh38.p12/GCF_000001405.38_GRCh38.p12_genomic.gff.gz) to gtf. This was in order to use RSEM later on. I used:

gffread GCF_000001405.38_GRCh38.p12_genomic.gff -E -T -o GCF_000001405.38_GRCh38.p12_genomic.gtf

Then rsem found the gtf corrupted when trying to create the reference genome:

The GTF file might be corrupted! Stop at line : NC_000001.11 BestRefSeq exon 17369 17391 . - . transcript_id "rna3"; Error Message: Cannot find gene_id!

As the error message say the _"geneid" tag could not be found in the bold lines (below). Maybe this is due to "rna3" being a MIR thus no "gene_id" should be expected?

This is the gtf

NC_000001.11 BestRefSeq transcript 11874 14409 . + . transcript_id "rna0"; gene_id "gene0"; gene_name "DDX11L1"; NC_000001.11 BestRefSeq exon 11874 12227 . + . transcript_id "rna0"; gene_id "gene0"; NC_000001.11 BestRefSeq exon 12613 12721 . + . transcript_id "rna0"; gene_id "gene0"; NC_000001.11 BestRefSeq exon 13221 14409 . + . transcript_id "rna0"; gene_id "gene0"; NC_000001.11 BestRefSeq transcript 14362 29370 . - . transcript_id "rna1"; gene_id "gene1"; gene_name "WASH7P"; NC_000001.11 BestRefSeq exon 14362 14829 . - . transcript_id "rna1"; gene_id "gene1"; NC_000001.11 BestRefSeq exon 14970 15038 . - . transcript_id "rna1"; gene_id "gene1"; [..] NC_000001.11 BestRefSeq transcript 17369 17391 . - . transcript_id "rna3"; gene_name "MIR6859-1"; NC_000001.11 BestRefSeq exon 17369 17391 . - . transcript_id "rna3"; NC_000001.11 BestRefSeq transcript 17369 17436 . - . transcript_id "rna2"; gene_id "gene2"; gene_name "MIR6859-1"; NC_000001.11 BestRefSeq exon 17369 17436 . - . transcript_id "rna2"; gene_id "gene2";

If it helps to someboy RSEM could successfully use the gtf outputed by gffread specifying the flag -C (coding only discard mRNAs that have no CDS features) when transforming the ncbi gff:

gffread GCF_000001405.38_GRCh38.p12_genomic.gff -E -T -C -o GCF_000001405.38_GRCh38.p12_genomic.gtf

But what if we want to quantify miRNAS?

gpertea commented 5 years ago

Good point, I guess gffread is too loose about GTF output, not enforcing a gene_id when no such info is found in the input, even though likely the GTF specification (or most software expectations about GTF input, I suppose) seems to demand a gene_id as well (not just transcript_id as I assumed).

There is a simple solution to these cases: I will make gffread print a gene_id which has the same value as transcript_id's value. (I will add this shortly to the dev branch here in github and it'll soon make its way to the official release).

The gff you referenced actually has an interesting situation for me, where some miRNA transcripts are parented by a _primarytranscript feature, which is another transcript, not a gene, as I expected -- and that's why gffread couldn't assign a gene_id to such transcripts. This is very unexpected, and it breaks this assumption I had in gffread's code (that parents of transcripts, if present, are always genes). This is more of a note to self, but now I have to rethink and adjust some of the gffread code in order to deal with this new (to me) hierarchy exception.