Open EidrianGM opened 5 years ago
Good point. I guess gffread is too loose about its GTF output: it does not enforce a gene_id when no such info is found in the input, even though the GTF specification (or at least most software that reads GTF, I suppose) seems to demand a gene_id as well (not just a transcript_id, as I assumed).
There is a simple solution for these cases: I will make gffread print a gene_id with the same value as the transcript_id. (I will add this shortly to the dev branch here on GitHub, and it will soon make its way into the official release.)
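Until that change is released, the same idea can be applied to an existing GTF as a post-processing step. The following is only a minimal sketch of the fallback described above, not gffread's own code: the function name `ensure_gene_id` is hypothetical, and it assumes standard 9-column, tab-separated GTF lines with attributes written as `key "value";`.

```python
import re

# Matches the transcript_id attribute in a GTF attribute column.
_TRANSCRIPT_ID = re.compile(r'\btranscript_id "([^"]*)"')
# Detects an existing gene_id attribute (but not e.g. ref_gene_id).
_GENE_ID = re.compile(r'\bgene_id "')


def ensure_gene_id(gtf_line: str) -> str:
    """Return the line with a gene_id, copied from transcript_id if missing.

    Hypothetical helper: comments, blank lines, malformed lines, and lines
    that already carry a gene_id are returned unchanged.
    """
    if gtf_line.startswith("#") or not gtf_line.strip():
        return gtf_line
    newline = "\n" if gtf_line.endswith("\n") else ""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return gtf_line  # not a standard GTF record, leave it alone
    attrs = fields[8].rstrip()
    if _GENE_ID.search(attrs):
        return gtf_line
    match = _TRANSCRIPT_ID.search(attrs)
    if match is None:
        return gtf_line  # nothing to copy from
    if not attrs.endswith(";"):
        attrs += ";"
    # Reuse the transcript_id value as the gene_id value.
    fields[8] = f'{attrs} gene_id "{match.group(1)}";'
    return "\t".join(fields) + newline
```

Applied line by line to a GTF file, this leaves records that already have a gene_id untouched and only fills in the missing ones, so each affected transcript ends up as its own single-transcript "gene".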
The GFF you referenced actually contains an interesting situation for me: some miRNA transcripts are parented by a primary_transcript feature, which is another transcript, not a gene as I expected, and that is why gffread could not assign a gene_id to such transcripts. This is very unexpected, and it breaks an assumption I had in gffread's code (that parents of transcripts, if present, are always genes). This is more of a note to self, but now I have to rethink and adjust some of the gffread code in order to deal with this new (to me) hierarchy exception.
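One way to picture the adjustment is to follow the Parent chain upward until a gene is reached, instead of assuming that the direct parent is the gene. The sketch below is illustrative only and is not how gffread (which is written in C++) implements it; the function name, the two lookup dictionaries, and the IDs `rna2` and `gene1` are hypothetical.

```python
from typing import Dict, Optional


def find_gene_ancestor(
    feature_id: str,
    parent_of: Dict[str, str],
    type_of: Dict[str, str],
) -> Optional[str]:
    """Walk up the Parent links and return the first ancestor of type 'gene'.

    parent_of maps a feature ID to its Parent ID; type_of maps a feature ID
    to its GFF feature type. Returns None when no gene ancestor exists.
    """
    seen = set()  # guards against accidental cycles in malformed input
    current = parent_of.get(feature_id)
    while current is not None and current not in seen:
        if type_of.get(current) == "gene":
            return current
        seen.add(current)
        current = parent_of.get(current)
    return None


# Hypothetical hierarchy: miRNA -> primary_transcript -> gene
parent_of = {"rna3": "rna2", "rna2": "gene1"}
type_of = {"rna3": "miRNA", "rna2": "primary_transcript", "gene1": "gene"}
print(find_gene_ancestor("rna3", parent_of, type_of))  # prints gene1
```

With this kind of lookup, a miRNA whose parent is a primary_transcript would still inherit the gene_id of the gene two levels up, and the transcript_id fallback would only be needed when no gene ancestor exists at all.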
I wanted to convert the GFF (
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.38_GRCh38.p12/GCF_000001405.38_GRCh38.p12_genomic.gff.gz
) to GTF in order to use RSEM later on. I used gffread for the conversion. RSEM then reported the GTF as corrupted when trying to build the reference.
As the error message says, the "gene_id" tag could not be found in the bold lines (below). Maybe this is because "rna3" is a MIR, so no "gene_id" should be expected?
This is the gtf
If it helps somebody: RSEM could successfully use the GTF output by gffread when the -C flag (coding only: discard mRNAs that have no CDS features) was specified while converting the NCBI GFF.
But what if we want to quantify miRNAs?