Enable quantification using StringTie AND a custom Ensembl genome

mplescher commented 1 year ago

Description of feature

Hi.

I am using the nf-core rnaseq pipeline, version 3.12.0. Since you pointed out that the transcriptome and GTF files in iGenomes are vastly out of date here, I am using a custom Ensembl genome, version 110. I tried out these two:

I would like to use StringTie for transcript assembly and quantification, but had to face this bug. It seems like all genes in the ensembl genome lack the transcript_id required for StringTie. Since StringTie only needs the annotation of transcripts anyway, simply removing all genes from the GTF file solves the problem. e.g. run:

 awk -F'\t' '($3!="gene")' my_genome.gtf > my_genome.no_genes.gtf

Would it be possible to check for ensembl genomes automatically and (temporarily) remove the gene lines if necessary? Many thanks.

pinin4fjords commented 10 months ago

@mplescher could you please provide a reproducible example of this? I've actually written a solution up quickly, but wanted to replicate your issue before I ask to merge, and find that I can't.

For example, in the test profile, I intercepted the stringtie process, and changed the first entry to a gene without a transcript_id attribute, and that seemed to be fine.

MatthiasZepper commented 10 months ago

Probably related to/similar to #1102?

For an example GTF file see this thread in the rnaseq-Slack channel and this gffread issue for some background information.

pinin4fjords commented 10 months ago

Well, this is a stringtie error, so possibly not directly related, but yes, maybe it's empty transcript_id attributes rather than missing ones which are the issue.

MatthiasZepper commented 10 months ago

Yes, that is what I was thinking.

Multiple tools struggle with empty strings in the GTF attributes, because that is against the original format specification. Yet, NCBI and Ensembl release this kind of GTF files now.

This is why I linked the gffread issue above, where Geo Pertea states:

Apparently somebody had the brilliant idea to allow stuff like this (empty string values!) in GTF: This is the kind of abomination that a zombie format like GTF becomes as it tries to "evolve" and be something that was never meant to be. GTF was really meant only for annotating transcripts, nothing else. For any other features -- use GFF3, that's why it's there!

Let's be fair, the vast majority of bioinformatics pipelines care only about transcript/exon/CDS annotation. I assumed/hoped that's why GTF was allowed to survive alongside GFF3 for so long -- as a convenient shortcut for the annotation users to get only the transcript annotation whenever that was all they needed, instead of the GFF3 which can have many other genomic features that might not be needed for those annotation users.

This is an easy fix for me but I really dislike encouraging the survival of bad ideas (which this GTF2.2 perversion clearly is), let's allow "natural selection" of such ideas take its course instead (hopefully).

Based on this, I do not expect a fix for gffread and possibly other tools as well.

Hence, I think, we must remove those lines or at least include a check, because everyone trying to use an up-to-date reference transcriptome in GTF format will likely download an invalid file.

pinin4fjords commented 10 months ago

OK, think this is now addressed in https://github.com/nf-core/rnaseq/pull/1107. I've also added a line to the GTF in the test data to make doubly sure I'm right.

nf-core / rnaseq

Enable quantification using StringTie AND a custom Ensembl genome #1074

Description of feature