gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

Can StringTie --merge be used to create a slightly improved reference annotation? #276

Open ggstatgen opened 4 years ago

ggstatgen commented 4 years ago

Hi

The official dog gtf annotation on Ensembl V100 is missing some important genes. Interestingly, Ensembl also holds two other dog annotations, based on 2 different breeds.

I would like to 'improve' the official dog gtf by merging it with the other two gtfs. I have tried using StringTie to do this, with the following command:

stringtie --merge -p 10 -v -o stringtie_merged_withref.gtf -G Canis_lupus_familiaris.CanFam3.1.100.chr_sorted.gtf input_gtf_list.txt

So I have the reference Ensembl annotation as an argument for -G followed by a list of the other gtf files to add.

The problem is that this discards all gene entries, leaving only entries of type transcript or exon.

Any help or suggestions on how to do this properly would be appreciated.

gpertea commented 4 years ago

Don't use stringtie --merge for this purpose. That feature was designed for merging multiple stringtie outputs, not generic annotation data. It can lose some isoforms from the annotation by "assembling" overlapping/compatible isoforms together, and in most cases you do not want that kind of data loss.

gffread (http://dx.doi.org/10.12688/f1000research.23297.1) would be a better choice for merging multiple annotations by reducing redundancy across multiple datasets, in a more conservative and controlled way. Take a look at the "Clustering" options of gffread (-M/-K/-Q) to control the merge process.

However, gffread won't solve the problem of losing gene entries; those are still going to be lost when merging data from multiple annotation files. In practice, gene entries cannot be preserved if you merge two or more transcripts (from different annotation files) that have different gene IDs. gffread can, however, generate locus features during clustering, and each locus feature also lists all the genes and transcripts that were merged/gathered in that locus. One can easily transform those locus features into gene features. (gffread groups under the same locus all the transcripts linked by exon overlaps, which is how gene features are generally defined.) Perhaps we can discuss this further in the issues section of the gffread github (https://github.com/gpertea/gffread).
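
As a rough illustration of the locus-to-gene transformation mentioned above, here is a minimal Python sketch. It is not part of gffread and makes assumptions: the function name loci_to_genes is hypothetical, the clustered output is assumed to be tab-delimited GFF in which locus records carry the feature type "locus" in column 3, and only that feature type is rewritten. Attributes are left untouched, so transcripts are not re-parented to the new gene records.

```python
def loci_to_genes(gff_lines):
    """Yield GFF lines, retyping 'locus' records as 'gene' records.

    Assumption: locus records have the literal feature type 'locus'
    in column 3 of a 9-column, tab-delimited GFF line.
    """
    for line in gff_lines:
        # Pass comments, directives and blank lines through unchanged.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "locus":
            # Only the feature type changes; coordinates and attributes stay as-is.
            cols[2] = "gene"
            yield "\t".join(cols) + "\n"
        else:
            yield line
```

It could be applied to a file handle, for example by passing an open clustered GFF file to loci_to_genes and writing the yielded lines to a new file. Whether downstream tools accept the result depends on the attributes they expect on gene records, which this sketch does not address.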

ggstatgen commented 4 years ago

Thanks so much for the speedy reply - will gladly test gffread and let you know how it goes for this on its github repo if needed!

smallfishcui commented 1 year ago

Hi, has anyone tested using gffread to improve an annotation? How were the results?

thanks, Cui

dxu104 commented 1 year ago

@gpertea Hi, I am trying to use gffread to improve the axolotl reference annotation. However, a paper I read (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7222033/) says: "While GffRead can convert, sort, filter, transform, or cluster genomic features, GffCompare can be used to compare and merge different gene annotations." I am confused about which one I should use to improve the axolotl reference annotation. Looking forward to your reply. Thank you in advance.