Open ggstatgen opened 4 years ago
Don't use stringtie --merge for this purpose; that feature was designed for merging multiple stringtie outputs, not generic annotation data. It can lose isoforms from the annotation by "assembling" overlapping/compatible isoforms together, and in most cases you do not want that kind of data loss.
gffread (http://dx.doi.org/10.12688/f1000research.23297.1) would be a better choice for merging multiple annotations by reducing redundancy across multiple datasets, in a more conservative and controlled way. Take a look at the "Clustering" options of gffread (-M/-K/-Q) to control the merge process.
However, gffread won't solve the problem of losing gene entries; those are still going to be lost in the process of merging data from multiple annotation files. Gene entries cannot practically be preserved if you merge two or more transcripts (from different annotation files) that have different gene IDs. gffread can, however, generate locus features during clustering, which also show all the genes and transcripts that were merged/gathered in the same locus. One can easily transform those locus features into gene features. (gffread groups under the same locus all the transcripts linked by exon overlaps, which is how gene features are generally defined.) We can discuss this further in the issues section of the gffread GitHub repository (https://github.com/gpertea/gffread), perhaps.
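As a rough illustration of turning locus features into gene features, here is a minimal Python sketch. It assumes GFF3 output in which each locus line carries an ID attribute and a comma-separated transcripts attribute listing the member transcript IDs; check these attribute names against your own gffread output before relying on it. The function name loci_to_genes is hypothetical, not part of gffread.

```python
def parse_attrs(field):
    """Parse a GFF3 attribute column ("key=value;key=value") into a dict."""
    return dict(kv.split("=", 1) for kv in field.strip().split(";") if "=" in kv)


def loci_to_genes(gff_lines):
    """Relabel 'locus' features as 'gene' and attach member transcripts to them.

    Assumption: locus lines look like
      ...<TAB>locus<TAB>...<TAB>ID=<locus_id>;...;transcripts=<id1>,<id2>
    """
    rows = [ln.rstrip("\n") for ln in gff_lines]

    # First pass: map every transcript ID to the locus that lists it.
    parent_of = {}
    for ln in rows:
        cols = ln.split("\t")
        if len(cols) == 9 and cols[2] == "locus":
            attrs = parse_attrs(cols[8])
            locus_id = attrs.get("ID")
            if locus_id is None:
                continue
            for tid in attrs.get("transcripts", "").split(","):
                if tid:
                    parent_of[tid] = locus_id

    # Second pass: rewrite locus lines and add Parent= to their transcripts.
    out = []
    for ln in rows:
        cols = ln.split("\t")
        if ln.startswith("#") or len(cols) != 9:
            out.append(ln)  # comments, directives, malformed lines: untouched
            continue
        attrs = parse_attrs(cols[8])
        if cols[2] == "locus":
            cols[2] = "gene"
        elif attrs.get("ID") in parent_of and "Parent" not in attrs:
            cols[8] = cols[8].rstrip(";") + ";Parent=" + parent_of[attrs["ID"]]
        out.append("\t".join(cols))
    return out
```

Exon and CDS lines are left as they are, since they already point at their transcript through Parent. The resulting gene features span the whole locus, so several original gene IDs may end up under one gene entry.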
Thanks so much for the speedy reply - will gladly test gffread and let you know how it goes for this on its github repo if needed!
Hi, did anyone test using gffread to improve annotation? How is the result?
thanks, Cui
@gpertea Hi, I am trying to use gffread to improve the axolotl reference annotation. But I read a paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7222033/) which says, "While GffRead can convert, sort, filter, transform, or cluster genomic features, GffCompare can be used to compare and merge different gene annotations." I am a bit confused about which one I should use to improve the axolotl reference annotation. Looking forward to your reply. Thank you in advance.
Hi
The official dog gtf annotation on Ensembl V100 is missing some important genes. Interestingly, Ensembl also holds two other dog annotations, based on 2 different breeds.
I would like to 'improve' the official dog gtf by merging it with the other two gtfs. I have tried using StringTie to do this, based on the following command line
stringtie --merge -p 10 -v -o stringtie_merged_withref.gtf -G Canis_lupus_familiaris.CanFam3.1.100.chr_sorted.gtf input_gtf_list.txt
So I have the reference Ensembl annotation as an argument for -G, followed by a list of the other gtf files to add. The problem with this is that it'll discard all gene entries and only leave entries of type transcript or exon. Any help/suggestions on how to do this properly appreciated.
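To confirm which feature types a merged file actually contains, a small Python sketch like the one below can tally the third column of a GTF. The function name count_feature_types is hypothetical, and the sketch only assumes standard tab-separated GTF lines.

```python
from collections import Counter


def count_feature_types(gtf_lines):
    """Count how often each feature type (GTF column 3) occurs."""
    counts = Counter()
    for line in gtf_lines:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts
```

Passing an open file handle for stringtie_merged_withref.gtf to count_feature_types would be expected to show transcript and exon counts but no gene entries, matching the behaviour described above.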