gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

Stringtie merge converts Ensembl GTF to MSTRG #214

Open cryptic0 opened 5 years ago

cryptic0 commented 5 years ago

I have been running through the protocol described in Pertea et al 2016. The stringtie function properly imports the gene and transcript IDs from the reference annotation. However, during the stringtie --merge step, it converts them both to MSTRG IDs.

Is there a way to avoid this conversion and keep the original reference IDs? I provided the reference GTF during this step using the -G flag. I also saw some threads on biostars indicating that one needs to use the -l flag as well, but all that does is substitute a custom prefix for MSTRG, which does not solve the problem.

Here is my command line:

 stringtie --merge -p 8 -G refannot.gtf -o stringtie_merged.gtf mergelist.txt

yirenheihei commented 4 years ago

Hi cryptic0, did you solve this? I have the same problem.

wyoibc commented 4 years ago

You can additionally use the -e flag (with -B if you also want Ballgown table output) to restrict stringtie to transcripts that are known and present in the refGTF. However, that still won't get rid of the MSTRG IDs. AFAIK, these internal IDs are assigned when a discovered transcript only partially overlaps a known transcript. If stringtie is not 100% certain (by way of full overlap) of the identity of a transcript, it assigns an MSTRG ID; these are always numbered serially.

The only workaround I know of is to do this manually with Unix sed or awk. If you search on biostars, there is some explanation by Geo of this internal mapping ID assignment, and he might chime in here as well.
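
As a rough illustration of the manual workaround, here is a minimal Python sketch (used in place of sed or awk purely for readability). It assumes that the merged GTF records the original gene ID in a `ref_gene_id` attribute on transcripts that matched the reference; check your own stringtie_merged.gtf before relying on that. The function name `restore_reference_gene_ids` is made up for this example and is not part of stringtie.

```python
import re

# Patterns for GTF column 9 attributes (key "value"; pairs).
# The negative lookbehind stops gene_id from matching inside ref_gene_id.
GENE_ID = re.compile(r'(?<![A-Za-z_])gene_id "([^"]*)"')
TRANSCRIPT_ID = re.compile(r'(?<![A-Za-z_])transcript_id "([^"]*)"')
# Assumption: reference-matched transcripts carry this attribute.
REF_GENE_ID = re.compile(r'ref_gene_id "([^"]*)"')


def restore_reference_gene_ids(lines):
    """Return GTF lines with gene_id replaced by the reference gene ID
    for every transcript that has a ref_gene_id; other lines are unchanged."""
    lines = [line.rstrip("\n") for line in lines]

    # Pass 1: remember the reference gene ID for each transcript_id.
    tx_to_ref = {}
    for line in lines:
        fields = line.split("\t")
        if line.startswith("#") or len(fields) < 9:
            continue
        tx = TRANSCRIPT_ID.search(fields[8])
        ref = REF_GENE_ID.search(fields[8])
        if tx and ref:
            tx_to_ref[tx.group(1)] = ref.group(1)

    # Pass 2: rewrite gene_id on transcript and exon lines alike.
    out = []
    for line in lines:
        fields = line.split("\t")
        if line.startswith("#") or len(fields) < 9:
            out.append(line)
            continue
        tx = TRANSCRIPT_ID.search(fields[8])
        ref = tx_to_ref.get(tx.group(1)) if tx else None
        if ref is not None:
            fields[8] = GENE_ID.sub(
                lambda m, ref=ref: 'gene_id "%s"' % ref, fields[8], count=1
            )
        out.append("\t".join(fields))
    return out
```

To use it, pass the lines of stringtie_merged.gtf to the function and write the returned lines to a new file. Transcripts without a `ref_gene_id` (i.e. novel ones) keep their MSTRG gene ID, so a locus that mixes known and novel transcripts will end up with more than one gene ID; whether that is acceptable depends on the downstream tool.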