gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

What is the "-m" option doing #312

Open ytcheung opened 3 years ago

ytcheung commented 3 years ago

According to the manual, the "-m" option controls the minimum length allowed for the predicted transcripts. So I expected the output GTF file to contain only transcripts longer than the threshold.

I used "-m 200" for one sample, and then I tried to calculate transcript length from the gtf file by using the following R code:

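library(data.table)  # provides fread()
gtf <- fread("transcripts.gtf")
# Genomic span of each feature (for transcript rows this includes introns)
gtf$length <- gtf$V5 - gtf$V4 + 1
summary(gtf[gtf$V3 == "transcript", ]$length)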

Surprisingly, I got the following results:

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     34     256     354    2831    1209  674207 

So, what does the "-m" option actually mean?
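
A side note on the calculation above: on `transcript` rows, V5 - V4 + 1 is the genomic span, introns included, so it can only overestimate the spliced length. A minimal base R sketch that sums exon lengths per transcript instead is shown here. The GTF records are made up for illustration, and it is only an assumption that a length cutoff such as "-m" is compared against the spliced length rather than the span.

```r
# Toy GTF (tab-separated, 9 columns); these records are invented for illustration.
gtf_text <- paste(
  'chr1\tStringTie\ttranscript\t1000\t2000\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";',
  'chr1\tStringTie\texon\t1000\t1149\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";',
  'chr1\tStringTie\texon\t1900\t2000\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";',
  'chr1\tStringTie\ttranscript\t3000\t3033\t1000\t+\t.\tgene_id "STRG.2"; transcript_id "STRG.2.1";',
  'chr1\tStringTie\texon\t3000\t3033\t1000\t+\t.\tgene_id "STRG.2"; transcript_id "STRG.2.1";',
  sep = "\n"
)

# quote = "" keeps the double quotes inside column 9 from being treated as field quoting.
gtf <- read.delim(text = gtf_text, header = FALSE, quote = "",
                  comment.char = "#", stringsAsFactors = FALSE)

# Pull the transcript_id out of the attribute column.
tid <- sub('.*transcript_id "([^"]+)".*', "\\1", gtf$V9)

# Spliced length = sum of exon lengths per transcript.
is_exon <- gtf$V3 == "exon"
spliced <- tapply(gtf$V5[is_exon] - gtf$V4[is_exon] + 1, tid[is_exon], sum)

# Compare with the genomic span taken from the transcript rows.
is_tx <- gtf$V3 == "transcript"
tx_len <- data.frame(
  transcript_id = tid[is_tx],
  span          = gtf$V5[is_tx] - gtf$V4[is_tx] + 1,
  spliced       = as.integer(spliced[tid[is_tx]])
)
print(tx_len)
```

For a single-exon transcript the two numbers agree, so a minimum span of 34 still implies a spliced length of at most 34.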

mpertea commented 3 years ago

The -m parameter only controls the length of novel assembled transcripts. If you used the -G parameter, all the transcripts in the reference annotation will be considered as well, no matter what their length. I suggest cleaning that reference file of the transcripts you are not interested in before giving it to StringTie.
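
One way to check this on an output file is to split the transcript rows by origin. StringTie output marks transcripts that match a -G annotation entry with a `reference_id` attribute in column 9, so rows without it can be treated as novel. The following base R sketch does this on a made-up three-transcript GTF; the records and the `origin` column name are illustrative assumptions, and the span is used only as a quick upper bound on transcript length.

```r
# Toy StringTie-style transcript rows; invented for illustration.
gtf_text <- paste(
  'chr1\tStringTie\ttranscript\t100\t133\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1"; reference_id "REF1"; ref_gene_id "GENE1";',
  'chr1\tStringTie\ttranscript\t1000\t2000\t1000\t+\t.\tgene_id "STRG.2"; transcript_id "STRG.2.1";',
  'chr1\tStringTie\ttranscript\t5000\t5400\t1000\t-\t.\tgene_id "STRG.3"; transcript_id "STRG.3.1";',
  sep = "\n"
)

gtf <- read.delim(text = gtf_text, header = FALSE, quote = "",
                  comment.char = "#", stringsAsFactors = FALSE)
tx <- gtf[gtf$V3 == "transcript", ]

# Transcripts carrying reference_id came from the annotation; the rest are novel.
tx$origin <- ifelse(grepl("reference_id", tx$V9, fixed = TRUE), "reference", "novel")
tx$span <- tx$V5 - tx$V4 + 1

# Shortest transcript in each group.
min_span <- tapply(tx$span, tx$origin, min)
print(min_span)
```

If the explanation above holds, any transcript below the "-m" cutoff should fall in the "reference" group.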

ytcheung commented 3 years ago

Thank you so much! Now I get it! Is it the same in "--merge" mode, i.e., does the -m parameter control the minimum length of the novel input transcripts?