gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
365 stars 76 forks source link

Compatibility of -c and -e options? #290

Open fairliereese opened 4 years ago

fairliereese commented 4 years ago

Hi, I'm running StringTie on some data that is well-annotated. I ran the command with the following options:

bam=path/to/bam.bam
prefix=my_prefix
ref_annot=path/to/ref/annot.gtf

./stringtie \
    -L \
    -p 16 \
    -c 1 \
    -e \
    -G ${ref_annot} \
    -o ${prefix}_stringtie.gtf \
    -A ${prefix}_abundance.tsv \
    $bam

What I am expecting in the resultant GTF is for transcripts that are both:

  1. Expressed in the sample with at least a coverage of 1 (enforced by the -c option)
  2. Annotated in the reference GTF (enforced by the -e option)

However, the resulting GTF I get has a large number of transcript entries with coverage = 0 listed in their attributes field (see attached image). Are the -c and -e options supposed to be incompatible? From my testing, this seems to be the case. When removing the -e option, I no longer get any transcripts in my output GTF that violate the -c constraint.

Can someone provide some insight? Thanks!

Screen Shot 2020-07-30 at 10 30 02 AM

gpertea commented 4 years ago

Yes, -c and other similar transcript assembly constraints only apply to StringTie's own assembly process -- when read alignments are "stringed together" into putative transcripts. However the -e option precludes that assembly process, it assumes that the transcripts are already given (in the file specified by the -G option) hence no applicability for any of the assembly process constraints (and the corresponding options). The -e option directs StringTie to just estimate the abundance of an already provided set of transcripts.

fairliereese commented 4 years ago

Okay this explains my results directly, but I am still a little puzzled. I am running my data through the recommended workflow to get DE results, which involved:

  1. Running StringTie on each of my replicates separately, with a reference annotation GTF, and with minimum coverage = 1
  2. Running StringTie merge on the output GTFs to get the union of the transcripts that were found to be expressed in each of the samples (by virtue of the fact that I used the minimum coverage parameter in step 1)
  3. Run StringTie again on each replicate separately, this time using the merged GTF from step 2 as the reference annotation. This is where I tried running it with the -c 1 -e and -c 1 options separately as well.

My assumption for step 3 (when using the -e option) was that since these transcripts passed the expression threshold in step 1, they would have some non-zero expression value associated with them after requantifying them with the updated merged reference GTF. Perhaps my intuition is wrong though. Do you have any insight on this result?