gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
361 stars 76 forks source link

StringTie --merge #402

Open cc-prolix opened 1 year ago

cc-prolix commented 1 year ago

Hey,

I have a question regarding StringTie --merge:

I am using StringTie to assemble novel transcripts in a very small non model genome. The genome has very few introns and mostly single exon genes. I was following the basic reference guided StringTie workflow and have assembled transcripts using a dataset of around 60 paired-end & strand specific RNA-seq samples (50bp) with various reads counts and conditons. In an initial run I looked for assembled transcripts I deemed novel. However, after including more data in the assembly process and merging the assembled files, many transcripts that were predicted prior to the inclusion were lost.

I am running stringtie with default options: stringtie -rf -g 5 -G reference.gtf -o /out.gtf -p 10 -l labels in.bam followed with stringtie --merge: stringtie --merge -p 5 -G reference.gtf -o stringtie_merged.gtf mergelist.txt

I then extracted novel assembled transcripts that were not contained in the reference annotationy. (MSTRG.31 in this example)

seqnames start   end width strand    source       type score phase  gene_id           transcript_id    ref_gene_id exon_number       gene_name
76 LT962476.2 10373 11869  1497      - StringTie transcript  1000    NA MSTRG.30 unassigned_transcript_2 BQ9382_C1-0010        <NA>            XUT5
77 LT962476.2 10373 11869  1497      - StringTie       exon  1000    NA MSTRG.30 unassigned_transcript_2 BQ9382_C1-0010           1            XUT5
78 LT962476.2 12209 13567  1359      - StringTie transcript  1000    NA MSTRG.31              MSTRG.31.1           <NA>        <NA>            <NA>
79 LT962476.2 12209 13567  1359      - StringTie       exon  1000    NA MSTRG.31              MSTRG.31.1           <NA>           1            <NA>
80 LT962476.2 12332 13459  1128      + StringTie transcript  1000    NA MSTRG.32 unassigned_transcript_3 BQ9382_C1-0015        <NA> ACIB2EUKG767782
81 LT962476.2 12332 13459  1128      + StringTie       exon  1000    NA MSTRG.32 unassigned_transcript_3 BQ9382_C1-0015           1 ACIB2EUKG767782
92 LT962476.2 15085 15498   414      - StringTie transcript  1000    NA MSTRG.38 unassigned_transcript_4 BQ9382_C1-0020        <NA> ACIB2EUKG767783
93 LT962476.2 15085 15498   414      - StringTie       exon  1000    NA MSTRG.38 unassigned_transcript_4 BQ9382_C1-0020           1 ACIB2EUKG767783
94 LT962476.2 16359 16598   240      + StringTie transcript  1000    NA MSTRG.39 unassigned_transcript_5 BQ9382_C1-0025        <NA> ACIB2EUKG767784
95 LT962476.2 16359 16598   240      + StringTie       exon  1000    NA MSTRG.39 unassigned_transcript_5 BQ9382_C1-0025           1 ACIB2EUKG767784

I then included two additonal datasets, also consisting of paired-end strand specific data (150bp) with various reads counts and conditons. I ran stringtie-merge for each separate dataset (MSTRG.31 does not pop up at its coorinates for the two additional datasets as I expected due to different sample conditions etc.):

Additional dataset 1:

seqnames start   end width strand    source       type score phase  gene_id           transcript_id    ref_gene_id exon_number       gene_name
136 LT962476.2 10373 11869  1497      - StringTie transcript  1000    NA MSTRG.45 unassigned_transcript_2 BQ9382_C1-0010        <NA>            XUT5
137 LT962476.2 10373 11869  1497      - StringTie       exon  1000    NA MSTRG.45 unassigned_transcript_2 BQ9382_C1-0010           1            XUT5
138 LT962476.2 12332 13459  1128      + StringTie transcript  1000    NA MSTRG.46 unassigned_transcript_3 BQ9382_C1-0015        <NA> ACIB2EUKG767782
139 LT962476.2 12332 13459  1128      + StringTie       exon  1000    NA MSTRG.46 unassigned_transcript_3 BQ9382_C1-0015           1 ACIB2EUKG767782
140 LT962476.2 15085 15498   414      - StringTie transcript  1000    NA MSTRG.47 unassigned_transcript_4 BQ9382_C1-0020        <NA> ACIB2EUKG767783
141 LT962476.2 15085 15498   414      - StringTie       exon  1000    NA MSTRG.47 unassigned_transcript_4 BQ9382_C1-0020           1 ACIB2EUKG767783
142 LT962476.2 16359 16598   240      + StringTie transcript  1000    NA MSTRG.48 unassigned_transcript_5 BQ9382_C1-0025        <NA> ACIB2EUKG767784
143 LT962476.2 16359 16598   240      + StringTie       exon  1000    NA MSTRG.48 unassigned_transcript_5 BQ9382_C1-0025           1 ACIB2EUKG767784

Additional dataset 2 shows two other novel transcripts (MSTRG.32 & MSTRG.33):

seqnames start   end width strand    source       type score phase  gene_id           transcript_id    ref_gene_id exon_number       gene_name
75  LT962476.2  8200  9269  1070      + StringTie transcript  1000    NA MSTRG.32              MSTRG.32.1           <NA>        <NA>            <NA>
76  LT962476.2  8200  9269  1070      + StringTie       exon  1000    NA MSTRG.32              MSTRG.32.1           <NA>           1            <NA>
77  LT962476.2  9389  9693   305      + StringTie transcript  1000    NA MSTRG.33              MSTRG.33.1           <NA>        <NA>            <NA>
78  LT962476.2  9389  9693   305      + StringTie       exon  1000    NA MSTRG.33              MSTRG.33.1           <NA>           1            <NA>
79  LT962476.2 10373 11869  1497      - StringTie transcript  1000    NA MSTRG.34 unassigned_transcript_2 BQ9382_C1-0010        <NA>            XUT5
80  LT962476.2 10373 11869  1497      - StringTie       exon  1000    NA MSTRG.34 unassigned_transcript_2 BQ9382_C1-0010           1            XUT5
81  LT962476.2 12332 13459  1128      + StringTie transcript  1000    NA MSTRG.35 unassigned_transcript_3 BQ9382_C1-0015        <NA> ACIB2EUKG767782
82  LT962476.2 12332 13459  1128      + StringTie       exon  1000    NA MSTRG.35 unassigned_transcript_3 BQ9382_C1-0015           1 ACIB2EUKG767782
83  LT962476.2 15085 15498   414      - StringTie transcript  1000    NA MSTRG.36 unassigned_transcript_4 BQ9382_C1-0020        <NA> ACIB2EUKG767783
84  LT962476.2 15085 15498   414      - StringTie       exon  1000    NA MSTRG.36 unassigned_transcript_4 BQ9382_C1-0020           1 ACIB2EUKG767783
107 LT962476.2 16359 16598   240      + StringTie transcript  1000    NA MSTRG.44 unassigned_transcript_5 BQ9382_C1-0025        <NA> ACIB2EUKG767784
108 LT962476.2 16359 16598   240      + StringTie       exon  1000    NA MSTRG.44 unassigned_transcript_5 BQ9382_C1-0025           1 ACIB2EUKG767784

I expected, that including all assembled GTFs in my mergelist.txt file and executing stringtie-merge would result in a merged annotation including MSTR.31 from my initial run as well as for example MSTRG.32 & MSTRG.33 from the additional dataset 2. However in the final merged GTF they are all missiang at the location:

 seqnames start   end width strand    source       type score phase  gene_id           transcript_id       gene_name    ref_gene_id exon_number
148 LT962476.2 10373 11869  1497      - StringTie transcript  1000    NA MSTRG.48 unassigned_transcript_2            XUT5 BQ9382_C1-0010        <NA>
149 LT962476.2 10373 11869  1497      - StringTie       exon  1000    NA MSTRG.48 unassigned_transcript_2            XUT5 BQ9382_C1-0010           1
150 LT962476.2 12332 13459  1128      + StringTie transcript  1000    NA MSTRG.49 unassigned_transcript_3 ACIB2EUKG767782 BQ9382_C1-0015        <NA>
151 LT962476.2 12332 13459  1128      + StringTie       exon  1000    NA MSTRG.49 unassigned_transcript_3 ACIB2EUKG767782 BQ9382_C1-0015           1
152 LT962476.2 15085 15498   414      - StringTie transcript  1000    NA MSTRG.50 unassigned_transcript_4 ACIB2EUKG767783 BQ9382_C1-0020        <NA>
153 LT962476.2 15085 15498   414      - StringTie       exon  1000    NA MSTRG.50 unassigned_transcript_4 ACIB2EUKG767783 BQ9382_C1-0020           1
154 LT962476.2 16359 16598   240      + StringTie transcript  1000    NA MSTRG.51 unassigned_transcript_5 ACIB2EUKG767784 BQ9382_C1-0025        <NA>
155 LT962476.2 16359 16598   240      + StringTie       exon  1000    NA MSTRG.51 unassigned_transcript_5 ACIB2EUKG767784 BQ9382_C1-0025           1

We are a bit puzzled by this behaviour. Could you expain how stringtie --merge includes or excludes assembled transcripts and what could have happend here?

Thank you very much!

Jungal10 commented 5 months ago

This is a guess here, but I believe it is because they are single-exon. Then, the merge option excludes everything single-exon