Exons not being trimmed in ssRNAseq data, combining adjacent genes

mncfletcher commented 9 years ago

Hello,

I've been playing with StringTie for the past few months and I've come to believe that - with the strand-specific RNAseq data that I have, at least - the trimming of exons at drops in coverage is not occurring.

I observe this behaviour for both first/last exons/UTRs and internal exons. In the example attached below, the gene on the right has very good coverage over the last exon/UTR and a drop-off in coverage where the RefSeq reference transcript ends. However, StringTie additionally assembles the reads between the two genes into one long exon, and so ends up stitching two transcripts together into one artefactual one.

[Image: 20150812_igv_02_no_trimming]

This particular assembly was run with a minimum isoform fraction of 0.2. If I lower this to 0.1 or 0.05 there are many, many more untrimmed exons (presumably because they then pass the minimum isoform fraction threshold and are no longer filtered out).

Cheers!

ohdongha commented 9 years ago

Hello, I would like to report the same problem. In the attached example, the RNA-seq peaks appear to indicate that there are three gene models (say, Tp1g09210, Tp1g09220, and Tp1g09230, as in the example). However, StringTie assembled very long transcripts (TCONS_00001299 and 1300) encompassing all three gene models, as the most highly expressed transcripts. This has been happening too often.

I used the most recent version (version 1.0.4), with "-j 5 -c 10 -a 20 -f 0.2", on 100nt strand-specific single-end data. Is there an option for controlling the sensitivity of coverage drop detection? If there is, I would like to increase the stringency of the drop detection.

[Image: 150815_misassembly_example__cut]

Please advise if there are changes in parameters/options that could help solve this problem. Thanks!

mpertea commented 9 years ago

Dear StringTie users, From what I can see in the images you submitted, the correct transcripts were also assembled. "Trimming" in StringTie means introducing a transcription start/stop at a certain location in the genome where the coverage drops substantially, not totally discarding the reads crossing that location. Therefore, while a transcription start might be detected at that point, the reads that cross that location will also be assembled into a different transcript: the one that you wouldn't like to see, but which could just have a different transcription initiation site. The expression level of that transcript is hopefully low.

ohdongha commented 9 years ago

Dear Dr Pertea, Thanks for the response!

In my example, I supplied StringTie with ORF models called by an ab initio gene predictor as a reference annotation (in the IGV screenshot, they are Tp1g09210, 09220, and 09230). StringTie seemed to create a copy of each supplied ORF model by default (in the screenshot, TCONS_00001301, 1302, and 5621). These copies of annotated ORFs (all having class_id "="), being ORFs rather than full transcripts, correctly had mostly 0 FPKM if there were alternative transcripts assembled with proper TSS and TES. However, in the example I attached above, all newly assembled transcripts (TCONS_00001299 and 1300) were chimeras of what appeared to be three separate genes, and had high FPKMs. This was commonplace when genes were closely spaced.
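One way to quantify how often this happens is to flag assembled transcripts that overlap more than one reference gene on the same strand. The following is a minimal sketch and not part of StringTie: the `Interval` tuple, the function names, and all coordinates are made up for illustration (only the gene and transcript IDs come from the screenshot).

```python
from collections import namedtuple

# Hypothetical record type: 1-based, inclusive coordinates as in GTF.
Interval = namedtuple("Interval", "name chrom strand start end")


def genes_spanned(transcript, genes):
    """Names of reference genes overlapping the transcript on the same strand."""
    return [
        g.name
        for g in genes
        if g.chrom == transcript.chrom
        and g.strand == transcript.strand
        and g.start <= transcript.end
        and transcript.start <= g.end
    ]


def flag_chimeras(transcripts, genes, min_genes=2):
    """Map transcript name -> spanned genes, for transcripts hitting >= min_genes."""
    flagged = {}
    for t in transcripts:
        hits = genes_spanned(t, genes)
        if len(hits) >= min_genes:
            flagged[t.name] = hits
    return flagged


# Invented coordinates; in practice these would be parsed from the GTF files.
genes = [
    Interval("Tp1g09210", "chr1", "+", 1000, 2000),
    Interval("Tp1g09220", "chr1", "+", 2500, 3500),
    Interval("Tp1g09230", "chr1", "+", 4000, 5000),
]
transcripts = [
    Interval("TCONS_00001299", "chr1", "+", 950, 5100),
    Interval("TCONS_00001301", "chr1", "+", 1000, 2000),
]
chimeras = flag_chimeras(transcripts, genes)
print(chimeras)
```

This only reports candidates. A transcript spanning two annotated genes may still be genuine read-through, so flagged models would need to be inspected against the coverage track.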

I really like the "trimming" functionality, which appears to estimate TSS and TES positions as best as possible from the RNA-seq coverage. I wonder whether there is a way to control the sensitivity of this trimming. How does StringTie detect the drop in RNA-seq coverage?

I would also like to know which parameter should be changed to exclude from the assembly those low-coverage reads that extend beyond the range of a gene. If I could set a minimum threshold for something like k-mer occurrence, or read coverage (either as an absolute value or, maybe better, relative to adjacent genomic regions), I think some of those assemblies overflowing into a neighboring gene could be prevented. So far I have tried increasing the -c option, but it didn't improve the situation. (Maybe I should try running Jellyfish and excluding low-coverage reads before starting StringTie...?)
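To make the "relative to adjacent genomic regions" idea concrete, here is a rough sketch of a windowed drop detector over per-base coverage. It is an illustration of the requested behaviour only and is not how StringTie detects drops; the function name, the window size, and both thresholds are assumptions.

```python
def find_coverage_drops(coverage, window=50, min_ratio=0.1, min_upstream=10.0):
    """Return positions where the mean coverage of the next `window` bases is
    below `min_ratio` times the mean of the previous `window` bases.

    `coverage` is a list of per-base read depths along one strand.
    Positions whose upstream mean is below `min_upstream` are ignored so that
    noise in weakly covered regions is not reported as a drop.
    """
    drops = []
    for pos in range(window, len(coverage) - window + 1):
        upstream = sum(coverage[pos - window:pos]) / window
        downstream = sum(coverage[pos:pos + window]) / window
        if upstream >= min_upstream and downstream < min_ratio * upstream:
            drops.append(pos)
    return drops


# Synthetic example: a well-covered gene followed by low-level read-through.
profile = [100] * 200 + [5] * 200
drops = find_coverage_drops(profile)
print(drops[0], drops[-1])
```

A sharp boundary is reported as a run of consecutive positions around the true drop, so a real implementation would probably collapse each run to a single coordinate.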

Any advice/suggestion would be appreciated. Thanks again!

Best wishes, Dong-Ha

mncfletcher commented 9 years ago

Dear Dr Pertea,

Thank you for your reply! I can see what you mean - indeed, there are still reads present at that location.

I do agree with Dong-Ha that it would be nice to have finer control of when/how trimming occurs, e.g. by specifying a threshold as a percentage of the coverage of the well-covered parts of the transcript. I have seen examples in my data where StringTie assembles transcripts present in the guide transcriptome reference, but with lengthened 5’ and 3’ UTRs extending past a clear drop-off in coverage. Sometimes these UTRs can be tens of kilobases long, and in my stranded RNAseq data for highly expressed genes, there can be many reads from these (presumably) unprocessed transcripts, and therefore they can have quite high levels of expression in FPKM.
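As a sketch of what such a percentage threshold could look like (a hypothetical illustration, not an existing StringTie option), the function below trims both ends of a transcript's coverage profile back to the first and last positions that reach a given fraction of a reference level. Using the median of the covered bases as the reference level is an assumption.

```python
from statistics import median


def trim_ends(coverage, fraction=0.2):
    """Return (start, end) as a half-open index range after trimming both ends.

    Positions are kept from the first to the last base whose coverage is at
    least `fraction` times the median coverage of all covered (non-zero) bases.
    Internal low-coverage stretches are left untouched.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    covered = [c for c in coverage if c > 0]
    if not covered:
        return None
    threshold = fraction * median(covered)
    keep = [i for i, c in enumerate(coverage) if c >= threshold]
    return keep[0], keep[-1] + 1


# Synthetic example: weakly covered extensions on both sides of a gene body.
profile = [2] * 30 + [80] * 100 + [3] * 40
print(trim_ends(profile, fraction=0.2))
```

When the low-coverage extension is much longer than the gene body, the median itself is pulled down, so a higher percentile would probably be a better reference level in that case.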

Best,

Mike

gpertea commented 8 years ago

The trimming method has been adjusted since this discussion. I'm closing this for now, but feel free to reopen it (or open a new one) if you have more comments/questions/requests about the current trimming approach.

mncfletcher commented 8 years ago

Hello,

I was wondering which version of StringTie has the adjusted trimming method?

I’ve run the same analysis with the same options in v1.0.4 and v1.2.2 and I didn't notice any substantial improvement in trimming at clear drop-offs in coverage. If there’s been another change since then I would happily try again though!

Best,

Mike

gpertea commented 8 years ago

Sorry, I just realized that the recent trimming adjustments did not actually address the original issue reported here, so I'll reopen this. Thank you for pointing this out.

dariober commented 5 years ago

Bump: I'm seeing the same issue with stringtie v1.3.5. I wonder if anything has been done... Thanks!

gpertea / stringtie

Exons not being trimmed in ssRNAseq data, combining adjacent genes #19