gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

How to filter the merged transcripts #351

Open Huangyizhong opened 2 years ago

Huangyizhong commented 2 years ago

Hi there! I used stringtie2 for genome-based transcript assembly. I aligned the RNA-seq data with hisat2, removed PCR duplicates with Picard, and then assembled the transcripts with stringtie2. When I checked some of the transcripts in IGV, I found that some of them are merged from two nearby genes, as shown in the following picture. Are there any parameters or scripts that can be used to filter them out? Any help is appreciated! [image] Thanks so much! Yizhong Huang

gpertea commented 2 years ago

This is usually caused by read alignments spanning/bridging the two genes when they are very close to each other, and there is currently no easy solution for that - if the "evidence" in the read alignments data points to that. What organism is this? It's also useful to look at the read alignments track in IGV, check if there are a lot of reads spanning that intergenic space etc.
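The check suggested above can also be done outside IGV. The following is a minimal sketch, not part of StringTie: it reads SAM text lines (for example from `samtools view`) and counts how many alignments either splice across a given intergenic gap or have aligned bases inside it. The function names, the labels `spliced_across` and `covers_gap`, and the gap coordinates are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_blocks(pos, cigar):
    """Return (aligned_blocks, introns) as 1-based inclusive (start, end) pairs."""
    blocks, introns = [], []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=XD":
            # Operations that consume reference bases within an aligned block.
            blocks.append((ref, ref + length - 1))
            ref += length
        elif op == "N":
            # Skipped reference region, i.e. a putative intron.
            introns.append((ref, ref + length - 1))
            ref += length
        # I, S, H and P do not consume reference bases.
    return blocks, introns


def classify(sam_line, chrom, gap_start, gap_end):
    """Label one alignment relative to the gap, or return None."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, rname, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
    if flag & 4 or rname != chrom or cigar == "*":
        return None  # unmapped, other chromosome, or no CIGAR
    blocks, introns = parse_blocks(pos, cigar)
    if any(s <= gap_start and e >= gap_end for s, e in introns):
        return "spliced_across"  # a junction jumps over the whole gap
    if any(s <= gap_end and e >= gap_start for s, e in blocks):
        return "covers_gap"  # aligned bases fall inside the gap
    return None


def count_bridging(sam_lines, chrom, gap_start, gap_end):
    """Count alignments per label, skipping SAM header lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        label = classify(line, chrom, gap_start, gap_end)
        if label:
            counts[label] += 1
    return counts
```

Many `spliced_across` reads would point to junctions linking the two genes, while many `covers_gap` reads would point to continuous coverage between them; either count is only a starting point for looking at the alignments themselves.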

Those genes seem to be very close to each other (hard to tell without seeing the annotation track), it's not clear if the "fusion" happens due to the terminal exons overlapping (TSS of one gene too close to TES of the other, or post-TES polymerase run-through?), or due to spurious (spliced) read alignments creating false "junctions" linking the two genes.

A script could be devised to split such "chimeric" transcripts but that would be a band-aid solution covering for a possibly deeper issue -- it would be interesting to look closer at WHY that really happens when it does -- there could be situations where such "fusion transcripts" across neighboring genes might be "real" and not just alignment artifacts (e.g. in case of genes sharing a transcriptional unit, i.e. polycistronic transcription which has been shown to be possible in eukaryotes as well, not just in bacterial operons).
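Before splitting anything, it may help to simply list the candidates. The sketch below is an illustration only and not an existing StringTie feature: it flags assembled transcripts whose exons overlap two or more reference genes on the same strand. The `Gene` and `Transcript` records and the function names are hypothetical, and loading them from the GTF files is left out. Flagged transcripts still need manual review, since some of them may be real.

```python
from collections import namedtuple

# Hypothetical minimal records; coordinates are 1-based and inclusive.
Gene = namedtuple("Gene", "gene_id chrom strand start end")
Transcript = namedtuple("Transcript", "transcript_id chrom strand exons")


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def flag_multi_gene_transcripts(transcripts, genes):
    """Map transcript ID to the sorted IDs of the reference genes it overlaps.

    Only transcripts with exons overlapping two or more genes on the same
    chromosome and strand are returned.
    """
    flagged = {}
    for tx in transcripts:
        hits = []
        for gene in genes:
            if gene.chrom != tx.chrom or gene.strand != tx.strand:
                continue
            if any(overlaps(s, e, gene.start, gene.end) for s, e in tx.exons):
                hits.append(gene.gene_id)
        if len(hits) >= 2:
            flagged[tx.transcript_id] = sorted(hits)
    return flagged
```

The nested loop is quadratic in the number of records, which is acceptable for a quick check but would need an interval index for a whole genome.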

Huangyizhong commented 2 years ago

Thanks for your quick reply; I agree with you. I have checked the mapped RNA-seq data in IGV, as shown below. As there are so many mapped reads, I show only part of the alignments. What would you suggest? Thanks again for your kind help! [image] [image]

AmrSaadeldin commented 2 years ago

Hi, I had the same issue; I tried to decrease the maximum intron length in the alignment which solved the problem ~ so far!

Huangyizhong commented 2 years ago

Sounds great! How do I set that parameter, and did it fully solve the problem for you? I filtered the transcripts by the number of exons in the UTR region (below 7). I also checked them in IGV; almost all of the fusion transcripts can be identified this way.

AmrSaadeldin commented 2 years ago

I solved the problem by reducing the maximum intron length in the alignment step, not the assembly. Check your aligner documentation and change this parameter.
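For HISAT2, the relevant option is `--max-intronlen`, which takes an integer number of bases. A minimal sketch of assembling such a command from Python follows; the helper name, the file names, and the value 20000 are hypothetical placeholders, and a sensible limit depends on the intron lengths of the organism being studied.

```python
import shlex


def build_hisat2_command(index_prefix, reads_1, reads_2, out_sam, max_intron_len):
    """Build a paired-end hisat2 command line with a custom maximum intron length."""
    if max_intron_len <= 0:
        raise ValueError("max_intron_len must be a positive integer")
    return [
        "hisat2",
        "-x", index_prefix,                       # HISAT2 index prefix
        "-1", reads_1,                            # mate 1 reads
        "-2", reads_2,                            # mate 2 reads
        "--max-intronlen", str(max_intron_len),   # longest intron allowed
        "-S", out_sam,                            # output SAM file
    ]


# Placeholder inputs; 20000 is an example value, not a recommendation.
cmd = build_hisat2_command(
    "genome_index", "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample.sam", 20000
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` to run the alignment directly.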

Huangyizhong commented 2 years ago

Thanks so much. I used hisat2 to align the Illumina paired-end data, and I found its --max-intronlen option, but I am confused about how to set this parameter. Thanks so much.

On 11 March 2022, at 02:09, AmrSaadeldin @.***> wrote:

maximum intron length