splicebox / PsiCLASS

Simultaneous multi-sample transcript assembler for RNA-seq data

PsiCLASS cannot annotate micro-exons #6

Open sagnikbanerjee15 opened 4 years ago

sagnikbanerjee15 commented 4 years ago

Hello,

I tried many alignment options with several aligners and finally found one that can map reads to micro-exons. I fed these alignments, along with the alignments of other reads, to PsiCLASS as 2 single-end sample alignments. In total, I was looking at 12 transcripts which I KNOW have micro-exons, a positive control set. PsiCLASS correctly annotated the micro-exons in 10 of these cases. For the other 2 transcripts, it could not annotate the micro-exons. I will discuss these two cases one by one.

Case 1: [image: Picture1]

The highlighted part contains the micro-exons. I checked the psiclass_output_bam.trusted_splice file and, surprisingly, it contains both of the introns that define the micro-exon. I also checked the subexon_combined.out file, and there is a line corresponding to the micro-exon:

4 695701 695705 1 2 + + -1 -1 -1 1.000000 1.000000 2 695594 695700 2 695706 695795

I am not sure what the numbers mean. There are about 108 reads that define the micro-exon.
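As a rough sanity check on that line, here is a minimal sketch that reads only the first three fields. It assumes (this is not confirmed anywhere in this thread) that the second and third fields are the 1-based, inclusive start and end coordinates of the interval; the function name is hypothetical, and the remaining fields are left uninterpreted.

```python
# Sketch only: assumes fields 2 and 3 of a subexon_combined.out line are
# 1-based inclusive start/end coordinates. This is an assumption, not a
# documented description of the file format.

SUBEXON_LINE = (
    "4 695701 695705 1 2 + + -1 -1 -1 1.000000 1.000000 "
    "2 695594 695700 2 695706 695795"
)


def interval_length(line: str) -> int:
    """Return end - start + 1 for the interval on one whitespace-separated line."""
    fields = line.split()
    start, end = int(fields[1]), int(fields[2])
    return end - start + 1


def flanking_positions_present(line: str) -> bool:
    """Check whether start - 1 and end + 1 both appear among the later fields."""
    fields = line.split()
    start, end = int(fields[1]), int(fields[2])
    rest = fields[3:]
    return str(start - 1) in rest and str(end + 1) in rest


if __name__ == "__main__":
    print(interval_length(SUBEXON_LINE))
    print(flanking_positions_present(SUBEXON_LINE))
```

Under that assumption the interval 695701 to 695705 is 5 bases long, and the positions immediately before and after it (695700 and 695706) both appear later in the same line.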

Case 2: [image: Screen Shot 2020-02-07 at 3 38 10 PM]

This case is very similar to the one above. As in the previous case, the introns and the exons were present in the files, but the annotation was missing.

Thank you.

mourisl commented 4 years ago

From the coverage plot, it seems these micro-exons have much lower expression than the other exons. There are some fine-tuning options that are not exposed by the wrapper psiclass. Could you try adding the option "-f 0.01" to classesOpt in the wrapper psiclass? This will lower the filtration threshold. Thanks.

sagnikbanerjee15 commented 4 years ago

Hello,

I checked my files and found that the micro-exon in the first case has 108 reads. In the second case, there are 21 reads. I have a few examples where a micro-exon was detected with even fewer reads. My only concern about setting the option "-f 0.01" is that it might lead to a lot of false positives in other transcripts.

Thank you.

mourisl commented 4 years ago

From your figure, there are definitely enough reads for that transcript. However, compared with other transcripts from this gene, its abundance might be relatively low. Even though in the first example the expression of the transcript containing the micro-exon visually looks to be about 10% of that of the other transcripts, the underlying gene structure might be more complicated than that. For example, the mate might be aligned to some exon that is not in the gene model, so the abundance could be lower than expected.
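To make the relative-abundance point concrete, here is a minimal sketch of a fraction-style filter. It assumes a transcript is kept only if its abundance is at least f times that of the most abundant transcript in the same gene. That rule, the function name, and all numbers below are hypothetical illustrations, not PsiCLASS's actual implementation of -f.

```python
# Sketch only: a generic relative-abundance filter. The rule and the numbers
# are hypothetical and are NOT taken from the PsiCLASS source code.
from typing import Dict, List


def filter_by_fraction(abundances: Dict[str, float], f: float) -> List[str]:
    """Keep transcripts whose abundance is >= f * (max abundance in the gene)."""
    if not abundances:
        return []
    cutoff = f * max(abundances.values())
    return [name for name, value in abundances.items() if value >= cutoff]


if __name__ == "__main__":
    # Hypothetical gene: the micro-exon transcript sits well below the others.
    gene = {"tx_main_a": 100.0, "tx_main_b": 90.0, "tx_microexon": 4.0}
    print(filter_by_fraction(gene, 0.05))  # micro-exon transcript is dropped
    print(filter_by_fraction(gene, 0.01))  # micro-exon transcript is kept
```

With a filter of this kind, a transcript can be supported by plenty of reads in absolute terms and still be removed because it is low relative to its neighbours, which is why lowering the threshold changes the outcome.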

During development, I tried to relax the filtration (-f) by keeping transcripts with unique exons, but the performance was roughly equivalent to lowering the -f threshold. I think for your example, you can lower the -f threshold but set a stringent --vd option so that the filtration takes effect at the cohort level.