splicebox / PsiCLASS

Simultaneous multi-sample transcript assembler for RNA-seq data
16 stars 4 forks source link

Requesting help with some issues with PsiCLASS #1

Open sagnikbanerjee15 opened 4 years ago

sagnikbanerjee15 commented 4 years ago

Hello,

I am running PsiCLASS on 43 RNA-Seq samples from Arabidopsis Thaliana. The whole run completes very fast and also produces quite good results. I compared the assemblies from PsiCLASS with those I obtained from StringTie and Scallop. Though PsiCLASS performs better on a genome-wide scale I noticed a few issues with some transcripts. I am highlighting each issue below. For each case, I have attached an example of IGV screenshot of the locus. In the images, you will find that there are 4 transcriptomic tracks. The first track is the TAIR10 annotation representing the ground truth. The second track represents transcripts merged from StringTie and Scallop. The third track is the transcriptome generated from ONE of the 43 samples. The fourth track represents transcripts from the consensus of all the 43 samples.

1) Trimming transcripts on the ends

Most of the transcripts have the end exons trimmed (image attached). I know that some assemblers like StringTie have parameters to disable trimming. I was wondering if PsiCLASS offers any such parameters. image

2) Some transcripts are not being assembled in spite of having read coverage

Some transcripts have low coverage but have spliced junctions. Those transcripts are not assembled. I have tried reducing the depth coverage cutoff to 0.5 but it did not help.

image

3) Merged transcripts

Transcripts located very closely on the same strand are being merged. Sometimes the two transcripts are of high coverage - separated by a region of low coverage. In other cases, one of the transcripts has a high coverage but the other one has low coverage. There are several examples of this kind in the assembly.

image

image

image

4) Overlapping transcripts on opposite strand

For some overlapping transcripts on opposite strands, the first exon of one transcript is exactly the same as the last exon of the other transcript. There is a noticeable dip in the coverage in that exon which could be used to correctly determine the extent of each exon. image

5) Truncated transcripts

There are a few loci where the transcript seems to have been abruptly truncated in spite of there being read coverage.

image

It would be great if you could help me with these cases.

Thank you.

mourisl commented 4 years ago

Thank you for the feedback and examples!

  1. PsiCLASS determine the transcript ends across all the samples. It uses median position from all the samples to determine the transcription start or end sites. So if you look at one sample, the read coverage might be different from the transcript assembled. From the simulated data set we tested, this approach has better accuracy on detecting true transcript ends.

  2. This could be a bug in the program. Could you please whether the coordinates of introns show up in the subexon/{prefix}_subexon_combined.out file?

  3. These happen mostly because there are enough samples with high read coverage between these two genes, and PsiCLASS may think there is a long exon connecting these two genes. For the second example, based on the coverage, there should be transcripts for the two genes reported no matter what. I'll check that.

  4. Thanks for giving this example. PsiCLASS does not handle this situation yet. I'll look into how to implement this. This could fix the problems in example 3.

  5. The reason is the same for example 1.

Thank you!

sagnikbanerjee15 commented 4 years ago

Hello,

Thank you for the fast reply! Here are my comments.

1) Will it be possible to point to me the place in the code which calculates the median position. I think it would serve my purpose better if I choose the longest one instead of the median length.

2) I checked the subexon_combined.out and also the corresponding splice file under the folder splice/. Both the files contain the exons and all the introns from the locus.

3) Thanks!

4) Thanks!

5) Thanks!

Let me know if you need any more files from my end. I will be happy to provide it to you.

mourisl commented 4 years ago
  1. It's between line 986-993 in CombineSubexon.cpp file. You can change that to: if ( rawSubexons[i].leftType == 0 && rawSubexons[i].rightType != 0 ) { rawSubexons[i].start = rawSubexons[ i ].start ; } else if ( rawSubexons[i].rightType == 0 && rawSubexons[i].leftType != 0 ) { rawSubexons[i].end = rawSubexons[ j-1 ].end ; } The longest one could be too risky, you can try something like i + (j-i)/4 and i+(j-i)*3/4 to try other quantile. After recompile, you can run psiclass with option "--stage 2" to avoid recomputing the previous stages.

  2. Could you please check whether the reads for that gene are mostly aligned to multiple coordinates?

Thanks.

sagnikbanerjee15 commented 4 years ago
  1. Thanks, I will try that.

  2. Yes! I check the locus. All the reads from that region are multi mapped reads. But at the same time, all those mappings are Primary, which means that the alternative alignments were not better. I further found that each read from this locus also maps to TWO additional loci, all on the same chromosome. So these reads are from 3 genes - AT4G09200.1, AT4G09310.1, and AT4G09250.1. Among these AT4G09200.1 and AT4G09310.1 have the exact same sequence. Here is the link to the multiple sequence alignment of the gene sequences. https://www.ebi.ac.uk/Tools/services/rest/muscle/result/muscle-I20191106-014118-0952-29125247-p1m/aln-html

Here is the link to the multiple sequence alignment of their corresponding protein sequences. https://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=muscle-I20191106-015201-0945-65719879-p1m

And as expected, PsiCLASS does not report the gene from any of the 3 loci.

Thank you!

mourisl commented 4 years ago

Even if the aligner reports one alignment as primary alignment, it can have the same alignment quality as the secondary alignments. I think the plant genome is more repetitive than human genome, Since the sequences of the genes are so similar with each other, it is difficult to infer which genes are really expressed, and PsiCLASS is a bit conservative on this and report none of them. No matter what, I will add an option to retain the gene if most of the alignments in it are primary.

sagnikbanerjee15 commented 4 years ago

Yes, I had STAR consider all alignments with the highest same score as primary. So my bam file can potentially have several same score primary alignments. It would be great if you could add an option to output all such genes.

Thanks!

sagnikbanerjee15 commented 4 years ago

I was trying different settings and I found that eliminating the NH and HI from the STAR alignments helps to retain genes that were constructed from multi-mapping reads. So that solves one of the issues I was having! I just wanted to run it by you to make sure this "fix" isn't breaking anything.

Thanks.

mourisl commented 4 years ago

Yes, this "fix" is definitely fine. I just uploaded the code to add the option "--primaryParalog", so it will use primary alignment instead of the proportion of unique alignments to filter those paralog genes. This essentially is equivalent to what you did.

I'm working on other issues.

sagnikbanerjee15 commented 4 years ago

Thank you so much for taking care of these issues!

mourisl commented 4 years ago

What is the length of those falsely merged exons you saw? I think I can add some more tests before merging those exons of medium size.

sagnikbanerjee15 commented 4 years ago

Hello,

I looked at a few examples. For most of the cases, the length of the exons was not the problem. The main issue was with coverage which formed a trough. I have noticed very small exons being merged and also quite large ones. Hopefully, that helps.

Thank you.

sagnikbanerjee15 commented 4 years ago

Hello,

I am generating assemblies from 20 RNA-seq samples. There is a transcript that appears to have been split in spite of having continuous coverage in at least 4 samples. Please see the image attached below. I have not found this issue throughout the genome so I guess it's very occasional but I still wanted to let you know about it. image

Thank you.

sagnikbanerjee15 commented 3 years ago

Hi @mourisl,

Thanks for creating a conda package. I was wondering if you could modify psiclass to include options to extend the end exons. I am currently modifying the code and compiling it. Due to this reason I have to provide the whole package within my pipeline. I will be nice to have an option to generate full-length end exons and not the median as the default.

Thank you.

mourisl commented 3 years ago

Yes, I think I can add an option to specify which rank to determine the end of the exon, such as 0 for the shortest exon from all the samples, 0.5 uses median, and 1 for longest end exon. Is this what you need?

sagnikbanerjee15 commented 3 years ago

Exactly, That's what I need.

sagnikbanerjee15 commented 3 years ago

Hi @mourisl,

Did you get around to implementing this feature in the conda package?

Thanks.

mourisl commented 3 years ago

Sorry for the late reply. I just updated the feature with the new option "--tssTesQuantile". The default value is 0.5, which corresponds to the median previously. I guess conda will automatically update some time?

sagnikbanerjee15 commented 3 years ago

Hello @mourisl,

Thank you for attending to this. I will check it out. I don't think conda will automatically update it. You will need to create a new version of the psiclass package on GitHub and the create a new recipe on conda as a new version.

Thank you.