splicebox / PsiCLASS

Simultaneous multi-sample transcript assembler for RNA-seq data
16 stars 4 forks source link

Assembling samples from multiple conditions #3

Open sagnikbanerjee15 opened 4 years ago

sagnikbanerjee15 commented 4 years ago

I realized that I will be working with a very large number of samples representing data from over 20 conditions. The paper mentions - "PsiCLASS is designed to leverage the gene information across samples in a one or two condition experiments, to improve the accuracy of gene and transcript reconstruction for downstream differential gene expression and differential splicing studies, and may not be ideally suited to multi-condition experiments, such as annotation of genes and alternative splicing in large data sets such as GTEx24 or TCGA2".

Will it be advisable to submit RNA-Seq alignments from all the samples simultaneously? Or would it be a better approach to individually assemble each project and then use TACO or ST-merge to combine the GTF files?

Thanks.

mourisl commented 4 years ago

For over 20 conditions, I think it would be better to run each condition separately or group a few conditions together. When putting all together, some condition-specific feature would show up in less than 5% of the samples, and could be easily regarded as noise and get filtered.

sagnikbanerjee15 commented 4 years ago

Great! Thanks.

sagnikbanerjee15 commented 4 years ago

Hello,

I am still working with samples from several conditions. I have used string-tie to merge the gtf assemblies but I am not happy with the results. I am aware of the fact that some condition-specific features would be eliminated if it appears in less than 5% of the samples. But PsiCLASS also gives output for each sample separately. So would the less frequent feature be in the sample GTF?

Thank you.

mourisl commented 4 years ago

The most filtered features are introns. If an intron is supported by less than 1% of the samples, it will be filtered no matter what. For other cases of an intron, it quite depends. For example, if only a few samples support an intron, but there are very many supporting reads, PsiCLASS will retain this feature. Once the introns are filtered, they will not show up in the underlying gene model, and the sample-wise GTF will not be able to report them. If the feature is a transcript, the sample GTF file will still report it.

sagnikbanerjee15 commented 4 years ago

Thanks for the explanation. You mentioned that "if only a few samples support an intron, but there are very many supporting reads, PsiCLASS will retain this feature". Is there a specific value for "few samples" or is there a trade-off between presence of an intron in a few samples and the read support for that intron?

mourisl commented 4 years ago

The average read support for an intron should be at least 0.5 across all samples. Though there are some more sophisticated filters such as handling multiple-alignment reads, this is the most essential one.

mourisl commented 4 years ago

It just comes to my mind that there might be a way to keep the condition-specific feature as much as possible. You can run PsiCLASS's first two steps reguarly for each condition (building the intron profile and the subexon graph for each sample). Then you can stop PsiCLASS there and run "combine-subexon" to combine all the sample-level subexon graghs. Then run "psiclass" on the bamlist of all the samples with the new global subexon graph.

sagnikbanerjee15 commented 4 years ago

That could work. As far as I remember PsiCLASS allows the computation to start from a certain point - but not end at a point. Could you please make necessary changes to the code which would allow me to stop the process at a certain point. Providing a mock example would greatly help.

Thank you.

mourisl commented 4 years ago

I just added the option --bamGroup. It needs a file where each line is the group id/name you assigned to the bam file specified in the bamlist file. It will filter the introns group by group, and the remaining steps are the same. Could you please give this a try?

sagnikbanerjee15 commented 4 years ago

I am getting an error which I believe is related to the addXS execution. It often comes up with "head: write error: Broken pipe".

Also, the PsiCLASS run with the new code ended in failure. I was monitoring the execution closely and I found that the junc command executed properly but the memory consumption for 'classes' was very high. This led to the premature termination of the program. Please find attached the log file. combined_classes.log

Thank you.

sagnikbanerjee15 commented 4 years ago

Hello,

I scheduled the run on a different server with more memory. This time the run completed successfully. Surprisingly, it took around 60 mins to assemble 20 samples. Is that too much time or shall I expect it to be that much? It seems like the previous version was faster. Will it be possible for you to create a release with the previous version. I have overwritten the previous download with the new one. I have a deadline to meet soon. I will try out the new version with the --bamGroup option later.

Thanks.

mourisl commented 4 years ago

I haven't updated the core code, so the efficiency should be the same. I guess you used the "--bamGroup" for your test. In this case, the underlying global gene model is much more complicated and could take more time than running without "--bamGroup".

sagnikbanerjee15 commented 4 years ago

Hello,

I think I have figured out the problem. There were several read alignments with huge introns (>50K). That was probably the reason why PsiCLASS was taking way too much time and also a lot of memory and was missing out on several transcripts.

Thanks.

mourisl commented 4 years ago

Yes, those false introns would connect several genes together, and relatively lower abundant genes that should not be connected together might be filtered.