ratschlab / spladder

Tool for the detection and quantification of alternative splicing events from RNA-Seq data.

Could AS quantification results from a single sample be merged directly downstream? #136

Closed qingwangchen closed 2 years ago

qingwangchen commented 3 years ago

Description

We standardized the process of using SplAdder on AliCloud (https://www.aliyun.com/) with the WDL language. The details can be seen at http://47.103.223.233/chenqingwang/SplAdder/src/branch/master/tasks/spladder.wdl.

We used spladder to obtain quantitative results for the six types of AS events for each single sample in the cohort, and would like to ask whether there is a way to merge these quantitative results into one single file downstream. If so, which column should be used as the merge ID?

What I Did

The command can be seen at http://47.103.223.233/chenqingwang/SplAdder/src/branch/master/tasks/spladder.wdl.

spladder build --bams ${bam} \
                --annotation ${reference_gtf_file} \
                --ignore-mismatches \
                --confidence 2 \
                --readlen 150 \
                --parallel $nt \
                --outdir ${sample_id}/spladder_out
akahles commented 3 years ago

Dear @qingwangchen,

Thank you very much for reaching out. To answer your question, I would like to better understand your use case. If I understand correctly, you are running SplAdder on several single samples in parallel, identifying alternative splicing events for each sample. Then you would like to merge all events across all samples into a joint table. Is this understanding correct?

The main issue with this setup would be the following. SplAdder will only detect an alternative splicing event if the event is used alternatively (that is, there is read support for both paths of the event) in the given sample. That means, looking at an exon skipping event, you would need to see read support for both inclusion and exclusion for the event to be reported.

To work around this, it is recommended to run SplAdder on cohorts (or sets of samples). Then, for each sample, an augmented splicing graph is generated, all graphs are merged across all samples into a joint merged graph, and event calling and quantification happen on this joint graph.

How this process can be parallelised across samples is described in the documentation.

Please let me know in case I misunderstood your use case.

Best,

Andre

qingwangchen commented 3 years ago


Dear Andre,

Thanks very much for your reply. You didn't misunderstand what I meant. I have also tested the relevant process according to the SplAdder manual for large-cohort computing (the link you provided). The process ran through on a dataset of 20 samples. However, when I increased the number of samples to 192, it has not been able to run through (so far it has spent more than 10 days in the last step, "Event Calling", without generating final results).

The reason I raised this question is that, in practice, there are more than 900 samples in our cohort for which we need SplAdder to quantify AS events; unfortunately, the performance of our cluster does not seem to meet the requirements of the software. Merging the single splicing graphs of 520 samples took a week and occupied a large amount of the machine's memory, and the eventual memory shortage interrupted the graph-merging process. Therefore, we hoped to solve the computing problem by merging downstream, but that seems not to be a feasible approach.

Because I don't know much about software written in Python, I still don't understand the similarities and differences between calling AS events in one sample versus across multiple samples. What is the underlying principle? I would appreciate it if you could explain.

Nevertheless, I still have the following questions.

  1. If we want to use SplAdder to quantify AS events for more than 900 samples, what are the configuration requirements for the cluster (HPC)?

  2. Is the run time we are currently seeing normal? For example, the 192 samples mentioned above took more than 10 days in the "Event Calling" phase without producing final results, and merging the single splicing graphs of 520 samples took one week (7 days).

  3. Do you have any good solutions to the computing-resource problems we encounter (mainly in the "Merged graph" and "Event Calling" parts)?

I hope I have described my question clearly and look forward to your reply. Thank you very much!

Best wishes, Qingwang

akahles commented 3 years ago

Dear Qingwang,

Thank you very much for getting back with more details. For very large cohorts, there are a couple of additional options available that allow for additional parallelisation.

First, merging of single splice graphs. In the default configuration of the merge_graphs step (as described here), the graphs of the individual samples are merged into each other linearly. If you have many samples, this can take some time. As an alternative, you can build a merge tree. You invoke this with the parameter --chunked-merge, which takes 4 additional inputs: LEVEL MAX_LEVEL START END. These parameters describe which part of the merge tree is currently being computed.

Staying with your example of 192 samples, you could merge chunks of 8 samples at a time. That is, the lowest level of your tree has 24 jobs, merging 8 samples each. The next level has 3 jobs, each merging 8 of the level-1 results, and the top level just merges the 3 level-2 intermediates. As SplAdder cannot send off its own jobs, you need to compute (and organise) the merge tree jobs externally, e.g., using CWL. To stay with the example, the first merge job on the lowest level would get --chunked-merge 1 3 0 8.
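The chunk bookkeeping for such a merge tree can be scripted outside of SplAdder. A minimal sketch for the 192-sample example (illustrative only; the exact semantics of LEVEL MAX_LEVEL START END should be checked against the SplAdder documentation):

```python
# Sketch: enumerate the --chunked-merge jobs for a merge tree over
# 192 samples, merging chunks of 8 at a time (3 levels in total).
def merge_tree_jobs(n_samples, chunk, max_level):
    """Yield one (level, max_level, start, end) tuple per merge job."""
    n = n_samples  # number of items to merge at the current level
    for level in range(1, max_level + 1):
        for start in range(0, n, chunk):
            yield (level, max_level, start, min(start + chunk, n))
        n = -(-n // chunk)  # ceiling division: intermediates this level produces

for level, max_level, start, end in merge_tree_jobs(192, 8, 3):
    # each line corresponds to one externally scheduled merge job
    print(f"spladder build ... --chunked-merge {level} {max_level} {start} {end}")
```

This yields 24 jobs on level 1, 3 jobs on level 2, and a single top-level job; the first printed line ends in `--chunked-merge 1 3 0 8`, matching the example above.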

Now to event calling. If the graphs are very complex, this can take a while. I have addressed this in the current development branch, where the event calling is now done using Numba, which is much faster. Until then, you can choose to ignore the very complex graphs by lowering --ase-edge-limit from its default of 500.

Internally, we use a Nextflow workflow to orchestrate these jobs. I will try to make this public along with the documentation, so it can be used as an example.

Best,

Andre

qingwangchen commented 3 years ago

Dear Andre,

Thanks for your suggestion, but it seems that I have encountered new problems.

For event calling, just like @mhjiang97, when I add the --ase-edge-limit 250 parameter, the same error as in #109 is reported, as follows.

confidence 2 / sample 0
Loading gene structure from ./spladder_out/spladder/genes_graph_conf2.merge_graphs.pickle ...
... done.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/mnt/home_ssd/CBCGA/AS/miniconda3/envs/spladder/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/mnt/home_ssd/CBCGA/AS/miniconda3/envs/spladder/lib/python3.8/site-packages/spladder/alt_splice/detect.py", line 373, in detect_wrapper
    return (detect_intronreten(genes, gidx, log, edge_limit), idx)
  File "/mnt/home_ssd/CBCGA/AS/miniconda3/envs/spladder/lib/python3.8/site-packages/spladder/alt_splice/detect.py", line 128, in detect_intronreten
    if edges.shape[0] > edge_limit:
TypeError: '>' not supported between instances of 'int' and 'str'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/home_ssd/CBCGA/AS/miniconda3/envs/spladder/bin/spladder", line 8, in <module>
    sys.exit(main())
  File "/mnt/home_ssd/CBCGA/AS/miniconda3/envs/spladder/lib/python3.8/site-packages/spladder/spladder.py", line 190, in main
    options.func(options)
  File "/mnt/home_ssd/CBCGA/AS/miniconda3/envs/spladder/lib/python3.8/site-packages/spladder/spladder_build.py", line 252, in spladder
    collect_events(options)
  File "/mnt/home_ssd/CBCGA/AS/miniconda3/envs/spladder/lib/python3.8/site-packages/spladder/alt_splice/collect.py", line 76, in collect_events
    idx_intron_reten, intron_intron_reten = detect_events(genes, 'intron_retention', sp.where([x.is_alt for x in genes])[0], options)
  File "/mnt/home_ssd/CBCGA/AS/miniconda3/envs/spladder/lib/python3.8/site-packages/spladder/alt_splice/detect.py", line 391, in detect_events
    tmp = result.pop(0).get()
  File "/mnt/home_ssd/CBCGA/AS/miniconda3/envs/spladder/lib/python3.8/multiprocessing/pool.py", line 768, in get
    raise self._value

I wonder what the reason for this error is and how it can be fixed. Looking forward to your reply, thank you very much!

Best,

Qingwang

qingwangchen commented 3 years ago

Dear Andre,

There are also some problems with --chunked-merge.

The command I used is as follows.

# merge_graph.sh
# this script should be run in the original dir.

date
echo "merge_graph begins"

# nt=$(nproc)

spladder build -o ./spladder_out \
         -a /mnt/home_ssd/CBCGA/AS/gtf/Homo_sapiens.GRCh38.103.gtf \
         -b ./data/alignment.txt \
         --merge-strat merge_graphs \
         --parallel 18 \
         --confidence 2 \
         --readlen 150 \
         --chunked-merge 1 3 0 20 \
         --no-extract-ase

date
echo "merge_graph ends"

The task can start normally, but it does not seem to merge by level during the run. Details can be found below. Why is this?

Thu Sep 23 09:39:40 CST 2021
merge_graph begins
./spladder_out/spladder/genes_graph_conf2.187.sorted.pickle - All result files already exist.
./spladder_out/spladder/genes_graph_conf2.365.sorted.pickle - All result files already exist.
./spladder_out/spladder/genes_graph_conf2.366.sorted.pickle - All result files already exist.
./spladder_out/spladder/genes_graph_conf2.393.sorted.pickle - All result files already exist.
./spladder_out/spladder/genes_graph_conf2.405.sorted.pickle - All result files already exist.
./spladder_out/spladder/genes_graph_conf2.407.sorted.pickle - All result files already exist.
./spladder_out/spladder/genes_graph_conf2.457.sorted.pickle - All result files already exist.
./spladder_out/spladder/genes_graph_conf2.82.sorted.pickle - All result files already exist.
...
merging level 1 chunk 0 to 20
Loading ./spladder_out/spladder/genes_graph_conf2.187.sorted.pickle ...
... done (1 / 20)
Loading ./spladder_out/spladder/genes_graph_conf2.365.sorted.pickle ...
... done (2 / 20)
Processing ...
. 0/60666
. . . . . . . . . . 1000/60666
...
Loading ./spladder_out/spladder/genes_graph_conf2.366.sorted.pickle ...
... done (3 / 20)
Processing ...
. 0/60666
...
Loading ./spladder_out/spladder/genes_graph_conf2.B09478_TT.sorted.pickle ...
... done (12 / 20)
Processing ...
. 0/60666
...

Loading ./spladder_out/spladder/genes_graph_conf2.B09490_TT.sorted.pickle ...
.
Processing ...
. 0/60666
...

Loading ./spladder_out/spladder/genes_graph_conf2.B09518_TT.sorted.pickle ...
... done (14 / 921)
Processing ...
. 0/60666
...
Loading ./spladder_out/spladder/genes_graph_conf2.B10097_TT.sorted.pickle ...
... done (16 / 921)
Processing ...
. 0/60666
...
. . . . . . . . . . 60000/60666
. . . . . . ... done

I hope to get your help, thanks a lot!

Best,

Qingwang

akahles commented 2 years ago

Dear @qingwangchen,

Sorry for the late reply. The bug with the edge limit will be fixed with the upcoming release.
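For context, the traceback reported above is consistent with the --ase-edge-limit value reaching the comparison as a string rather than an int. A minimal standalone sketch of that failure mode and an interim workaround (this mirrors, but is not, SplAdder's actual option parsing):

```python
import argparse

# Hypothetical mirror of the option definition: registered without type=int,
# so argparse passes a command-line value through as a string.
parser = argparse.ArgumentParser()
parser.add_argument('--ase-edge-limit', default=500)

options = parser.parse_args(['--ase-edge-limit', '250'])
edge_limit = options.ase_edge_limit  # '250' (str), not 250
num_edges = 600                      # stands in for edges.shape[0]

try:
    num_edges > edge_limit           # same shape as the failing comparison
except TypeError as err:
    print(err)  # '>' not supported between instances of 'int' and 'str'

# Interim workaround until the release: coerce explicitly before comparing.
too_big = num_edges > int(edge_limit)
```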

Regarding your observation with the chunked merge: you need to call the different merging levels from the outside. That is, first complete the merging of level one, then the merging of level two, and so on. It is on my to-do list to provide a workflow description for this (e.g., via Snakemake) so this can be better automated.

Best,

Andre

akahles commented 2 years ago

Closing this now, please re-open if still an issue.

zyh4482 commented 4 months ago

@akahles Hi. Thanks for your kind explanation of this issue.

I met a similar situation. I'm working on a very large cohort (>10,000 samples). However, our storage resources are limited (the aligned reads for the whole cohort take ~100 TB). I have to process a few samples first, store the output, release the space, and then run the same pipeline on the remaining BAM files.

In such a case, I'm not able to provide the complete set of BAM files to SplAdder in the merge_graphs step. May I ask how to deal with this issue? Is it correct to use the single graphs with --merge-strat single and move on to quantification?

Thank you very much.