qmode collect step for large cohorts

kate-stankiewicz commented 2 years ago

spladder version: 3.0.3
Python version: 3.9.12
Operating System: Ubuntu 20

Description

I have followed the steps for running SplAdder on large cohorts as outlined here https://spladder.readthedocs.io/en/latest/spladder_cohort.html

However, I am not sure what the second phase in the quantification step is doing? On the documentation page is says "As a second step to this phase, we need to collect the individual quantifications and aggregate them in a joint database:". However, when I run this step I get no output and no log information either (even if run in -v mode). Additionally, it only runs for about 30 seconds (all other steps in the workflow run for at least 30 mins, if not days). From what I can tell, no new files are created, nor are any existing files modified. The subsequent steps (event calling and testing) seem to run despite this. Is the step doing anything that might not be visible to me? Or is it possible to skip it?

Many thanks!

What I Did

# Here I use a job array to run each of the 83 samples in parallel
spladder build -o ${workdir}/array_spladder_out -a ${workdir}/annotation/genome.annotation.gff -b ${1} --merge-strat single --no-extract-ase --parallel 2 -v

# next merge the splice graphs
spladder build -o ${workdir}/array_spladder_out -a ${workdir}/annotation/genome.annotation.gff -b ${workdir}/bams.txt --merge-strat merge_graphs --no-extract-ase --parallel 40 -v

# next run quantification for each sample separately using a job array again
spladder build -o ${workdir}/array_spladder_out -a ${workdir}/annotation/genome.annotation.gff -b ${1} --merge-strat merge_graphs --no-extract-ase --quantify-graph --qmode single --parallel 2 -v

# aggregate them into a joint database (this seems to do nothing and produce no output/logs)
spladder build -o ${workdir}/array_spladder_out -a ${workdir}/annotation/genome.annotation.gff -b ${workdir}/bams.txt --merge-strat merge_graphs --no-extract-ase --quantify-graph --qmode collect --parallel 40 -v

# call events
spladder build -o ${workdir}/array_spladder_out -a ${workdir}/annotation/genome.annotation.gff -b ${workdir}/bams.txt --parallel 40 -v

# run 30 different tests comparing different groups of samples (again using a job array to automate running each comparison)
spladder test -o ${workdir}/array_spladder_out --out-tag Rem_Prob --conditionA ${workdir}/testing_contrasts/contrast_files/sym_${1}_${3}.txt --conditionB ${workdir}/testing_contrasts/contrast_files/sym_${2}_${3}.txt --labelA ${1}${3} --labelB ${2}${3} --diagnose-plots -v --parallel 5

akahles commented 1 year ago

Dear @kate-stankiewicz ,

sorry for the late reply. I sometimes fall behind on answering issues due to other higher priority items.

To refer to your individual steps above, I will refer to them numbered from 1 to 6.

In step 1, you generate a single splicing graph per input bam file. In the same step, you quantify the single graphs in that sample, as the --quantify-graph options is set by default. What you would need in this case is to set the --no-quantify-graph option. This will simply generate the individual sample graphs. If you don't have the --no-quantify-graph option, then the single graphs are generated in any case, but also the quantifications per single bam file for each respective graph (which you might not want).

In step 2, you merge the single sample graphs into a joint graphs. That is, all events detected in all samples will be merged per gene. Also here, you would like to add the --no-quantify-graph option, as otherwise the quantifications for the merged graph are done per sample, but in a sequential manner. (This is what you would like to parallelise in step 3.)

In step 3, you will now quantify the merged graph with each single bam file. That is, also events detected in only one sample will get quantified in all samples. This is sometimes useful, as the criteria for adding a new event to the splicing graph are quite stringent. So you might not have enough reads in a sample to add a new event, but you have enough to quantify it, if it was detected in another sample. If you do this in parallel over samples, you can speed up this process depending on how many samples you run in parallel.

In step 4, you now collect the quantifications from step 3. This should generate genes_graph_conf3.merge_graphs.count.hdf5 and genes_graph_conf3.merge_graphs.gene_exp.hdf5 (assuming you run with default confidence level 3). However, in your above run, these files were already generated by step 2. Hence nothing else needed to be done. I agree, that the verbose mode could indeed be a bit more (or at all) verbose about this ...

I hope this answers your question. Please let me know if you have additional ones.

Best, Andre

kate-stankiewicz commented 1 year ago

Hi Andre,

Thank you for this informative and in depth explanation--it is really helpful! So if I am understanding correctly, the results between running these steps in the way that I have and running them as you outlines here should not change, correct? The only thing I have missed out on is better parallelization in delaying the graft quantification until a later step?

Also, to possibly save you having to address a similar issue from another user, the --no-quantify-graph option is not listed in the steps for use on large cohorts: https://spladder.readthedocs.io/en/latest/spladder_cohort.html

Only the --no-extract-ase option is listed in the earlier steps. Just wanted to let you know since it seems to be important to include and makes sense as you explained it above.

akahles commented 1 year ago

Hi @kate-stankiewicz ,

Thanks for the hint with the documentation. I will fix this right away. Regarding you statement about completeness - yes, this is true. It only took longer than necessary, but the outputs are the same.

Best,

Andre

kate-stankiewicz commented 1 year ago

Hi @akahles ,

Thanks for the clarification! I will close this issue now.

ratschlab / spladder

qmode collect step for large cohorts #180

Description

What I Did