statisticalbiotechnology / quandenser

QUANtification by Distillation for ENhanced Signals with Error Regulation
Apache License 2.0

Scaling of runtime #10

Closed · andrewjmc closed 4 years ago

andrewjmc commented 4 years ago

Hello,

Can you give me an idea of how the runtime might vary according to the number of samples/features/spectra?

I note that the RT alignments are done pairwise, so I imagine that processing here may scale quadratically with the number of files?

I am currently running 54 files, each with ~150k features and ~70k MS2 spectra. I've got 48 cores and 124 GB of RAM, but I'm wondering if it might exceed the 24-hour batch job limit!

Thanks,

Andrew

MatthewThe commented 4 years ago

Hi Andrew,

TL;DR: It should run within 24 hours

We actually construct a minimum spanning tree for the pairwise alignments, so the runtime should technically scale linearly rather than quadratically. There are some parts that scale quadratically, but I don't think those will be a problem for 54 files.
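To illustrate the scaling difference, here is a minimal sketch (Python with scipy, not Quandenser's actual implementation; the pairwise "alignment cost" matrix is a hypothetical stand-in): a minimum spanning tree over 54 runs needs only 53 edge alignments, versus 1431 for all pairs.

```python
# Why an MST makes pairwise alignment scale linearly: N files need only
# N - 1 tree edges instead of all N * (N - 1) / 2 pairs.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

n_files = 54
rng = np.random.default_rng(0)

# Hypothetical symmetric "alignment cost" between every pair of runs.
cost = rng.random((n_files, n_files))
cost = np.triu(cost, k=1)  # keep the upper triangle; zeros mean "no edge"

mst = minimum_spanning_tree(cost)
print("all-pairs alignments:", n_files * (n_files - 1) // 2)  # 1431
print("MST alignments:      ", mst.nnz)                       # 53
```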

For "vanilla" Quandenser, I once did some calculations that came down to about 20 minutes per file on a 4-core machine, so this would result in about 18 hours for 54 files. However, I take it that you're using the Singularity container, which contains some extra parallelizations, such as running alignments that are independent in parallel. This can greatly reduce the runtime, depending on the layout of the minimum spanning tree, the speedup can even be proportional to the number of "forks" (params.parallel_quandenser_max_forks in the nf.config file) as most parallelizations are embarrassingly parallel.

andrewjmc commented 4 years ago

Great, thanks. I'll try the parallel quandenser mode. Will these parallelizations be incorporated into standalone quandenser?

Currently running, although it seems to have stalled on the 23rd file during the dinosaur step. I'll keep an eye on it.

MatthewThe commented 4 years ago

We don't have plans to include these parallelizations in standalone quandenser, though if you have a specific need for it, we can certainly look into it.

Strange that it's stalling with dinosaur; this used to happen when we assigned too little RAM to the JVM (--dinosaur-memory), but the default of 24 GB should be plenty. How many forks did you use? I guess you can also run out of memory if you use more than 4 forks on your 124 GB RAM machine.
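For a concrete sense of the memory budget, a back-of-the-envelope check (assuming each fork runs its own JVM at the 24 GB --dinosaur-memory default):

```python
# Fork count should satisfy: forks * per-JVM heap <= total RAM.
total_ram_gb = 124
per_fork_jvm_gb = 24  # default --dinosaur-memory

print(total_ram_gb // per_fork_jvm_gb)
# 5, but leaving headroom for the OS and other processes
# matches the ~4-fork ceiling suggested above.
```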

andrewjmc commented 4 years ago

Just worked it out -- the particular mzML was truncated because msconvert seems unable to convert it (it hangs reproducibly). Rerunning now without it.

I guess I'll always say "yes please" to forking parallelisation if it might speed something up on a large machine... however, if the individual tasks already parallelise well across 48 threads, forking might not add much.

Thanks again,

Andrew

andrewjmc commented 4 years ago

I've got it running successfully in singularity-pipeline with the offending mzML removed (and another removed because dinosaur was crashing in targeted mode; I'll try to solve that another time). It's taking longer than I anticipated in the percolator rounds (100 GB of RAM with 48 cores), and I think I'll hit the 24-hour job limit.

Some of the sample pairings took only 5-7 minutes, but others are taking over 30 minutes (and there are just over a hundred pairings). I note that some of the psms.pout.dinosaur_targets.tsv files have only a few hundred thousand lines, while others have up to 5 million. The distribution of sizes seems almost bimodal. Is this expected?
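In case it's useful, a small diagnostic sketch for tallying the line counts of these files (the "quandenser_output" search root is a guess at the output layout):

```python
# Count lines in every psms.pout.dinosaur_targets.tsv under the output
# directory to see the (seemingly bimodal) size distribution.
from pathlib import Path

counts = []
for tsv in sorted(Path("quandenser_output").rglob("psms.pout.dinosaur_targets.tsv")):
    with tsv.open() as f:
        n = sum(1 for _ in f)
    counts.append(n)
    print(f"{tsv}\t{n:,} lines")

if counts:
    print("min/max:", min(counts), max(counts))
```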

Also, I moved the mzMLs to the node's filesystem but left the output directory on my (slower) personal filesystem. This is handy because if the job is terminated, I still have partial output. Do you expect this to slow execution much?

Thanks again for your help,

Andrew

andrewjmc commented 4 years ago

Just had a brainwave -- could my long runtime be due to max-missing being set to 100%? This is vital for my approach (seeking features unique to a group, which may be present in only a handful of samples).

If so, any tips for accelerating? How much of an issue is filesystem latency going to be (my quandenser_pipeline output directory is on shared storage, so I keep the files even if the job terminates -- our implementation of PBS does not support staging)?

MatthewThe commented 4 years ago

The max-missing 100% should not be the problem; this filter is only applied at the last stage, after all alignments have been completed.
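A hedged sketch of what such an end-stage filter amounts to (the feature representation and function are hypothetical, not Quandenser's code): consensus features observed in too few samples are dropped only after all alignments are done, so with max-missing at 100% nothing extra happens during the alignments themselves.

```python
# Keep a consensus feature only if its fraction of missing samples is
# at or below the max-missing threshold.
def passes_max_missing(intensities, max_missing_pct=100.0):
    """intensities: per-sample values, None where the feature is missing."""
    n_missing = sum(v is None for v in intensities)
    return 100.0 * n_missing / len(intensities) <= max_missing_pct

print(passes_max_missing([1.2, None, None, 3.4], max_missing_pct=50.0))  # True
print(passes_max_missing([None, None, None, 3.4], max_missing_pct=50.0))  # False
```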

It is expected that the alignments become gradually slower and that the psms.pout.dinosaur_targets.tsv files become larger, as more and more features are considered the higher up the alignment tree we get (and even more so when we go down the tree again). I did not expect this to slow down processing as much as it did in your case, but apparently I underestimated the effect. Do you expect to process even larger datasets than the one you're looking at right now? I had some ideas on filtering out "hopeless" features in intermediate alignments, but never had a need to actually implement it; this might be a good test case.
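To make that growth concrete, here is a toy model (one reading of the explanation above, not Quandenser's actual bookkeeping): if each alignment along the tree carries forward roughly the union of the features seen so far, the working set grows with every merge, so alignments near the root handle far more features than the first leaf-level ones.

```python
# Toy model of feature-set growth along the alignment tree.
per_file_features = 150_000
shared_fraction = 0.6  # hypothetical overlap between runs

working_set = per_file_features
for n_merges in range(1, 7):
    # Each merge adds the non-overlapping part of another subtree.
    working_set += int(per_file_features * (1 - shared_fraction))
    print(f"after {n_merges} merges: ~{working_set:,} features in the alignment")
```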

andrewjmc commented 4 years ago

OK, this is helpful to understand. Thanks.

I have three batches of MS runs. This is a small batch (53 RAW files), but the files are large (around 2 GB), with ~70k MS2s each and many features (deliberately large protein load, as I'm interested in detecting low-abundance features rather than quantitation).

A subsequent batch is the same samples run in a different lab, with lower protein load, smaller files (around 1 GB), and fewer MS2s (around 25k). The final batch is a new set of samples with similar run characteristics, but more of them (~200).

So I am anticipating increases in runtime!