luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

Running Octopus on a Cluster #83

Closed JesseDGibson closed 4 years ago

JesseDGibson commented 4 years ago

Hi, I was wondering if you had any recommendations for running Octopus on a cluster? To take advantage of multithreading, is it best to run Octopus as a single command on a single node with many cores? Or could you break up calling into multiple commands running on different nodes, each calling a different region, or would the model not work as well when it sees only a subset rather than the entire sequenced region?

dancooke commented 4 years ago

Under most circumstances I'd recommend using Octopus' built-in multithreading. Whether to use a distributed computing approach (i.e. splitting over multiple nodes) really depends on the specification of your cluster and your intended workload. If your individual nodes are relatively low-spec but you have many of them, then you may find some benefit from distributing jobs across nodes. On the other hand, if you intend to run many jobs in parallel (such that the cluster would be full anyway), then this advantage will probably disappear. As is usually the case with 'which gives better performance?' questions, the only way to know for sure is to benchmark.
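As a rough illustration of what such a benchmark could look like, here is a minimal Python sketch that times competing command lines and keeps the best wall-clock time for each. The helper `time_command` and the `candidates` dictionary are hypothetical names, and the commands are placeholders that only start the Python interpreter; they would need to be replaced with the real invocations being compared.

```python
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Run cmd (a list of arguments) `repeats` times; return wall-clock times in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # raises if the command fails
        timings.append(time.perf_counter() - start)
    return timings


# Placeholder commands (assumption): swap in the actual command lines to compare,
# e.g. one multithreaded run versus one of several per-region runs.
candidates = {
    "single node, built-in threads": [sys.executable, "-c", "pass"],
    "one job per region": [sys.executable, "-c", "pass"],
}

# Keep the fastest of the repeats to reduce noise from other cluster activity.
results = {name: min(time_command(cmd)) for name, cmd in candidates.items()}

for name, seconds in results.items():
    print(f"{name}: {seconds:.3f} s")
```

For a genuinely distributed run, the figure that matters is the time until the slowest node finishes plus any scheduling and merging overhead, so per-job timings like these are only a starting point.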

In terms of call consistency, it's not guaranteed that a multithreaded command will produce identical calls to a single-threaded command. Nor is it guaranteed that calls will be identical if the amount of memory resources changes. This is because these options determine how the input calling regions are internally split into sub-jobs, and due to the haplotype-based nature of the algorithm, this can result in boundary artefacts. You'll find the same issue if you manually split calling regions, but then you lose some protections against boundary artefacts that Octopus employs when it does the splitting itself. In addition, a subsample of the input calling regions is used to calculate various statistics about the input reads (e.g. median depth), so if you manually split these regions then these statistics may differ across commands, creating another source of inconsistency. In summary, although neither approach is guaranteed to give calls consistent with a single-threaded command run over all calling regions, using Octopus in multithreaded mode makes inconsistency less likely.
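To make the point about read statistics concrete, here is a toy Python sketch. It is not how Octopus computes anything internally; the depth profile, the helper `split_region`, and the choice of four chunks are all made up for illustration. It only shows that a summary such as median depth can differ between the whole region and manually split pieces, and that each split introduces internal boundaries.

```python
from statistics import median


def split_region(start, end, n_chunks):
    """Split the half-open interval [start, end) into n_chunks contiguous chunks."""
    size, rem = divmod(end - start, n_chunks)
    chunks = []
    pos = start
    for i in range(n_chunks):
        length = size + (1 if i < rem else 0)  # spread any remainder over the first chunks
        chunks.append((pos, pos + length))
        pos += length
    return chunks


# Made-up per-position depth: 30x over the first half, 60x over the second half.
depth = [30] * 500 + [60] * 500

# Statistic as seen by a single command covering the whole region.
whole_region_median = median(depth)

# Statistic as seen by four separate commands, each given one chunk.
chunks = split_region(0, len(depth), 4)
per_chunk_medians = [median(depth[s:e]) for s, e in chunks]

# Each internal chunk end is a position where haplotypes could be cut.
internal_boundaries = [end for _, end in chunks[:-1]]

print("whole region:", whole_region_median)
print("per chunk:   ", per_chunk_medians)
print("boundaries:  ", internal_boundaries)
```

In this toy case no single chunk sees the same median as the whole region, so any parameter derived from it would differ between the split commands and the single command.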

JesseDGibson commented 4 years ago

Awesome, thanks for the very thorough response! I think I'll try to stick to using the built-in multi-threading