iqbal-lab-org / pling

Plasmid analysis using rearrangement distances
MIT License

Batching #42

Closed by babayagaofficial 7 months ago

babayagaofficial commented 7 months ago

This PR implements several big changes:

  1. Batching: Pairwise jobs (such as running nucmer or calculating DCJ distances) are now run batchwise. The batch size is defined by the user (default=50), and every pair is assigned to a batch. The rules make_unimog in align_snakemake and pairwise_seq_jaccard in jac_network_snakemake, as well as the DCJ calculation, have been changed to work in batches.
  2. Sourmash prefiltering: Added (optional) prefiltering of pairs with sourmash. It calculates containment from sourmash for everything in one go, and then batches are defined using only the pairs that pass the sourmash_threshold. The sourmash_threshold is defined in terms of Jaccard similarity, so 1-containment. The default is currently 0.85, to complement the default jaccard_similarity threshold of 0.6. The default was chosen by staring at this graph (figure: sourmash vs nucmer containment, smash_cor).

Sourmash tends to underestimate, so I chose a threshold that looked like it wouldn't accidentally throw out pairs that actually do meet the Jaccard threshold.
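The prefilter-then-batch flow described above can be sketched in a few lines of Python. This is an illustrative mock-up, not pling's actual code: `assign_batches`, `prefilter`, and the precomputed `containment` dict are hypothetical stand-ins for the sourmash step and the Snakemake batching logic.

```python
from itertools import combinations

def prefilter(pairs, containment, threshold=0.85):
    """Keep only pairs whose sourmash-style containment meets the threshold.

    `containment` is a dict mapping a pair to a precomputed value,
    standing in for the all-in-one-go sourmash run."""
    return [p for p in pairs if containment[p] >= threshold]

def assign_batches(pairs, batch_size=50):
    """Split the surviving pairs into consecutive batches of a
    user-defined size (default 50, as in the PR)."""
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]

# Toy example: 4 plasmids -> 6 pairwise jobs, all passing the filter.
plasmids = ["p1", "p2", "p3", "p4"]
pairs = list(combinations(plasmids, 2))
containment = {p: 0.9 for p in pairs}  # toy values, not real sourmash output
batches = assign_batches(prefilter(pairs, containment), batch_size=2)
# each batch is then handled by one Snakemake job instead of one job per pair
```

Each batch then maps to a single rule invocation, which is where the reduction in job count comes from.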

  3. Reduction of intermediary and output files: make_unimogs now outputs one unimog and one mapping file per batch. For DCJ calculation, creating the ILP from a unimog, running a solver, and outputting the distance are joined into a single rule that runs each step sequentially for a batch, i.e. rules glpk_and_ding and gurobi_and_ding. These rules output a single file: a tsv of the DCJ distance for each pair in the batch. DCJ calculation used to also output a matching for every pair; I decided not to output these anymore, as we've never used them in any subsequent analysis, and I doubt many users would be interested. DCJ calculation creates a lot of intermediary files (an ILP file and a solution file for each pair). For GLPK there doesn't seem to be any way to avoid this, so I've mitigated it by making glpk_and_ding a shadow rule. For Gurobi I've managed to avoid intermediary file creation entirely by using Ding functions directly together with the Gurobi Python API. Frankly, it's a bit convoluted and it's the change I'm most worried about breaking things, but it has passed all the tests I've thrown at it so far, and it should lessen the stress on the cluster by a fair bit.
  4. Final output is just a tsv; there is no longer a distance matrix.
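To illustrate the joined per-batch rule, here is a minimal sketch of what "compute each pair's distance and emit one tsv per batch" looks like. The function names and the `dcj_distance` callback are hypothetical; the callback stands in for the ILP-build-and-solve step that glpk_and_ding / gurobi_and_ding perform internally.

```python
import csv
import io

def run_batch(batch, dcj_distance):
    """For one batch of plasmid pairs, compute each pair's DCJ distance
    and return the contents of a single tsv (one row per pair), mirroring
    the one-file-per-batch output described in the PR."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["plasmid_1", "plasmid_2", "distance"])
    for a, b in batch:
        # dcj_distance is a stand-in for the solver-backed calculation
        writer.writerow([a, b, dcj_distance(a, b)])
    return out.getvalue()

# Toy usage: a batch of two pairs with a dummy distance function.
tsv = run_batch([("p1", "p2"), ("p1", "p3")], lambda a, b: 0)
```

Writing one tsv per batch (instead of per-pair matchings and solution files) is what cuts the intermediary-file count.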

Also, regarding merging batching into main: even on my toy tests with 6 plasmids of 3 or 4 genes each, batching seems to speed things up at least a little (I haven't measured this rigorously, but it feels faster). So it seems like the right approach even for smaller datasets, and we should just always do batching.

Closes #26