Parallelism on HPC clusters

rpep commented 3 years ago

Hi there,

We're advising a user of our HPC cluster at the University of Birmingham who is trying to run FlashFry on a large dataset. Is the code parallelised at all? We'd like to advise them on what resources to request from the cluster scheduler. If it is parallelised, have you done any scaling studies against the number of cores?

Best wishes, Ryan

aaronmck commented 3 years ago

Hi Ryan,

For a cluster, the best approach is to slice the input fasta into separate files and run FlashFry on each slice. If there are natural breakpoints, i.e. the file already has multiple contigs, this is trivial and can be run on as many nodes as you'd like. If there's no natural breakpoint, make sure there's a slight overlap (a CRISPR target length + 1) between adjacent fasta slices, so that targets spanning a cut point aren't lost. The resulting tables then have to be merged at the end. Let me know if this doesn't answer the question. Good luck!
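As a rough illustration of the slicing step described above, here is a minimal Python sketch. It is not part of FlashFry: the function names, the output file naming, and the `TARGET_LEN = 23` value (a 20 nt protospacer plus a 3 nt PAM, as for SpCas9) are all assumptions made for the example, so adjust the target length to the enzyme actually in use. Each slice records its start offset in the header so coordinates can be shifted back when the per-slice tables are merged.

```python
from pathlib import Path
from typing import Iterator, List, Tuple

# Assumption for illustration only: 20 nt protospacer + 3 nt PAM.
TARGET_LEN = 23


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) for each record in FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = (line[1:].split() or ["unnamed"])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def slice_with_overlap(seq: str, slice_len: int, overlap: int) -> List[Tuple[int, str]]:
    """Cut seq into (start, subsequence) pieces; neighbours share `overlap` bases."""
    if slice_len <= overlap:
        raise ValueError("slice_len must be larger than overlap")
    step = slice_len - overlap
    slices, start = [], 0
    while True:
        slices.append((start, seq[start:start + slice_len]))
        if start + slice_len >= len(seq):
            break
        start += step
    return slices


def write_slices(fasta_text: str, out_dir: str, slice_len: int,
                 overlap: int = TARGET_LEN + 1) -> List[Path]:
    """Write one FASTA file per slice, named <record>_<start offset>.fasta."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, seq in read_fasta(fasta_text):
        for start, sub in slice_with_overlap(seq, slice_len, overlap):
            path = out / f"{name}_{start}.fasta"
            path.write_text(f">{name}_{start}\n{sub}\n")
            paths.append(path)
    return paths
```

Each resulting file could then be submitted as its own scheduler job. Because neighbouring slices overlap, a target lying entirely inside an overlap may be reported by both slices, so the merge step would probably need to de-duplicate after shifting coordinates by the start offset.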

rpep commented 3 years ago

Thanks! That's excellent - I'll pass this feedback on to the user and hopefully it will make sense to them.

Best wishes, Ryan

mckennalab / FlashFry, issue #23