Recommendation on running MOB-recon on ~700k assemblies

Dear developers,

Thanks a lot for this tool and your ongoing support. I built a snakemake pipeline to run mob_recon on ~700k assemblies. To speed up execution without overloading the cluster, I batched the assemblies into sets of 1k, so at most ~700 mob_recon runs are active at any given time (each run is actually just a for loop that runs mob_recon serially on each of the 1k assemblies in the batch). I haven't fired the pipeline yet, as I have some concerns about the lock file. The pipeline automatically creates a conda environment containing mob-suite, so the database is already installed inside the env. From what I understand from reading https://github.com/phac-nml/mob-suite/blob/1d735b30053b45457a59c277c8d996ab86e0347c/mob_suite/mob_recon.py#L1095-L1099 and https://github.com/phac-nml/mob-suite/blob/1d735b30053b45457a59c277c8d996ab86e0347c/mob_suite/utils.py#L415-L475,

for my use case, 1 of the 700 processes will get the lock and check the database integrity, while the rest will each wait a random time between 10 and 60 seconds before retrying, giving up after 10 minutes. In the best-case scenario, where every process always waits just 10 seconds, there are only 60 retry rounds within those 10 minutes, so only about 60 processes will be able to check the database and proceed, while the other 640 will fail after waiting 10 minutes for the lock. And this is just for the 1st iteration out of the 1k in each batch, so it does not look like it will work in practice.
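To make the estimate above concrete, here is a minimal sketch of the arithmetic. It is a simplified model, not mob-suite's actual code: it assumes exactly one process gets through per retry round, and the function name and parameters are made up for illustration.

```python
def best_case_lock_outcome(n_procs=700, wait_s=10, timeout_s=600):
    """Return (succeeded, failed) under a simplified contention model.

    Assumption (not taken from mob-suite): every process waits exactly
    `wait_s` seconds between attempts, and only one process gets the
    lock per retry round before the `timeout_s` deadline.
    """
    rounds = timeout_s // wait_s          # retry rounds that fit before giving up
    succeeded = min(n_procs, rounds)      # at most one process passes per round
    return succeeded, n_procs - succeeded


if __name__ == "__main__":
    ok, failed = best_case_lock_outcome()
    print(f"{ok} processes proceed, {failed} time out")
```

With the numbers from my use case (700 processes, 10 second waits, 10 minute timeout) this gives 60 processes proceeding and 640 timing out. Longer random waits only reduce the number of rounds.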

I am wondering whether my reasoning above is correct and what you would propose for running hundreds of parallel mob_recon jobs. For now, my solution is to forcibly make the mob-suite databases directory unwritable, emulating a read-only filesystem so that the lock is never created (https://github.com/phac-nml/mob-suite/pull/89). However, I am wondering if you have a better solution, and whether a --no-lock parameter would be of interest for mob-suite.
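For reference, the workaround could look roughly like the sketch below, run once after the conda environment is created and before any mob_recon job starts. The helper name is hypothetical, the location of the databases directory depends on the installation and has to be passed in, and it only helps under the assumption that mob_recon skips the lock when that directory is not writable.

```python
import os
import stat

# All three write permission bits (user, group, other).
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_tree_read_only(root):
    """Clear the write bits on `root` and everything below it.

    Returns the number of paths whose mode was rewritten.
    Symbolic links are skipped so their targets are left untouched.
    """
    changed = 0
    # Walk bottom-up so each directory is listed before its mode changes.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            mode = stat.S_IMODE(os.stat(path).st_mode)
            os.chmod(path, mode & ~WRITE_BITS)
            changed += 1
    # Finally handle the top-level directory itself.
    root_mode = stat.S_IMODE(os.stat(root).st_mode)
    os.chmod(root, root_mode & ~WRITE_BITS)
    return changed + 1
```

Note that clearing permission bits does not stop a process running as root from writing, so this only emulates a read-only filesystem for regular users.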

Cheers
