leoisl closed this issue 2 years ago
In version 3.1.0 I have removed the lock file functionality entirely. The danger is that when NCBI updates its taxonomy, a cascade of jobs may try to update the taxonomy database simultaneously. This is an issue with ETE3, and I have put in a request to change this behaviour. When I submitted large-scale jobs with the lock file in place, I put a sleep between job submissions until the cluster was saturated, and found that in practice I rarely hit collisions. However, v3.1.0 may be a better solution for you, and it will be released in the next week or so.
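For readers who want to try the staggered-submission approach described above, here is a minimal sketch. The function name `submit_staggered`, its parameters, and the use of `subprocess.Popen` as the default submitter are assumptions for illustration, not part of mob-suite or of any cluster scheduler. It also simplifies the described approach by sleeping between every submission rather than only until the cluster is saturated.

```python
import subprocess
import time
from typing import Callable, Iterable, List, Sequence


def submit_staggered(
    commands: Iterable[Sequence[str]],
    delay_s: float,
    submit: Callable[[Sequence[str]], object] = subprocess.Popen,
    sleep: Callable[[float], None] = time.sleep,
) -> List[object]:
    """Submit each command in turn, pausing delay_s seconds between submissions.

    `submit` and `sleep` are injectable so the function can be exercised
    without launching real jobs (hypothetical design, for illustration only).
    """
    handles = []
    for index, command in enumerate(commands):
        if index > 0:
            # Pause before every submission except the first, so that jobs
            # do not all reach the lock check at the same moment.
            sleep(delay_s)
        handles.append(submit(command))
    return handles
```

In practice `submit` would wrap whatever the cluster uses to queue a job, and `delay_s` would be tuned to how long a single lock check takes on that system.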
Dear developers,
Thanks a lot for this tool and your ongoing support. I built a `snakemake` pipeline to run `mob_recon` on ~700k assemblies. To speed up execution without overloading the cluster, I batched the assemblies in sets of 1k, so at most ~700 `mob_recon` runs are active at any given time (each run is actually just a `for` loop running `mob_recon` serially on each of the 1k assemblies in the batch). I haven't fired the pipeline yet, as I have some concerns about the lock file. The pipeline automatically creates a `conda` environment containing `mob-suite`, so the database is already installed inside the env. From what I understand reading https://github.com/phac-nml/mob-suite/blob/1d735b30053b45457a59c277c8d996ab86e0347c/mob_suite/mob_recon.py#L1095-L1099 and https://github.com/phac-nml/mob-suite/blob/1d735b30053b45457a59c277c8d996ab86e0347c/mob_suite/utils.py#L415-L475, for my use case 1 of the 700 processes will get the lock and check the database integrity, while the rest will wait a random time between 10 and 60 seconds before retrying, giving up after 10 minutes. In the best scenario, where every process always waits just 10 seconds, there are only 600 / 10 = 60 retry rounds, so only 60 processes will be able to check the database and proceed, while the other 640 will fail after waiting 10 minutes for the lock. And this is just for the 1st iteration out of the 1k in the batch, so it does not look like it will work in practice.
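The best-case estimate above can be written out as a short calculation. This is only a sketch of the reasoning in this post: it assumes that exactly one process acquires the lock per retry round, which is my model of the behaviour and has not been checked against what `mob-suite` actually does. The names `max_lock_acquisitions`, `N_PROCESSES`, `TIMEOUT_S`, and `MIN_WAIT_S` are hypothetical.

```python
import math


def max_lock_acquisitions(timeout_s: float, min_wait_s: float) -> int:
    """Number of retry rounds that fit in the timeout at the shortest wait.

    Assumption: one process acquires the lock per retry round.
    """
    return math.floor(timeout_s / min_wait_s)


N_PROCESSES = 700  # concurrent mob_recon runs
TIMEOUT_S = 600    # processes give up after 10 minutes
MIN_WAIT_S = 10    # shortest wait between retries

proceeding = min(N_PROCESSES, max_lock_acquisitions(TIMEOUT_S, MIN_WAIT_S))
failing = N_PROCESSES - proceeding
print(proceeding, failing)  # 60 640
```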
I am wondering whether my reasoning above is correct, and what you would propose for running hundreds of parallel `mob_recon` jobs. For now, my solution is to forcibly make the `mob-suite` databases directory unwritable, emulating a read-only filesystem to avoid creating the lock (https://github.com/phac-nml/mob-suite/pull/89). However, I am wondering if you have a better solution, and whether a `--no-lock` parameter would be an interesting addition to `mob-suite`.
Cheers
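For reference, a minimal sketch of the read-only workaround mentioned above, using only the Python standard library. The helper name `make_dir_read_only` is hypothetical. Whether `mob-suite` then skips creating the lock depends on the change proposed in the linked pull request, so treat this as an illustration of the permission change only. Note also that a process running as root can still write to the directory.

```python
import os
import stat


def make_dir_read_only(path: str) -> int:
    """Clear the write bits for user, group, and others on `path`.

    Returns the new permission bits. Read and execute bits are left
    untouched, so the directory can still be listed and read.
    """
    current_mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    new_mode = current_mode & ~write_bits
    os.chmod(path, new_mode)
    return new_mode
```

The function would be called once on the databases directory inside the environment, after the databases have been installed and before any jobs start.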