Continue canceled job without calculating the distances again

mihkelvaher commented 4 years ago

Is your feature request related to a problem? Please describe. I stopped the tree-building process and would like mashtree to continue where it left off. Re-creating the msh files is skipped, but what about distances already added to the SQLite database? If mashtree already checks for existing entries in the database, then I have some other issue and this one can be closed.

Describe the solution you'd like If distances.sqlite exists, skip all of the comparisons that are already done. This is probably a bit tricky because the other genomes are given in mshList.txt.

Describe alternatives you've considered 1) Dump the whole database, check which msh files have already been compared to all other msh files, and remove those lines from mshList.txt. 2) Start again: delete the database but keep the msh files, and hope no further interruptions (currently due to optimization) are needed.
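Alternative 1 could be sketched roughly as below. This is only an illustration under assumptions: the table name `DISTANCE` and the columns `GENOME1` and `GENOME2` are guesses about the database layout, not something confirmed in this thread, and the names stored in the database are assumed to match the names passed in. `finished_genomes` is a hypothetical helper.

```python
import sqlite3

def finished_genomes(db_path, genomes):
    """Return the genomes that already have a stored distance to every other genome."""
    genomes = list(genomes)
    wanted = set(genomes)
    seen = {g: set() for g in genomes}
    con = sqlite3.connect(db_path)
    try:
        # ASSUMPTION: table DISTANCE with columns GENOME1 and GENOME2.
        for g1, g2 in con.execute("SELECT GENOME1, GENOME2 FROM DISTANCE"):
            if g1 in wanted and g2 in wanted:
                # Treat a stored distance as covering both directions.
                seen[g1].add(g2)
                seen[g2].add(g1)
    finally:
        con.close()
    # A genome is finished when its partners include every other genome.
    return [g for g in genomes if seen[g] >= wanted - {g}]
```

The genomes returned here would be the candidates whose lines could be dropped from mshList.txt before restarting.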

Additional context I'm running mashtree on an HPC with 15k bacterial genomes and trying to optimize the resource allocation. The latest issue is mashtree grinding to a halt, (probably) because the mash subprocesses are using up all 1024 file descriptors (handles).
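To check whether a 1024 descriptor limit applies to a process, the limit can be read (and the soft limit raised up to the hard limit) with Python's standard `resource` module on Unix-like systems. This is a minimal sketch; the helper names are hypothetical, and it only affects the current process and the children it starts afterwards.

```python
import resource

def fd_limits():
    """Return (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def raise_soft_fd_limit(wanted):
    """Raise the soft limit towards `wanted`, never above the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    # Nothing to do if the soft limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or wanted <= soft:
        return soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return wanted
```

A soft limit of 1024 reported by `fd_limits()` would match the figure mentioned above.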

lskatz commented 4 years ago

@mihkelvaher are you running this with 1024 threads? It seems like 1024 disk I/O processes could be your bottleneck. Mash dist has been very efficient, so I haven't had any big reason to optimize this step.

mihkelvaher commented 4 years ago

The most I've run it with was 40 threads, and using lsof I saw ~1300 pipes/files that were currently being read from or written to. BUT this does not seem to be the problem. Running on only 8 cpus while trying to continue, the cpu usage also started to drop (this is from a previous job with the same behaviour):

[Image: Screenshot 2019-12-11 at 16 00 08]

Using 10 cpus and deleting distances.sqlite, I managed to get to the step mashtree: mashDistance: Converting to phylip format into .... CPU usage is ~100% and memory usage is slowly increasing. Hopefully, this one finishes.

Currently, both the 8 cpu and 10 cpu jobs are running, although the 10 cpu job started some hours later. The last line the 8 cpu job logged, over 12h ago, was mashtree: mashDistance: Databasing distances (1/8, TID9), while the 10 cpu job has already finished that step, as mentioned above. Interestingly, the 8 cpu job has modified the existing database, but I can't tell whether anything has actually been added because the file size was similar at the beginning.

For now, it seems that if a job needs to be continued (at least a large one), distances.sqlite should be deleted and the distance step started from scratch. Note that a 10k genome job finishes on a non-HPC server with 10 cpus if not interrupted (I haven't tried interrupting it), so this is probably not a volume issue.
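File size is a poor indicator of whether rows are still being added. One way to check would be to count the rows in the distance table at two points in time. The sketch below assumes a table named `DISTANCE`, which is a guess rather than a confirmed detail of the database, and opens the file read-only so a running job is not disturbed. Both function names are hypothetical.

```python
import sqlite3
from pathlib import Path

def count_distance_rows(db_path):
    """Count stored distances without opening the database for writing."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        # ASSUMPTION: the distances live in a table called DISTANCE.
        (n_rows,) = con.execute("SELECT COUNT(*) FROM DISTANCE").fetchone()
    finally:
        con.close()
    return n_rows

def unordered_pairs(n_genomes):
    """Number of distinct genome pairs, as a reference figure for the row count."""
    return n_genomes * (n_genomes - 1) // 2
```

For scale, `unordered_pairs(15000)` is 112,492,500. Whether the table stores each pair once or in both directions is not stated here, so that figure is only a rough yardstick.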

lskatz commented 4 years ago

VERY interesting graph. I still can't figure out what causes the slowdown, and I am sorry that it happens. I have an idea of what could be done, but honestly I cannot address it right now without going too deep into the guts.

mihkelvaher commented 4 years ago

Currently, there's no rush because the mentioned 10 cpu job finished today! :)

As sketching is quite fast and can be skipped if the job is later continued, the temporary workaround to get past this point is to delete the distances database. As the resource needs are more or less constant, any problems (reasons to cancel and later continue) should appear at the beginning, when only a few distances have been computed. Therefore, starting the distance step over does not cost much time. Just got to let mashtree do its thing.
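The workaround could be scripted along these lines. It is a sketch under assumptions: `reset_distances` is a hypothetical helper, the database is assumed to sit directly in the temporary directory under the name distances.sqlite, and the sketches are assumed to be the `*.msh` files somewhere below that directory.

```python
from pathlib import Path

def reset_distances(tempdir, db_name="distances.sqlite"):
    """Delete the distance database but leave the msh sketch files in place."""
    tempdir = Path(tempdir)
    db = tempdir / db_name  # ASSUMPTION: database location and file name
    removed = db.exists()
    if removed:
        db.unlink()
    # Report the sketches that remain so they can be reused on the next run.
    kept = sorted(p.name for p in tempdir.rglob("*.msh"))
    return removed, kept
```

The returned list makes it easy to confirm that the sketches survived before the job is restarted.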

lskatz / mashtree

Continue canceled job without calculating the distances again #50