Hi Andrew,
I apologise for the very late response.
Yes, I added a way to resume runs. In the third tab, on the right side of the GUI, there is a button that should say "resume directory" or something similar. The directory to choose there is the "Quandenser_output" directory of the run you want to resume.

It can be a bit tricky to find the right directory, but the one you are looking for should be mentioned in stdout.txt or sit in your output directory. If you ran the pipeline in non-parallel mode, it will be inside the work directory (work/XX/XXXXXXXXX...); in parallel mode, if the output is set to be "published" in the third tab, Quandenser_output should be in your output directory.

Note that if you have changed any files or labels in the GUI, you should re-select the files according to file_list.txt in the output directory, which should give you some information about the files you used.
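If it helps, something like this should turn up candidate directories from the command line (a sketch; it assumes the log file is called stdout.txt and that you launched the run from the current directory):

```
# The path to Quandenser_output is usually printed somewhere in the log
grep -n "Quandenser_output" stdout.txt

# In non-parallel mode it sits inside one of the Nextflow work directories
find work -maxdepth 3 -type d -name "Quandenser_output"
```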
No apologies -- you should have a weekend!
Great -- still running from command line, but investigating this.
I delved into the Quandenser source code to see how "granular" the rerunning is [the present job failed, due to an HPC problem outside my control, at the cluster consensus step near the end], and it seems that none of the time-consuming steps would be rerun.
However... for speed (I believe), my job script copied all the mzMLs to the node's TMPDIR before running. Therefore, the work directory features an mzML directory full of broken links, and the list.txt file points to a non-existent TMPDIR.
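For reference, this is roughly how I confirmed the damage (a sketch; the work-directory path is a hypothetical placeholder, and -xtype needs GNU find):

```
RUN_DIR=work/XX/XXXXXXXX   # hypothetical placeholder for the failed run's work directory

# Symlinks whose targets no longer exist (the mzMLs that lived in TMPDIR)
find "$RUN_DIR/mzML" -xtype l

# list.txt still points at the old, now non-existent TMPDIR
cat "$RUN_DIR/list.txt"
```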
I fear destroying all the hard work done so far with a failed rerun. How do you advise proceeding?
Best wishes,
Andrew
I guess I'm hoping that if I rerun, the file list will get overwritten and the symbolic links regenerated (though I can't see where in quandenser or quandenser_pipeline these are created), and everything will be fine. Copying the work directory first and trying...!
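For anyone following along, the backup step is just (a sketch):

```
# -a preserves symlinks as symlinks, so the copy reflects the broken-link
# state exactly and the original stays safe if the resume goes wrong
cp -a work work_backup
```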
I'm running it now, and notice it creates a new output directory -- is that expected? Otherwise, so far so good, with many "Already processed" messages!
Best wishes,
Andrew
However, it has now appeared to stall after "Deserialized spectrum to precursor map", with 100% usage of a single core but no further output. I am expecting to see "Processing line 100000" from MaRaClusterIO::parseClustersForRTimePairs.
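To see whether it is actually doing anything, I can attach to the process (a sketch; it assumes the binary is called quandenser and that strace is available on the node):

```
# 100% CPU with no syscall activity suggests a pure compute loop
# rather than stuck I/O
strace -fp "$(pgrep -n quandenser)"
```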
Hi Andrew,
Yes, a new output directory should be expected. Since it said "Already processed", that means you got it to load properly, which is great!
I have seen the freezing in maracluster a couple of times before when I started doing the parallel processing for Quandenser and it was sometimes coupled with issue #15. Matthew The made some changes to the Quandenser code later, which made the problem disappear. When I induced crashes in the pipeline to test the robustness a couple of months ago, it could usually recover, but would very rarely hang or crash. It could be because of a corrupted file after the crash. Sadly, my fix for that was to either delete or move the maracluster diretory and/or the percolator directory from Quandenser_output. The dinosaur directory will mostly be unaffected from a crash from what I have seen.
The non-parallel run of Quandenser (a.k.a. the "normal" run) is slightly more robust to crashes. I hope you get it to work with your previous run. If it crashes or hangs again, I could try running your dataset on an HPC cluster I have access to, depending on whether you can share the data. If you want to run the full pipeline (i.e. tide-search + triqler), I would also need a fasta file with the proteins and the correct labels for the files (which should be in file_list.txt).
Hello,
Is there any way to pick up a job where it was interrupted by running out of time on a cluster? My job with 52 mzMLs on a node with 48 cores and 100 GB RAM hit the 24-hour limit and was 75% of the way through the targeted dinosaur/percolator steps (taking longer than anticipated, see https://github.com/statisticalbiotechnology/quandenser/issues/10).
Thanks,
Andrew