kentsisresearchgroup / UltraQuant

MaxQuant with snakemake and singularity workflow for open and scalable mass spectrometry data analysis on Linux computing clusters

Running on multiple nodes (SLURM) #6

Open mhaseeb123 opened 3 years ago

mhaseeb123 commented 3 years ago

Hi,

I am trying to run the UltraQuant workflow on multiple nodes of a SLURM-based cluster (in parallel, MPI-style). Unfortunately, it keeps failing with fatal I/O errors (missing files) whenever I run it on more than one node.

Here is the command that I am running:

snakemake --snakefile UltraQuant.sm --configfile config.yaml --cluster "srun --nodes=2 --ntasks=2 --ntasks-per-node=1 --cpus-per-task=24 -t 2:00:00 -o '/home/mhaseeb/ultraquant/UltraQuant/uquant.%j.out' -J 'uqnt_2'" maxQuant -j 48 -k --latency-wait 60 --use-singularity --singularity-args "--bind /oasis/scratch/comet/mhaseeb/temp_project/RAWPXD015890:/oasis/scratch/comet/mhaseeb/temp_project/RAWPXD015890,/home/mhaseeb:/home/mhaseeb,/oasis/scratch/comet/mhaseeb/temp_project:/oasis/scratch/comet/mhaseeb/temp_project" --ri

SLURM prints this error to STDOUT:

srun: error: _server_read: fd 17 n -1 got error or unexpected eof reading header: Connection reset by peer
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 1
srun: error: _server_read: fd 16 n -1 got error or unexpected eof reading header: Connection reset by peer
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 0

The log file, meanwhile, is full of errors like the following:

Unhandled Exception:
System.IO.FileNotFoundException: Could not find file "/oasis/scratch/comet/mhaseeb/temp_project/uquant_temp/7Sep18_Olson_F3/p0/7Sep18_Olson_F3.peaksi"
File name: '/oasis/scratch/comet/mhaseeb/temp_project/uquant_temp/7Sep18_Olson_F3/p0/7Sep18_Olson_F3.peaksi'
  at System.IO.FileStream..ctor (System.String path, System.IO.FileMode mode, System.IO.FileAccess access, System.IO.FileShare share, System.Int32 bufferSize, System.Boolean anonymous, System.IO.FileOptions options) [0x0019e] in <b0e1ad7573a24fd5a9f2af9595e677e7>:0
  at System.IO.FileStream..ctor (System.String path, System.IO.FileMode mode, System.IO.FileAccess access, System.IO.FileShare share) [0x00000] in <b0e1ad7573a24fd5a9f2af9595e677e7>:0
  at (wrapper remoting-invoke-with-check) System.IO.FileStream..ctor(string,System.IO.FileMode,System.IO.FileAccess,System.IO.FileShare)
  at BaseLibS.Util.FileUtils.GetBinaryReader (System.String path) [0x00001] in <723eab50db594b3ea663ce1daa243f6b>:0
  at MaxQuantLibS.Data.MsUtil.ReadData (System.Double[]& centerMassArray, System.Int64[]& filePosArray, System.Double[]& intensityArray, System.Double[]& minTimeArray, System.Double[]& maxTimeArray, System.String filename, System.Boolean hasMzBounds, System.Double[]& minMzArray, System.Double[]& maxMzArray) [0x00001] in <9555896b71df485794b1d935400a4370>:0
  at MaxQuantPLibS.Data.Plist.GenericPeakListLayer.SetIndexData (System.Boolean hasMassBounds) [0x0001b] in <fc6a484550e74f0d9da847d8cdee3391>:0
  at MaxQuantPLibS.Features.FeatureDetectionUtil.DetectPeaks (MaxQuantPLibS.Data.RunTypes.LcmsRunType lcmsRunType, BaseLibS.Ms.RawFileLayer rawFile, System.String basePath, MaxQuantPLibS.Basic.GroupParams param, System.Double minMz, System.Double maxMz, MsLib.Util.BoxCarMode mode) [0x00132] in <fc6a484550e74f0d9da847d8cdee3391>:0
  at MaxQuantPLibS.Features.FeatureDetectionUtil.DetectFeatures (MaxQuantPLibS.Basic.MaxQuantParams mqpar, MaxQuantPLibS.Basic.GroupParams param, System.String filename, System.Boolean positiveMode) [0x00041] in <fc6a484550e74f0d9da847d8cdee3391>:0
  at MaxQuantPLibS.Features.FeatureDetectionUtil.DetectFeatures (System.String mqparFile, System.Int32 fileIndex) [0x00101] in <fc6a484550e74f0d9da847d8cdee3391>:0
  at MaxQuantPLibS.Work.FeatureDetection.Calculation (System.String[] args) [0x0000c] in <fc6a484550e74f0d9da847d8cdee3391>:0
  at MaxQuantPLibS.Work.MaxQuantWorkDispatcherUtil.PerformTask (System.Int32 taskType, System.String[] args) [0x00007] in <fc6a484550e74f0d9da847d8cdee3391>:0
  at MaxQuantTask.Program.Function (System.String[] args) [0x00012] in <e62191f5e06c4ebab9fd972c4406b0a4>:0
  at Utils.Util.ExternalProcess.Run (System.String[] args, System.Boolean debug) [0x00132] in <037975a4198c4de38b2b16b335e7f89e>:0
  at MaxQuantTask.Program.Main (System.String[] args) [0x00007] in <e62191f5e06c4ebab9fd972c4406b0a4>:0
[ERROR] FATAL UNHANDLED EXCEPTION: System.IO.FileNotFoundException: Could not find file "/oasis/scratch/comet/mhaseeb/temp_project/uquant_temp/7Sep18_Olson_F3/p0/7Sep18_Olson_F3.peaksi"
File name: '/oasis/scratch/comet/mhaseeb/temp_project/uquant_temp/7Sep18_Olson_F3/p0/7Sep18_Olson_F3.peaksi'
  at System.IO.FileStream..ctor (System.String path, System.IO.FileMode mode, System.IO.FileAccess access, System.IO.FileShare share, System.Int32 bufferSize, System.Boolean anonymous, System.IO.FileOptions options) [0x0019e] in <b0e1ad7573a24fd5a9f2af9595e677e7>:0
  at System.IO.FileStream..ctor (System.String path, System.IO.FileMode mode, System.IO.FileAccess access, System.IO.FileShare share) [0x00000] in <b0e1ad7573a24fd5a9f2af9595e677e7>:0
  at (wrapper remoting-invoke-with-check) System.IO.FileStream..ctor(string,System.IO.FileMode,System.IO.FileAccess,System.IO.FileShare)
  at BaseLibS.Util.FileUtils.GetBinaryReader (System.String path) [0x00001] in <723eab50db594b3ea663ce1daa243f6b>:0
  at MaxQuantLibS.Data.MsUtil.ReadData (System.Double[]& centerMassArray, System.Int64[]& filePosArray, System.Double[]& intensityArray, System.Double[]& minTimeArray, System.Double[]& maxTimeArray, System.String filename, System.Boolean hasMzBounds, System.Double[]& minMzArray, System.Double[]& maxMzArray) [0x00001] in <9555896b71df485794b1d935400a4370>:0
  at MaxQuantPLibS.Data.Plist.GenericPeakListLayer.SetIndexData (System.Boolean hasMassBounds) [0x0001b] in <fc6a484550e74f0d9da847d8cdee3391>:0
  at MaxQuantPLibS.Features.FeatureDetectionUtil.DetectPeaks (MaxQuantPLibS.Data.RunTypes.LcmsRunType lcmsRunType, BaseLibS.Ms.RawFileLayer rawFile, System.String basePath, MaxQuantPLibS.Basic.GroupParams param, System.Double minMz, System.Double maxMz, MsLib.Util.BoxCarMode mode) [0x00132] in <fc6a484550e74f0d9da847d8cdee3391>:0
  at MaxQuantPLibS.Features.FeatureDetectionUtil.DetectFeatures (MaxQuantPLibS.Basic.MaxQuantParams mqpar, MaxQuantPLibS.Basic.GroupParams param, System.String filename, System.Boolean positiveMode) [0x00041] in <fc6a484550e74f0d9da847d8cdee3391>:0
  at MaxQuantPLibS.Features.FeatureDetectionUtil.DetectFeatures (System.String mqparFile, System.Int32 fileIndex) [0x00101] in <fc6a484550e74f0d9da847d8cdee3391>:0
  at MaxQuantPLibS.Work.FeatureDetection.Calculation (System.String[] args) [0x0000c] in <fc6a484550e74f0d9da847d8cdee3391>:0
  at MaxQuantPLibS.Work.MaxQuantWorkDispatcherUtil.PerformTask (System.Int32 taskType, System.String[] args) [0x00007] in <fc6a484550e74f0d9da847d8cdee3391>:0
  at MaxQuantTask.Program.Function (System.String[] args) [0x00012] in <e62191f5e06c4ebab9fd972c4406b0a4>:0
  at Utils.Util.ExternalProcess.Run (System.String[] args, System.Boolean debug) [0x00132] in <037975a4198c4de38b2b16b335e7f89e>:0
  at MaxQuantTask.Program.Main (System.String[] args) [0x00007] in <e62191f5e06c4ebab9fd972c4406b0a4>:0

I looked through the temp_dir and the work_dir, and it seems the mqpar_conversion rule creates only 1 data partition (n0 and p0). As I understand it, it should create partitions for at least 2 nodes (with 24 cores each?), assuming the workflow follows a MapReduce-like model.

Can someone help me get around this issue? Thank you.

P.S. I am not experienced with either snakemake or singularity so I am not sure if I am doing something really dumb here.

pillepalle123 commented 3 years ago

Not sure about your issue, but I don't think MaxQuant supports cross-node computing. Maybe check whether it runs without errors when using only one node.

mhaseeb123 commented 3 years ago

@pillepalle123

I checked, and the workflow seems to run without errors on one node. But I am still unable to run it on more than one node.

With one node, I simply run this:

snakemake --snakefile UltraQuant.sm --configfile config.yaml --cluster "srun --nodes=1 --ntasks=1 --ntasks-per-node=1 --cpus-per-task=24 -t 2:00:00 -o '/home/mhaseeb/ultraquant/UltraQuant/uquant.%j.out' -J 'uqnt_2'" maxQuant -j 24 -k --latency-wait 60 --use-singularity --singularity-args "--bind /oasis/scratch/comet/mhaseeb/temp_project/RAWPXD015890:/oasis/scratch/comet/mhaseeb/temp_project/RAWPXD015890,/home/mhaseeb:/home/mhaseeb,/oasis/scratch/comet/mhaseeb/temp_project:/oasis/scratch/comet/mhaseeb/temp_project" --ri

mhaseeb123 commented 3 years ago

Here is another log. It seems the workflow is not being set up properly: the same process (with the same inputs) is executing on multiple nodes, causing race conditions. One process deletes or moves a file before the other has finished with it, which triggers the unhandled file exceptions. See the full log below.

Building DAG of jobs...
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 24
Using shell: /bin/bash
Rules claiming more threads will be scaled down.
Provided cores: 24
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       maxQuant
        1
Job counts:
        count   jobs
        1       maxQuant
        1
Select jobs to execute...
Select jobs to execute...

[Tue Feb 23 11:49:52 2021]
rule maxQuant:
    input: out/mqpar.xml
    output: out/combined/txt/summary.txt
    log: out/logs/maxQuant.txt
    jobid: 0
    benchmark: out/benchmarks/maxQuant.txt

[Tue Feb 23 11:49:52 2021]
rule maxQuant:
    input: out/mqpar.xml
    output: out/combined/txt/summary.txt
    log: out/logs/maxQuant.txt
    jobid: 0
    benchmark: out/benchmarks/maxQuant.txt

Activating singularity image /oasis/scratch/comet/mhaseeb/temp_project/RAWPXD015890/work_dir/.snakemake/singularity/79274f8c7291fda81f2362ed0688e4fc.simg
Activating singularity image /oasis/scratch/comet/mhaseeb/temp_project/RAWPXD015890/work_dir/.snakemake/singularity/79274f8c7291fda81f2362ed0688e4fc.simg
Cannot delete folder /oasis/scratch/comet/mhaseeb/temp_project/RAWPXD015890/work_dir/out/combined/proc. Please make sure no other processes are accessing it.
Configuring
[Tue Feb 23 11:49:54 2021]
Error in rule maxQuant:
    jobid: 0
    output: out/combined/txt/summary.txt
    log: out/logs/maxQuant.txt (check log file(s) for error message)
    shell:
        mono /home/mhaseeb/ultraquant/UltraQuant/MaxQuant/bin/MaxQuantCmd.exe out/mqpar.xml
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Testing files

If anyone knows how to fix this, any help would be highly appreciated :)
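
The race suspected above can be sketched in a few lines of shell. This is a simplified illustration only, not MaxQuant's actual behavior: the directory layout and the file name `sample.peaksi` are hypothetical, and the two "copies" are interleaved by hand rather than launched by `srun`. It assumes that both copies of the job reset and write to the same shared folder, which is consistent with the duplicated log lines and the "Cannot delete folder" message.

```shell
#!/usr/bin/env bash
# Simplified illustration: two copies of the same job share one scratch
# directory. Paths and file names are hypothetical, not MaxQuant's layout.
set -u

work=$(mktemp -d)                 # stands in for the shared temp directory
peaks="$work/p0/sample.peaksi"    # hypothetical intermediate file

# Copy A starts: prepares its folder and writes an intermediate file.
mkdir -p "$work/p0"
echo "peak data" > "$peaks"

# Copy B starts the identical job: it resets the same folder first.
rm -rf "$work/p0"
mkdir -p "$work/p0"

# Copy A continues and expects its intermediate file to still exist.
if [ -f "$peaks" ]; then
    result="copy A: found intermediate file"
else
    result="copy A: intermediate file missing"
fi
echo "$result"

rm -rf "$work"                    # clean up the scratch directory
```

With two tasks in one `srun` step, the same command is started once per task, so both copies would work on identical inputs and outputs instead of splitting the work between them.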

pillepalle123 commented 3 years ago

Yeah, but that's the thing. MaxQuant just doesn't support running in parallel across several nodes. That's why I believe you can only use one node.