Closed aakarsh-anand closed 3 years ago
Your config file is fine. I think the main reason is that your work directory exceeds the 5GB quota, so Nextflow cannot create any new directories.
Unable to create folder=/shared/home/aakarshanand/work/b0/434036cc6db64f5321a6a19bd58c01 -- check file system permission
Can you check your work directory /shared/home/aakarshanand/ usage?
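For reference, a quick way to check that usage from a shell (the `work` subpath below is an assumption based on the error message above):

```shell
# Summarize home-directory usage: -s gives one total, -h human-readable sizes.
du -sh /shared/home/aakarshanand/

# The Nextflow 'work' directory is usually the biggest offender; this
# subpath is an assumption based on the "Unable to create folder" error above.
du -sh /shared/home/aakarshanand/work
```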
This isn't a call-sSNV issue, but rather a file permission issue.
I think @maotian06's solution should resolve the issue.
The submission script should guarantee that Nextflow's work directory ends up in /scratch/XXXXXXX through these lines: https://github.com/uclahs-cds/tool-submit-nf/blob/main/submit_nextflow_pipeline.py#L79-L81, so this behavior looks odd.
I just tried using the submission script and the work directory indeed appeared in $HOME, not /scratch/XXXXXXX. @tyamaguchi-ucla This is an issue that will need to be addressed: if the work directory is in $HOME rather than /scratch, pipelines are not using the fast SSD, which would obviously be costly.
Interestingly, if I try to run the same command (that the submission command wraps) manually in an interactive node:
TEMP_DIR=$(mktemp -d /scratch/XXXXXXX) && cd $TEMP_DIR && nextflow run /path/to/call-sSNV.nf -config /path/to/mini-test-str.config
the work directory appears in /scratch/XXXXXXX. So it might have something to do with Slurm.
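For context, the wrapper's intended behavior can be sketched as follows (a minimal sketch, not the actual submission script; `SCRATCH_BASE` and the directory prefix are placeholders):

```shell
#!/bin/bash
# Sketch: give each run a private launch directory on fast local storage,
# so Nextflow creates its work/ and .nextflow/ there instead of in $HOME.
set -euo pipefail

SCRATCH_BASE="${SCRATCH_BASE:-/tmp}"   # would be /scratch on a worker node

# mktemp replaces the trailing X's with random characters, so concurrent
# runs on the same node get distinct directories.
TEMP_DIR=$(mktemp -d "${SCRATCH_BASE}/nf-run.XXXXXXX")

# Nextflow resolves its default work directory relative to the launch
# directory, so changing into TEMP_DIR before 'nextflow run' pins work/ here.
cd "$TEMP_DIR"
echo "Nextflow would place its work directory at: $PWD/work"
```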
Closing this issue since further discussion belongs in the tool-submit-nf repo.
Yup, the default is $HOME (quota: 5GB) but we can use the workDir option in Nextflow. I noticed that call-sSNV didn't have this option and created a task yesterday.
See NF template (and other pipelines) - https://github.com/uclahs-cds/template-NextflowPipeline/blob/0a8aa88e015e7d1f9941c6a76b5ebcea622968d8/pipeline/config/methods.config#L94
@aakarsh-anand For now, can you add workDir = "/scratch" in the methods.config and see if it works? @maotian06 is working on the task and we will create a new release.
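For reference, a minimal sketch of what that looks like in a Nextflow config file (the surrounding comments are illustrative, following the template's methods.config pattern linked above):

```groovy
// methods.config fragment (sketch). workDir is a top-level Nextflow
// config option; the default is ./work relative to the launch directory,
// which is why runs launched from $HOME filled the home quota.
workDir = "/scratch"
```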
Yup, the default is $HOME (quota - 5GB) but we can use the workDir option in Nextflow.
Yes, but the idea of the submission script is to avoid needing this option. Using workDir = "/scratch" is not safe: if multiple pipelines are running on the same node, they will share the same directory. (This is unlikely since Nextflow generates a hash for its work directory, but it is still an issue.) In addition, Nextflow generates .nextflow.log files and other metadata in the .nextflow directory that may conflict when more than one pipeline is running, if the submission script is broken. This is why the submission script names the TEMP_DIR as /scratch/XXXXXXX and executes the pipeline from that directory.
Using workDir = "/scratch" is not safe, since if multiple pipelines are running on the same node, they will both share the same directory. (This is unlikely since Nextflow generates a hash for its work directory, but still an issue)
Yeah, that's why our submission script has the --exclusive option to avoid the issue. We don't run multiple pipelines on the same node at once.
Also, with the current settings, the epilog script will wipe /scratch anyway if 2 or more sbatch jobs are running on the same node. (We want to fix this issue, though.)
Other pipelines except call-sSNV use the workDir option in the methods.config and have never had these issues as far as I know.
In addition, Nextflow also generates .nextflow.log files and other nextflow files in the .nextflow directory that may have conflicts when > 1 pipeline is running if the submission script is broken.
What do you mean by "if the submission script is broken"?
Yeah, that's why our submission script has the --exclusive option to avoid the issue. We don't run multiple pipelines on the same node at once.
I get that we generally don't want to do this, but what if there is a case where it is a good idea? One example is SomaticSniper: it doesn't use more than 1 CPU at a time, but needs more memory than is available on an F2 node.
Also, the epilog script will wipe /scratch anyway with the current settings if 2 or more sbatch jobs are running on the same node. (we want to fix this issue tho)
Maintaining the /scratch/XXXXXXX functionality in the submission script means a specific directory in /scratch can be wiped, rather than the entire thing.
That said, I agree: adding workDir fixes the issue, given that we always use --exclusive.
What do you mean by "if the submission script is broken"?
The submission script is not executing jobs in /scratch/XXXXXXX as it is supposed to.
Can you check your work directory /shared/home/aakarshanand/ usage?
I checked this location and found a folder called 'work' with many folders in it (and many genome files), which is probably what's filling the quota. I can just delete this folder for now, right?
For now, can you add workDir = "/scratch" in the methods.config and see if it works?
Should I just try this for now until we have a more permanent solution?
I think you can delete the whole 'work' directory to run the pipeline now; that should work for you right now. I will do a test run with workDir = "/scratch" tomorrow to see if it works and update you.
It seems that @RoniHaas is having the same issue.
Should I try to run with workDir = "/scratch"? @tyamaguchi-ucla
That should work as a quick fix.
@RoniHaas make sure to clean up your home directory as well. The quota is 5GB as we discussed in the lab meeting. https://confluence.mednet.ucla.edu/display/BOUTROSLAB/Cluster+Training+Material
@RoniHaas I think Caden fixed the submission script here: https://github.com/uclahs-cds/tool-submit-nf/pull/30
I tested it just now and it is working. Perhaps you can update your submission script and give it a try?
Thanks, @maotian06 and @tyamaguchi-ucla. I tried both solutions and both work. I also removed the 'work' directory that was created in my home directory.
Now I am having more problems, but they are not related to this issue. I will create a new issue if what I am trying now doesn't work.
I have just received a notification that my sSNV run on F72-5 failed due to node failure:
*** JOB 24442 ON F72-5 CANCELLED AT 2021-09-28T10:10:34 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
Do you think it may be related to the incident reported by OHIA HPC Support?
For the run, I used that branch, before it was merged.
Issue resolved via https://github.com/uclahs-cds/tool-submit-nf/pull/30
Describe the bug
When running call-sSNV somaticsniper on a-full-P9 and a-full-S8, I get the following error on both runs:
Configs:
/hot/users/aanand/config/call-sSNV/bwa-mem/a-full-P9-sSNV.config
/hot/users/aanand/config/call-sSNV/bwa-mem/a-full-S8-sSNV.config
Logs:
/hot/users/aanand/a-full-P9_sSNV.log
/hot/users/aanand/a-full-S8_sSNV.log
To Reproduce
python tool-submit-nf/submit_nextflow_pipeline.py \
    --nextflow_script /hot/users/aanand/pipeline-call-sSNV/pipeline/call-sSNV.nf \
    --nextflow_config /hot/users/aanand/config/call-sSNV/bwa-mem/a-full-P9-sSNV.config \
    --pipeline_run_name a-full-P9_sSNV \
    --partition_type F72 \
    --email AakarshAnand@mednet.ucla.edu
Expected behavior
I'm confused because the pipeline completed successfully on a-full-P2, where the only major difference in the config is the tumor sample; see config here:
/hot/users/aanand/config/call-sSNV/bwa-mem/a-full-P2-sSNV.config
Any idea what could be causing this?