Closed aakarsh-anand closed 3 years ago
Your config file is fine. I think the main reason is that your work directory exceeds the 5GB quota, so Nextflow cannot create any new directories.
Unable to create folder=/shared/home/aakarshanand/work/b0/434036cc6db64f5321a6a19bd58c01 -- check file system permission
Can you check your work directory /shared/home/aakarshanand/ usage?
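For reference, a quick way to check that usage from a shell (the `work` subpath below is an assumption based on the error message above):

```shell
# Summarize home-directory usage: -s gives one total, -h human-readable sizes.
du -sh /shared/home/aakarshanand/

# The Nextflow 'work' directory is usually the biggest offender; this
# subpath is an assumption based on the "Unable to create folder" error above.
du -sh /shared/home/aakarshanand/work
```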
This isn't a call-sSNV issue, but rather a file permission issue.
I think @maotian06's solution should resolve the issue.
The submission script should guarantee that Nextflow's work directory ends up in /scratch/XXXXXXX through these lines: https://github.com/uclahs-cds/tool-submit-nf/blob/main/submit_nextflow_pipeline.py#L79-L81, so this behavior looks odd.
I just tried using the submission script and the work directory indeed appeared in $HOME, not /scratch/XXXXXXX. @tyamaguchi-ucla This is an issue that will need to be addressed: if the work directory is in $HOME rather than /scratch, pipelines are not using the fast SSD, which would obviously be costly.
Interestingly, if I try to run the same command (that the submission command wraps) manually in an interactive node:
TEMP_DIR=$(mktemp -d /scratch/XXXXXXX) && cd $TEMP_DIR && nextflow run /path/to/call-sSNV.nf -config /path/to/mini-test-str.config
the work directory appears in /scratch/XXXXXXX. So it might have something to do with Slurm.
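For context, the wrapper's intended behavior can be sketched as follows (a minimal sketch, not the actual submission script; `SCRATCH_BASE` and the directory prefix are placeholders):

```shell
#!/bin/bash
# Sketch: give each run a private launch directory on fast local storage,
# so Nextflow creates its work/ and .nextflow/ there instead of in $HOME.
set -euo pipefail

SCRATCH_BASE="${SCRATCH_BASE:-/tmp}"   # would be /scratch on a worker node

# mktemp replaces the trailing X's with random characters, so concurrent
# runs on the same node get distinct directories.
TEMP_DIR=$(mktemp -d "${SCRATCH_BASE}/nf-run.XXXXXXX")

# Nextflow resolves its default work directory relative to the launch
# directory, so changing into TEMP_DIR before 'nextflow run' pins work/ here.
cd "$TEMP_DIR"
echo "Nextflow would place its work directory at: $PWD/work"
```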
Closing this issue since further discussion belongs in the tool-submit-nf repo.
Yup, the default is $HOME (quota: 5GB) but we can use the workDir option in Nextflow. I noticed that call-sSNV didn't have this option and created a task yesterday.
See NF template (and other pipelines) - https://github.com/uclahs-cds/template-NextflowPipeline/blob/0a8aa88e015e7d1f9941c6a76b5ebcea622968d8/pipeline/config/methods.config#L94
@aakarsh-anand For now, can you add workDir = "/scratch" in the methods.config and see if it works? @maotian06 is working on the task and we will create a new release.
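For reference, a minimal sketch of what that looks like in a Nextflow config file (the surrounding comments are illustrative, following the template's methods.config pattern linked above):

```groovy
// methods.config fragment (sketch). workDir is a top-level Nextflow
// config option; the default is ./work relative to the launch directory,
// which is why runs launched from $HOME filled the home quota.
workDir = "/scratch"
```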
Yup, the default is $HOME (quota - 5GB) but we can use the workDir option in Nextflow.
Yes, but the idea of the submission script is to avoid needing this option. Using workDir = "/scratch" is not safe: if multiple pipelines are running on the same node, they will share the same directory. (This is unlikely since Nextflow generates a hash for its work directory, but it is still an issue.) In addition, Nextflow generates .nextflow.log files and other metadata in the .nextflow directory that may conflict when more than one pipeline is running, if the submission script is broken. This is why the submission script names the TEMP_DIR as /scratch/XXXXXXX and executes the pipeline from that directory.
Using workDir = "/scratch" is not safe, since if multiple pipelines are running on the same node, they will both share the same directory. (This is unlikely since Nextflow generates a hash for its work directory, but still an issue)
Yeah, that's why our submission script has the --exclusive option to avoid the issue. We don't run multiple pipelines on the same node at once.
Also, with the current settings, the epilog script will wipe /scratch anyway if 2 or more sbatch jobs are running on the same node. (We want to fix this issue, though.)
Other pipelines except call-sSNV use the workDir option in the methods.config and have never had these issues as far as I know.
In addition, Nextflow also generates .nextflow.log files and other nextflow files in the .nextflow directory that may have conflicts when > 1 pipeline is running if the submission script is broken.
What do you mean by "if the submission script is broken"?
Yeah, that's why our submission script has the --exclusive option to avoid the issue. We don't run multiple pipelines on the same node at once.
I get that we generally don't want to do this, but what if there is a case where it is a good idea? One example is SomaticSniper: it doesn't use more than 1 CPU at a time, but needs more memory than is available on an F2 node.
Also, the epilog script will wipe /scratch anyway with the current settings if 2 or more sbatch jobs are running on the same node. (we want to fix this issue tho)
Maintaining the /scratch/XXXXXXX functionality in the submission script means a specific directory in /scratch can be wiped, rather than the entire thing.
That said, I agree: adding workDir fixes the issue, given that we always use --exclusive.
What do you mean by "if the submission script is broken"?
The submission script is not executing jobs in /scratch/XXXXXXX as it is supposed to.
Can you check your work directory /shared/home/aakarshanand/ usage?
I checked this location and found a folder called 'work' with many folders in it (and many genome files), which is probably what's filling the quota. I can just delete this folder for now, right?
For now, can you add workDir = "/scratch" in the methods.config and see if it works?
Should I just try this for now until we have a more permanent solution?
I think you can delete the whole 'work' directory to run the pipeline now; that should work for you right now. I will do a test run with workDir = "/scratch" tomorrow to see if it works and update you.
It seems that @RoniHaas is having the same issue.
Should I try to run with workDir = "/scratch"? @tyamaguchi-ucla
That should work as a quick fix.
@RoniHaas make sure to clean up your home directory as well. The quota is 5GB as we discussed in the lab meeting. https://confluence.mednet.ucla.edu/display/BOUTROSLAB/Cluster+Training+Material
@RoniHaas I think Caden fixed the submission script here: https://github.com/uclahs-cds/tool-submit-nf/pull/30
I tested it just now and it is working. Perhaps you can update your submission script and give it a try?
Thanks, @maotian06 and @tyamaguchi-ucla. I tried both solutions and both work. I also removed the 'work' directory that was created in my home directory.
Now I am having more problems, but they are not related to this issue. I will create a new issue if what I am trying now doesn't work.
I have just received a notification that my sSNV run on F72-5 failed due to node failure:
*** JOB 24442 ON F72-5 CANCELLED AT 2021-09-28T10:10:34 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
Do you think it may be related to the incident reported by OHIA HPC Support?
For the run, I used that branch, before it was merged.
Issue resolved via https://github.com/uclahs-cds/tool-submit-nf/pull/30
Describe the bug
When running call-sSNV somaticsniper on a-full-P9 and a-full-S8, I get the following error on both runs:
Configs:
/hot/users/aanand/config/call-sSNV/bwa-mem/a-full-P9-sSNV.config
/hot/users/aanand/config/call-sSNV/bwa-mem/a-full-S8-sSNV.config
Logs:
/hot/users/aanand/a-full-P9_sSNV.log
/hot/users/aanand/a-full-S8_sSNV.log
To Reproduce
python tool-submit-nf/submit_nextflow_pipeline.py \
    --nextflow_script /hot/users/aanand/pipeline-call-sSNV/pipeline/call-sSNV.nf \
    --nextflow_config /hot/users/aanand/config/call-sSNV/bwa-mem/a-full-P9-sSNV.config \
    --pipeline_run_name a-full-P9_sSNV \
    --partition_type F72 \
    --email AakarshAnand@mednet.ucla.edu
Expected behavior
I'm confused because the pipeline completed successfully on a-full-P2, where the only major difference in the config is the tumor sample; see config here:
/hot/users/aanand/config/call-sSNV/bwa-mem/a-full-P2-sSNV.config
Any idea what could be causing this?