The error no space left on device is generated by Picard and means that it cannot allocate enough disk space, therefore NXF_OPTS and memory settings won't help.
Not sure how Singularity handles the /tmp directory. AFAIK it's mapped onto the host storage; if that's the case, the problem is that there isn't enough disk space in the node's scratch storage.
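If so, one workaround might be to bind a larger host directory over /tmp inside the container. A minimal sketch for the Nextflow config, assuming the node's scratch path is exported as $TMPDIR on the compute host (not verified for Hebbe):

singularity {
    // bind the (larger) node scratch area over the container's /tmp;
    // $TMPDIR is assumed to resolve to the scratch dir on the compute node
    runOptions = '--bind $TMPDIR:/tmp'
}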
Hi @demilappa,
As @pditommaso says - no space left on device
is disk space, not virtual memory.
Are you using -profile hebbe when you run the command? If so, it should be using the hebbe config file, which disables the specification of process memory entirely. I was under the impression that specifying it at all on this cluster breaks things.
If you could paste the command you're using to launch the pipeline, and also the Singularity script you're using to mount the relevant directories, that would be helpful.
In the last release, we added options to skip various QC steps. After reading your issue I realise that MarkDuplicates was missed - I've just fixed that in the dev branch (see https://github.com/nf-core/rnaseq/pull/83). So if you use -r dev when running the pipeline to get the development code (note: not reproducible, as it's not a stable release), you can skip this step. See the new documentation. This will of course be properly available the next time we do a release.
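For illustration, the new skip option could then also be set in your own config file - something like this, where the parameter name is just a placeholder (check the new documentation for the real one):

params {
    // placeholder name - see the new docs for the actual skip flag
    skip_markduplicates = true
}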
Phil
Hi @pditommaso, @ewels
You are correct in assuming that I am using -profile hebbe to run the command. However, I am also using an additional config_TEMPLATE file where, besides stating the paths to my bed, fasta and starIndex files, I specify the required memory for markDuplicates:
process {
executor = 'slurm'
clusterOptions = { "-A $params.project ${params.clusterOptions ?: ''}" }
/* The Hebbe scheduler fails if you try to request an amount of memory for a job */
$markDuplicates.memory = 6.GB
}
This way the scheduler does not crash! This is the Singularity recipe I used to build the image:
Bootstrap: docker
From: nfcore/rnaseq
%post
# create mount points for the Hebbe host directories that get bind-mounted in
mkdir -p /c3se
mkdir -p /local
mkdir -p /apps
mkdir -p /usr/share/lmod/lmod
mkdir -p /var/hasplm
mkdir -p /var/opt/thinlinc
mkdir -p /usr/lib64
# empty placeholder files so the host's VirtualGL libraries and nvidia-smi
# have targets to bind-mount onto
touch /usr/lib64/libdlfaker.so
touch /usr/lib64/libvglfaker.so
touch /usr/bin/nvidia-smi
And this is the command I use to launch the pipeline:

nextflow run nf-core/RNAseq \
-r 1.0 \
-with-singularity "/c3se/NOBACKUP/groups/c3-c3se605-15-5/BARIA_RNASeq/C101HW18060480/Pipeline/nfcore-rnaseq-1.4.simg" \
-c "/c3se/NOBACKUP/groups/c3-c3se605-15-5/BARIA_RNASeq/C101HW18060480/Pipeline/config_TEMPLATE" \
-profile hebbe \
-with-dag flowchart.pdf \
--project C3SE2018-1-20 \
--genome 'GRCh38' \
--reads "/c3se/NOBACKUP/groups/c3-c3se605-15-5/BARIA_RNASeq/C101HW18060480/Pipeline/*{1,2}.fq.gz" \
--outdir "/c3se/NOBACKUP/groups/c3-c3se605-15-5/BARIA_RNASeq/C101HW18060480/Pipeline/results" \
-resume
From the nextflow log I found this:
nxf_mktemp() {
local base=${1:-/tmp}
if [[ $(uname) = Darwin ]]; then mktemp -d $base/nxf.XXXXXXXXXX
else TMPDIR="$base" mktemp -d -t nxf.XXXXXXXXXX
fi
}
and this:
set +u; env - PATH="$PATH" SINGULARITYENV_TMP="$TMP" SINGULARITYENV_TMPDIR="$TMPDIR"
and I have set $TMPDIR within the sbatch script to point to a directory that has more than 1TB of available space.
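I also wondered whether pointing Nextflow's scratch directive at it could help - a sketch of what I mean (untested):

process {
    // untested idea: run each task inside the node-local scratch area;
    // single quotes so $TMPDIR is resolved on the compute node at runtime
    scratch = '$TMPDIR'
}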
Great to know that you have now made it possible to skip MarkDuplicates via the dev branch! I will definitely give it a try.
Great! Thanks for the details. I've reformatted what you've written a little so that it's easier to read (e.g. making the nextflow command multi-line), hope that's ok.
The Hebbe scheduler fails if you try to request an amount of memory for a job
I'm a bit confused by this. You say you need to add this line to stop it from crashing, but in that line you are requesting an amount of memory for a job, which is the opposite of what the comment says.
Minor thing: nextflow handles relative paths fine. Not sure if this is not an option for some reason, but this may be easier:
#!/usr/bin/env bash
cd /c3se/NOBACKUP/groups/c3-c3se605-15-5/BARIA_RNASeq/C101HW18060480/Pipeline/
nextflow run nf-core/RNAseq \
-r 1.0 \
-with-singularity nfcore-rnaseq-1.4.simg \
-c config_TEMPLATE \
-profile hebbe \
-with-dag flowchart.pdf \
--project C3SE2018-1-20 \
--genome 'GRCh38' \
--reads "*{1,2}.fq.gz" \
-resume
A couple of other notes:
- I removed --outdir as that would have been the default anyway.
- The DAG is saved to the results/pipeline_info folder if you don't specify -with-dag.
- The work directory can be set with the -w command line flag.
The filename nfcore-rnaseq-1.4.simg makes me a little suspicious, as the pipeline is version 1.0 now, not 1.4. But maybe that's just a typo.
and I have set the $TMPDIR within the sbatch script to point to a directory that has more than 1TB available space.
I'm not sure which sbatch script you mean here... Do you mean the scripts generated by nextflow within the work directory? They are dynamically generated, so any edits you make after they are created won't be used in later runs. Or do you mean that you're launching the main nextflow command within an sbatch command?
Thank you for your input @ewels. Just a few things:
Just to make it clear, I am not editing anything in the hebbe config file. I have added this as a parameter to my config_TEMPLATE file.
When I launched the pipeline without specifying any memory requirement, the pipeline crashed as well, just with a different exit status (143). MarkDuplicates memory defaulted to 3GB and it reported that it had reached the memory limit and couldn't execute.
So when I added this parameter, at least it was able to run the markduplicates command. Now it exits with exit status (1).
Thanks for your suggestion on handling relative paths. The 1.4 in nfcore-rnaseq-1.4.simg is indeed a typo.
"Or do you mean that you're launching the main nextflow command within an sbatch command?" - I use an sbatch command to submit my bash script as a job on the cluster, so yes.
Just before the execution of the nextflow command I export this:
export _JAVA_OPTIONS=-Djava.io.tmpdir=/path/with/enough/space/
which switches the default /tmp directory for Java IO to a location where Picard has more than enough space to run.
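If the export from the launch shell doesn't reach the tasks on the compute nodes, I suppose the equivalent in the Nextflow config would be something like this (an untested assumption on my part):

env {
    // untested: export the same Java tmpdir setting into every task's environment
    _JAVA_OPTIONS = '-Djava.io.tmpdir=/path/with/enough/space/'
}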
Hi again!
So I tried to run the pipeline from the dev
branch, as you suggested. Before I did that I:
- pulled the new image with singularity pull docker://nfcore/rnaseq:dev
- changed -r 1.0 to -r dev
Now the pipeline cannot launch properly. I get an error for trim_galore:
ERROR ~ Error executing process > 'trim_galore (S114J)'
Caused by:
Failed to submit process to grid scheduler for execution
Command executed:
sbatch .command.run
Command exit status:
1
Command output:
sbatch: error: rejecting job, too much memory requested for 2 cores. You must use -C MEMXXX or increase number of requested cores instead of using --mem: pn_min_memory = 65536
sbatch: error: Batch job submission failed: Invalid generic resource (gres) specification
My command for launching the pipeline is the same as before:
cd /mydirectory/
nextflow run nf-core/RNAseq \
-r dev \
-with-singularity "nfcore-rnaseq-1.0.simg" \
-c "config_TEMPLATE" \
-profile hebbe \
-with-dag flowchart.pdf \
--project Cxxxx-1-20 \
--genome 'GRCh38' \
--reads "*{1,2}.fq.gz" \
--skip_dupradar \
-resume
Looks like the memory allocation for trim_galore is problematic in this branch.
Please can you paste the contents (or at least the header) of the .command.run
file that's in the work directory for the trim galore task? With all of the #SBATCH
headers.
The hebbe profile should disable any memory requests for this task (see config), so it should be requesting only 2 cores. I updated the syntax for this in the dev branch to the newer nextflow style, so it could be that this isn't working. I'm not sure what the second sbatch error is (the gres specification).
If the sbatch config is wrong then that's something we can fix in the pipeline. If not, then we may need to contact your sysadmins for help as it sounds system-specific.
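One possible stop-gap on your side: the sbatch error above says to use -C MEMXXX instead of --mem, so you could try passing a node feature constraint through clusterOptions - a sketch, where MEM128 is just a guess at one of your cluster's feature names:

process {
    // sketch: ask for a big-memory node via a feature constraint rather than --mem;
    // 'MEM128' is an assumed feature name - check the Hebbe docs for valid values
    clusterOptions = { "-A $params.project -C MEM128 ${params.clusterOptions ?: ''}" }
}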
Ah - what version of nextflow are you using?
I can see it is allocating 2 cores (1/10th of a node) but I don't get why it would want to allocate the entire node's memory for trim_galore (--mem: pn_min_memory = 65536
). It seemed to work fine under release 1.0
(but that release has another issue, the main topic of this thread).
So I am attaching my command.run file here: command.run.txt
@pditommaso - any ideas what's happening here? Context:
- The hebbe profile overwrites the memory requirements here (withName:trim_galore.memory = null)
- The generated .command.run nevertheless contains #SBATCH --mem 65536
- Previously, with the older syntax $trim_galore.memory (here), this seemed to work fine.
Any ideas why nextflow is now trying to allocate the entire node's memory for the job? Do I have something wrong with the new syntax?
@demilappa - sorry, I think my follow up question was missed: what version of nextflow are you using?
@ewels Nextflow version:
N E X T F L O W
version 0.31.1 build 4886
last modified 07-08-2018 15:53 UTC (17:53 CEST)
cite doi:10.1038/nbt.3820
http://nextflow.io
Because the syntax is wrong. It should be withName:trim_galore { memory = null }. See here.
Also, you should be able to replace this whole block with withName: '*' { memory = null }.
@pditommaso
Thank you for the tip on the syntax.
When I added to my config_TEMPLATE
file (not the hebbe conf):
process {
executor = 'slurm'
clusterOptions = { "-A $params.project ${params.clusterOptions ?: ''}" }
/* The Hebbe scheduler fails if you try to request an amount of memory for a job */
memory = null
withName:makeSTARindex { memory = null }
withName:makeHisatSplicesites { memory = null }
withName:makeHISATindex { memory = null }
withName:fastqc { memory = null }
withName:trim_galore { memory = null }
withName:star { memory = null }
withName:hisat2Align { memory = null }
withName:hisat2_sortOutput { memory = null }
withName:rseqc { memory = null }
withName:genebody_coverage { memory = null }
withName:preseq { memory = null }
withName:markDuplicates { memory = null }
withName:dupradar { memory = null }
withName:featureCounts { memory = null }
withName:merge_featureCounts { memory = null }
withName:stringtieFPKM { memory = null }
withName:sample_correlation { memory = null }
withName:multiqc { memory = null }
}
the pipeline was able to continue launching.
Always the same sbatch error?
Ok great, so withName
doesn't work with the dot syntax, noted 👍 PR incoming to fix the hebbe profile on dev
.
@demilappa - great the pipeline is now launching properly! Let's see how far we get in the execution before the next error 😆
Closing this for now - feel free to open a new issue if you have more difficulties :)
Even though the issue was closed, the original question remains:
Even though I have set my $TMPDIR directory to have a lot of available space, I still get the error from markduplicates. Whatever this dir is, it always says no space left on device from the java.io.IOException.
https://github.com/nf-core/rnaseq/blob/df3a671b5e825070af4a9bb63bd41da307bf848d/conf/base.config#L16
The default value for cpu allocation is 1/10th of a node. The C3SE docs say that $TMPDIR is 1600 GB, so at 1/10th only 160 GB is allocated. It might be that this is simply not enough in @demilappa's case.
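If that's the cause, one untested idea would be to request more cores for just that process, which should also grow its share of the scratch disk:

process {
    // untested: a full node's worth of cores should come with the full ~1600 GB scratch
    withName:markDuplicates { cpus = 20 }
}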
Possibly... @demilappa are you still getting this error? I mostly closed the issue because it went quiet 😀
@ewels I still get this error when running the pipeline.
I only had a successful execution by skipping the markDuplicates
step - this is why I didn't reply. The issue remains, and I agree with @mihai-sysbio that it has to do with resource allocation on hebbe.
I'm not sure if withName: '*' { memory = null } works.
I get this error when starting the pipeline with version 1.1:
WARN: The config file defines settings for an unknown process: *
ERROR ~ Dangling meta character '*' near index 0
I updated to nextflow 18.10.1 before trying the 1.1 version of the pipeline.
The code is correct, i.e. '*'
is a regular expression; it seems to be a problem with Nextflow https://github.com/nextflow-io/nextflow/issues/905
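Until that's fixed, a pattern that is already a valid regular expression should get past the parser, e.g.:

process {
    // '.*' is a valid regex matching every process name, unlike the bare '*'
    withName: '.*' { memory = null }
}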
Hi @demilappa, did you ever manage to get markduplicates to run? I was getting the same errors, but eventually made it work with the combination of parameters shown below. I am not running the same pipeline as you, though, so maybe your problem is specific to this workflow.
java -Xms1g -Xmx32g -XX:ParallelGCThreads=$THREADS -XX:MaxPermSize=1g -XX:+CMSClassUnloadingEnabled \
    -jar $MRKDUP \
    INPUT=$OUTDIR/${RNAME}_${QNAME}-s.bam \
    OUTPUT=$OUTDIR/${RNAME}_${QNAME}-smd.bam \
    METRICS_FILE=$OUTDIR/${RNAME}_${QNAME}-smd.metrics \
    AS=TRUE \
    VALIDATION_STRINGENCY=LENIENT \
    MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 \
    MAX_RECORDS_IN_RAM=900000000 \
    TMP_DIR=$TMPDIR \
    REMOVE_DUPLICATES=TRUE
I used the default disk $TMPDIR, i.e. I did not define it myself (when I tried that I used up the entire file quota, oops), and ran the job on the 512 GB RAM node using all 20 cores.
Hope this helps, Francisco
Could you give it another try? We adjusted several things in the latest release that made it more stable with other datasets, so hopefully that also resolves your issues here: #179
Hi @apeltzer! I have actually managed to complete the pipeline's run successfully with the previous release. Apparently the issue had to do with the resource allocation and job distribution of the cluster system. I will give it a shot with the latest release, too.
Awesome - then I'll close this issue; just reopen it if required 👍
Thanks for the feedback 👍
Hi @apeltzer, just some more info that might be useful to you on this markDuplicates java memory issue. I had two separate errors:
1.) Had to do with task.memory.toGiga() > 8 - I think that should be < 8 to agree with https://github.com/uct-cbio/RNAseq-pipeline/blob/uct-dev/nextflow.config#L21? Otherwise I get:
No signature of method: _nf_script_67572c92.$() is applicable for argument types: (_nf_script_67572c92$_run_closure45$_closure130$_closure131) values: [_nf_script_67572c92$_run_closure45$_closure130$_closure131@5c0a8de1] Possible solutions: is(java.lang.Object), run(), run(), any(), any(groovy.lang.Closure), use([Ljava.lang.Object;)
2.) I experienced similar java memory issues as reported on this thread, which I think has to do with our specific cluster setup/resource allocation and job distribution, as noted by @demilappa.
I had to adjust the java max memory (-Xmx) to 8 GB less than the markDuplicates $task.memory to avoid memory allocation errors. In addition, I had to add the option -XX:ParallelGCThreads=${task.cpus} for it to work on our cluster. The relevant code is here https://github.com/uct-cbio/RNAseq-pipeline/blob/uct-dev/main.nf#L870-L874 and here https://github.com/uct-cbio/RNAseq-pipeline/blob/uct-dev/conf/base.config#L26-L31
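Roughly, the adjustment looks like this (my paraphrase of the linked code, not a verbatim copy):

// paraphrase of the linked uct-cbio code: half of task.memory as the starting
// heap, task.memory - 8 GB as the maximum, and one GC thread per allocated CPU
markdup_java_options = "\"-Xms" + (task.memory.toGiga() / 2) + "g " +
    "-Xmx" + (task.memory.toGiga() - 8) + "g " +
    "-XX:ParallelGCThreads=" + task.cpus + "\""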
Hi @kviljoen !
1.) Hm, actually this line here: https://github.com/nf-core/rnaseq/blob/37f260d360e59df7166cfd60e2b3c9a3999adf75/main.nf#L871
checks whether we have more than 8GB of memory on the system, and in that case uses the default value from the base.config, which typically proved to work quite well. If the available memory is lower, it sets -Xms to half that value and -Xmx to the available memory minus 1 GB. What did you use as maximum values for these?
If I look at your code:
markdup_java_options = (task.memory.toGiga() < 8) ? ${params.markdup_java_options} : "\"-Xms" + (task.memory.toGiga() / 2 )+"g "+ "-Xmx" + (task.memory.toGiga() - 8)+ "g\""
This will set the options to the default of -Xms 7000M ... if the total memory available is less than 8GB, meaning that if you have e.g. just 3-4 GB of memory on that system, it will likely fail :-/
The upper -Xmx limit is something else to consider; we found the -1 GB option to be fine, as it only fails if you specify 1GB or less and should work in all other cases. If you specify ~8GB, your version deducts 8GB and probably drops to the default values used by picard...
2.) I think adding the -XX:ParallelGCThreads=${task.cpus} option is actually a good point and something for a new PR - do you want to open one adding this?
Hi @apeltzer, thanks for your reply! Ah, I see now why you did that - for systems with very limited memory? But on our system that line unfortunately gave an error, as it limits memory to < 8GB. My task.memory = 32GB and task.cpus = 8, and from the pipeline execution report the median usage of requested CPUs was 82.5% and of requested memory 97%. The settings I have at the moment won't work for everyone, but they illustrate that the memory specification will need some tweaking in certain instances. Will do a PR for 2.).
Yes, the idea was to not fail on systems with less than 8GB of memory. Every system with more than 8GB of memory will anyway use the default values specified in the base.config. We did some benchmarks, and it proved to work most reliably with the settings specified there, so that is where this value comes from 👍
Do you know what you had specified when you got this error?
No signature of method: _nf_script_67572c92.$() is applicable for argument types: (_nf_script_67572c92$_run_closure45$_closure130$_closure131) values: [_nf_script_67572c92$_run_closure45$_closure130$_closure131@5c0a8de1] Possible solutions: is(java.lang.Object), run(), run(), any(), any(groovy.lang.Closure), use([Ljava.lang.Object;)
Hi @apeltzer, yes - I get that error when I use > 8 as in your code, and it is solved by using < 8.
I think I'm maybe not understanding that line of code?
markdup_java_options = (task.memory.toGiga() > 8) ? ${params.markdup_java_options} : "\"-Xms" + (task.memory.toGiga() / 2 )+"g "+ "-Xmx" + (task.memory.toGiga() - 1)+ "g\""
I thought it was saying: if the task.memory specified in the base.config is > 8GB, use the options specified in params.markdup_java_options, else use task.memory - 1? In which case it can never specify more than 8GB, because the default is 7GB?
That is precisely what this does. We double-checked the production pipelines in the GATK team's CWL repository, for example, and found that even if their processing systems have more than 8GB of memory, they always default to the setting specified in the base.config. We had a lot of weird errors when e.g. specifying 32GB of memory, due to much higher memory consumption in such cases, so limiting this to always default to the value set in base.config resolved the issues we had on multiple clusters.
Yes, that means the process can never use more than 8GB, which is fine since it will anyway submit with this information to a cluster scheduler, thus making sure multiple jobs can run on that instance/node.
Hi all!
I am running nfcore/rnaseq on the Hebbe cluster, using an image pulled with singularity. The .simg was created with a recipe so that all the relevant dirs are mounted.
I have set
NXF_OPTS='-Xms1g -Xmx6g'
in my bash profile, and I have also changed the input parameters for markduplicates to $markDuplicates.memory = 6.GB. This is because only 2 out of 20 cores are allocated by the pipeline and RAM is proportional, so the default 3GB was insufficient - the pipeline was crashing.
Even though I have set my $TMPDIR directory to have a lot of available space, I still get the error from markduplicates. Whatever this dir is, it always says no space left on device from the java.io.IOException.
Any idea how I can get this pipeline to finish, even without markDuplicates? I can provide all logs and .command files from the workdir.