Hi @SalimSoria,
Can you please share your custom.config?
Also, can you try nextflow run labsyspharm/mcmicro --in exemplar-001 -c custom.config -profile singularity,GPU
(Note no spaces around the comma.)
And one more question: is it just UnMicst, or are none of the segmentation containers using the GPU? You can try running them all in parallel with the following params.yml:
workflow:
  segmentation: [unmicst, mesmer, cellpose]
The thing I would check is whether the .command.run files in the corresponding work/ directories contain the expected singularity commands (usually with grep singularity work/*/*/.command.run).
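The same check can be scripted. The sketch below is a hypothetical helper, not part of MCMICRO or Nextflow: it assumes the default work/ layout matched by the grep pattern above, and the function names are made up for illustration. It lists every singularity line and reports the task scripts whose singularity exec line lacks --nv.

```python
from pathlib import Path


def singularity_commands(work_dir="work"):
    """Yield (path, line) for each line mentioning singularity in work/*/*/.command.run."""
    for run_file in sorted(Path(work_dir).glob("*/*/.command.run")):
        for line in run_file.read_text(errors="replace").splitlines():
            if "singularity" in line:
                yield run_file, line.strip()


def missing_nv(work_dir="work"):
    """Return the .command.run files whose 'singularity exec' line has no --nv flag."""
    missing = []
    for run_file, line in singularity_commands(work_dir):
        # Split on whitespace so that --nv is matched as a whole argument.
        if "singularity exec" in line and "--nv" not in line.split():
            missing.append(run_file)
    return missing
```

Calling missing_nv() from the directory where the pipeline was launched returns an empty list when every task was started with --nv.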
Here is the custom.config file:
Docker.runoptions = '--cpus 0.000 --gpus all'
Singularity.runOptions = '—C –-nv'
We also tried running the command with -profile singularity,GPU without the custom.config file. This actually caused the GPU to be detected, but now we've run into another error. See below for the .command.log. The server does have cuDNN installed, but it may not be visible within the container.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/compat/v2_compat.py:111: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Tue Apr 16 00:03:18 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla M40 24GB Off | 00000000:82:00.0 Off | 0 |
| N/A 44C P0 59W / 250W | 0MiB / 23040MiB | 95% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
2024-04-16 00:03:35.142294: E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2024-04-16 00:03:35.143246: E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
/app/UnMicst1-5.py:114: UserWarning: `tf.layers.batch_normalization` is deprecated and will be removed in a future version. Please use `tf.keras.layers.BatchNormalization` instead. In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.BatchNormalization` documentation).
bn = tf.nn.leaky_relu(tf.layers.batch_normalization(c00+shortcut, training=UNet2D.tfTraining))
/usr/local/lib/python3.8/dist-packages/keras/legacy_tf_layers/normalization.py:455: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
return layer.apply(inputs, training=training)
/app/UnMicst1-5.py:136: UserWarning: `tf.layers.batch_normalization` is deprecated and will be removed in a future version. Please use `tf.keras.layers.BatchNormalization` instead. In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.BatchNormalization` documentation).
lbn = tf.nn.leaky_relu(tf.layers.batch_normalization(
/app/UnMicst1-5.py:139: UserWarning: `tf.layers.dropout` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dropout` instead.
return tf.layers.dropout(lbn, 0.35, training=UNet2D.tfTraining)
/usr/local/lib/python3.8/dist-packages/keras/legacy_tf_layers/core.py:401: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
return layer.apply(inputs, training=training)
/app/UnMicst1-5.py:199: UserWarning: `tf.layers.batch_normalization` is deprecated and will be removed in a future version. Please use `tf.keras.layers.BatchNormalization` instead. In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.BatchNormalization` documentation).
tf.layers.batch_normalization(tf.nn.conv2d(cc, luXWeights2, strides=[1, 1, 1, 1], padding='SAME'),
/app/UnMicst1-5.py:220: UserWarning: `tf.layers.batch_normalization` is deprecated and will be removed in a future version. Please use `tf.keras.layers.BatchNormalization` instead. In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.BatchNormalization` documentation).
return tf.layers.batch_normalization(
automatically choosing GPU
Using GPU 0
loading data
loading data
loading data
0.34
0.25
Model restored.
Using channel 1
Inference...
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1380, in _do_call
return fn(*args)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1363, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1456, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) UNKNOWN: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node downsampling/ld0/Conv2D}}]]
[[Softmax/_123]]
(1) UNKNOWN: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node downsampling/ld0/Conv2D}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/UnMicst1-5.py", line 848, in <module>
PM = np.uint8(255 * UNet2D.singleImageInference(cells, 'accumulate',
File "/app/UnMicst1-5.py", line 704, in singleImageInference
output = UNet2D.Session.run(UNet2D.nn, feed_dict={UNet2D.tfData: batchData, UNet2D.tfTraining: 0})
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 970, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1193, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1373, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1399, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) UNKNOWN: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node downsampling/ld0/Conv2D
(defined at /app/UnMicst1-5.py:102)
]]
[[Softmax/_123]]
(1) UNKNOWN: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node downsampling/ld0/Conv2D
(defined at /app/UnMicst1-5.py:102)
]]
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node downsampling/ld0/Conv2D:
In[0] placeholders/data (defined at /app/UnMicst1-5.py:80)
In[1] downsampling/ld0/kernelD0/read (defined at /app/UnMicst1-5.py:86)
Operation defined at: (most recent call last)
>>> File "/app/UnMicst1-5.py", line 770, in <module>
>>> UNet2D.singleImageInferenceSetup(modelPath, GPU, args.mean, args.std)
>>>
>>> File "/app/UnMicst1-5.py", line 660, in singleImageInferenceSetup
>>> UNet2D.setupWithHP(hp)
>>>
>>> File "/app/UnMicst1-5.py", line 43, in setupWithHP
>>> UNet2D.setup(hp['imSize'],
>>>
>>> File "/app/UnMicst1-5.py", line 150, in setup
>>> dsX.append(down_samp_layer(dsX[i], i))
>>>
>>> File "/app/UnMicst1-5.py", line 102, in down_samp_layer
>>> c00 = tf.nn.conv2d(data, ldXWeights1, strides=[1, 1, 1, 1], padding='SAME')
>>>
Input Source operations connected to node downsampling/ld0/Conv2D:
In[0] placeholders/data (defined at /app/UnMicst1-5.py:80)
In[1] downsampling/ld0/kernelD0/read (defined at /app/UnMicst1-5.py:86)
Operation defined at: (most recent call last)
>>> File "/app/UnMicst1-5.py", line 770, in <module>
>>> UNet2D.singleImageInferenceSetup(modelPath, GPU, args.mean, args.std)
>>>
>>> File "/app/UnMicst1-5.py", line 660, in singleImageInferenceSetup
>>> UNet2D.setupWithHP(hp)
>>>
>>> File "/app/UnMicst1-5.py", line 43, in setupWithHP
>>> UNet2D.setup(hp['imSize'],
>>>
>>> File "/app/UnMicst1-5.py", line 150, in setup
>>> dsX.append(down_samp_layer(dsX[i], i))
>>>
>>> File "/app/UnMicst1-5.py", line 102, in down_samp_layer
>>> c00 = tf.nn.conv2d(data, ldXWeights1, strides=[1, 1, 1, 1], padding='SAME')
>>>
Original stack trace for 'downsampling/ld0/Conv2D':
File "/app/UnMicst1-5.py", line 770, in <module>
UNet2D.singleImageInferenceSetup(modelPath, GPU, args.mean, args.std)
File "/app/UnMicst1-5.py", line 660, in singleImageInferenceSetup
UNet2D.setupWithHP(hp)
File "/app/UnMicst1-5.py", line 43, in setupWithHP
UNet2D.setup(hp['imSize'],
File "/app/UnMicst1-5.py", line 150, in setup
dsX.append(down_samp_layer(dsX[i], i))
File "/app/UnMicst1-5.py", line 102, in down_samp_layer
c00 = tf.nn.conv2d(data, ldXWeights1, strides=[1, 1, 1, 1], padding='SAME')
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py", line 1096, in op_dispatch_handler
return dispatch_target(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/nn_ops.py", line 2431, in conv2d
return gen_nn_ops.conv2d(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 969, in conv2d
_, _, _op, _outputs = _op_def_library._apply_op_helper(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
op = g._create_op_internal(op_type_name, inputs, dtypes=None,
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3697, in _create_op_internal
ret = Operation(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 2101, in __init__
self._traceback = tf_stack.extract_stack_for_node(self._c_op)
Here is the output of the grep singularity work/*/*/.command.run command:
set +u; env - PATH="$PATH" ${TMP:+SINGULARITYENV_TMP="$TMP"} ${TMPDIR:+SINGULARITYENV_TMPDIR="$TMPDIR"} ${NXF_TASK_WORKDIR:+SINGULARITYENV_NXF_TASK_WORKDIR="$NXF_TASK_WORKDIR"} singularity exec --no-home --pid -B /localtmp/test2/work -C -H "$PWD" --nv /localtmp/test2/work/singularity/labsyspharm-unmicst-2.7.7.img /bin/bash -ue /localtmp/test2/work/5f/2e7acc7242e8aa232b177fa56be9da/.command.sh
I haven't had the time to try all three segmentation options. I'll get back to you on that when I'm done.
By the way, is there a way to limit the number of CPU cores used during the run if we were to not run it with the GPU?
Hi @SalimSoria,
I think the reason it wasn't finding your GPU with the custom.config is capitalization. singularity and docker should be all lowercase, while runOptions should have the O capitalized.
The GPU config profile does effectively the same thing, and based on what you shared, it looks like the GPU is now visible inside the container.
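As a rough illustration of the capitalization point, here is a small hypothetical lint for a custom.config. It is a sketch only: it knows just the two scopes and the one option named in this thread, and the function name and warning texts are made up. It also flags typographic dashes, since the runOptions value quoted earlier appears to contain some (possibly a copy-and-paste artifact rather than what is in the actual file).

```python
import re

# Scope and option spellings as named in this thread; the config is case-sensitive.
KNOWN_SCOPES = {"singularity", "docker"}
KNOWN_OPTIONS = {"runOptions"}
# En dash and em dash, which editors sometimes substitute for an ASCII hyphen.
FANCY_DASHES = "\u2013\u2014"

ASSIGNMENT = re.compile(r"^\s*(\w+)\.(\w+)\s*=\s*(.*)$")


def lint_config(text):
    """Return a list of warnings for simple 'scope.option = value' lines."""
    warnings = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = ASSIGNMENT.match(line)
        if not match:
            continue
        scope, option, value = match.groups()
        if scope not in KNOWN_SCOPES and scope.lower() in KNOWN_SCOPES:
            warnings.append(f"line {number}: scope '{scope}' should be '{scope.lower()}'")
        for known in KNOWN_OPTIONS:
            if option != known and option.lower() == known.lower():
                warnings.append(f"line {number}: option '{option}' should be '{known}'")
        if any(dash in value for dash in FANCY_DASHES):
            warnings.append(f"line {number}: value contains a non-ASCII dash")
    return warnings
```

Running lint_config on the text of a custom.config returns an empty list when the scope and option names are spelled as expected and the value uses plain hyphens.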
Unfortunately, cuDNN errors are notoriously hard to debug. @clarenceyapp can chime in here, but the two most common issues are 1) driver incompatibility and 2) out-of-GPU-memory issues. Your driver version 550.54.15 is very new, while UnMicst is still based on a fairly old version of TensorFlow, so there could be some incompatibilities there. I don't suspect memory issues, simply because exemplar-001 is tiny and UnMicst already implements the standard suggestion for these types of errors.
I am curious to see whether you have similar issues with Mesmer and Cellpose, but if we were to debug UnMicst, the next step would be to launch the TensorFlow container that UnMicst is based on in an interactive session with:
singularity shell -C --nv docker://tensorflow/tensorflow:2.7.1-gpu
Then, once inside the container, start a python shell and interactively type import tensorflow.compat.v1 as tf, followed by these commands: https://github.com/HMS-IDAC/UnMicst/blob/master/UnMicst1-5.py#L434-L438 to see if you can reproduce the cuDNN issue. From there, we would try either a newer TensorFlow container, to identify whether it's a version compatibility issue, or additional GPU config options, to rule out out-of-memory issues.
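Since the reported failure is raised at the first Conv2D node, a reproduction may need to run at least one convolution rather than only create a session. The following is a minimal sketch under that assumption, not the UnMicst code itself: the function name, tensor shapes, and status strings are invented for illustration, and it uses the allow_growth and allow_soft_placement session options.

```python
def cudnn_smoke_test():
    """Run one small convolution in a TF1-style session; return a status string."""
    try:
        import tensorflow.compat.v1 as tf
    except ImportError:
        return "tensorflow not installed"
    import numpy as np

    # Build the op inside an explicit graph so that placeholders work under TF 2.x.
    graph = tf.Graph()
    with graph.as_default():
        data = tf.placeholder(tf.float32, shape=[1, 64, 64, 1])
        kernel = tf.constant(np.ones((3, 3, 1, 8), dtype=np.float32))
        conv = tf.nn.conv2d(data, kernel, strides=[1, 1, 1, 1], padding="SAME")

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True   # allocate GPU memory on demand
    config.allow_soft_placement = True       # fall back to CPU if an op has no GPU kernel

    try:
        with tf.Session(graph=graph, config=config) as sess:
            batch = np.zeros((1, 64, 64, 1), dtype=np.float32)
            out = sess.run(conv, feed_dict={data: batch})
    except tf.errors.UnknownError as err:
        # "Failed to get convolution algorithm" is reported as an UnknownError.
        return f"convolution failed: {err.message}"
    return f"convolution ok, output shape {out.shape}"


if __name__ == "__main__":
    print(cudnn_smoke_test())
```

If this prints a "convolution failed" message inside the plain TensorFlow container, the problem is likely independent of UnMicst; if it succeeds there, the UnMicst image itself becomes the more likely suspect.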
To limit CPUs, it should just be a matter of adding --cpus 4 to singularity.runOptions.
What would be the best driver version to have for UnMicst? We have no problem changing to an older driver version if need be.
We followed the instructions to test UnMicst.
# singularity shell -C --nv docker://tensorflow/tensorflow:2.7.1-gpu
...
Singularity> python
Python 3.8.10 (default, Nov 26 2021, 20:14:08)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow.compat.v1 as tf
>>> saver = tf.train.Saver()
WARNING:tensorflow:Saver is deprecated, please switch to tf.train.Checkpoint or tf.keras.Model.save_weights for training checkpoints. When executing eagerly variables do not necessarily have unique names, and so the variable.name-based lookups Saver performs are error-prone.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 899, in __init__
raise RuntimeError(
RuntimeError: When eager execution is enabled, `var_list` must specify a list or dict of variables to save
>>> config = tf.ConfigProto()
>>> config.gpu_options.allow_growth = True
>>> config.allow_soft_placement = True
>>> sess = tf.Session(config=config)
2024-04-18 00:03:57.731050: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-18 00:04:04.536372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22209 MB memory: -> device: 0, name: Tesla M40 24GB, pci bus id: 0000:82:00.0, compute capability: 5.2
>>>
^ @clarenceyapp This is probably a question for you.
Hi @SalimSoria, there appear to be reports of incompatibility between TensorFlow 2.7 and CUDA 12. I'm using CUDA 11.3.1. Also, just to confirm, cuDNN needs to be installed (I'm using version 8.2.1). Please let us know if that works.
This link refers to an old article but is relevant. It suggests using even slightly older CUDA versions. Anything from CUDA version 11 should work.
The bioinformatics department looked at the installed dependencies in the labsyspharm-unmicst-2.7.7.img docker image and listed them as:
CUDA: 11.0 and 11.2 are installed, but 11.2 is the default
cuDNN: libcudnn.so.8.1.0
tensorflow: 2.7.1
They then installed cuda-11.2 on the host so that it could install the matching kernel driver (Kernel Driver Version 460.27.04). It seems this may have fixed the issue, since the GPU was used (based on the log), but the cuDNN error was still present. They also mentioned that the CPU was pegged during the unmicst step, and we were wondering whether this step is CPU-bound even when using the GPU. Nonetheless, this run finished using both the CPU and the GPU, without really improving the time to completion.
Here is the command log for the unmicst step:
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/compat/v2_compat.py:111: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Tue Apr 23 20:35:28 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla M40 24GB Off | 00000000:82:00.0 Off | 0 |
| N/A 42C P0 60W / 250W | 0MiB / 22945MiB | 97% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
2024-04-23 20:35:32.789820: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
/app/UnMicst1-5.py:114: UserWarning: `tf.layers.batch_normalization` is deprecated and will be removed in a future version. Please use `tf.keras.layers.BatchNormalization` instead. In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.BatchNormalization` documentation).
bn = tf.nn.leaky_relu(tf.layers.batch_normalization(c00+shortcut, training=UNet2D.tfTraining))
/usr/local/lib/python3.8/dist-packages/keras/legacy_tf_layers/normalization.py:455: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
return layer.apply(inputs, training=training)
/app/UnMicst1-5.py:136: UserWarning: `tf.layers.batch_normalization` is deprecated and will be removed in a future version. Please use `tf.keras.layers.BatchNormalization` instead. In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.BatchNormalization` documentation).
lbn = tf.nn.leaky_relu(tf.layers.batch_normalization(
/app/UnMicst1-5.py:139: UserWarning: `tf.layers.dropout` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dropout` instead.
return tf.layers.dropout(lbn, 0.35, training=UNet2D.tfTraining)
/usr/local/lib/python3.8/dist-packages/keras/legacy_tf_layers/core.py:401: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
return layer.apply(inputs, training=training)
/app/UnMicst1-5.py:199: UserWarning: `tf.layers.batch_normalization` is deprecated and will be removed in a future version. Please use `tf.keras.layers.BatchNormalization` instead. In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.BatchNormalization` documentation).
tf.layers.batch_normalization(tf.nn.conv2d(cc, luXWeights2, strides=[1, 1, 1, 1], padding='SAME'),
/app/UnMicst1-5.py:220: UserWarning: `tf.layers.batch_normalization` is deprecated and will be removed in a future version. Please use `tf.keras.layers.BatchNormalization` instead. In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.BatchNormalization` documentation).
return tf.layers.batch_normalization(
automatically choosing GPU
Using GPU 0
loading data
loading data
loading data
0.34
0.25
Model restored.
Using channel 1
Inference...
Inference...
Inference...
Hi @SalimSoria, I think that is looking better than before. With the exception of the failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error message, all other warning messages are expected when a GPU has been found.
There is an optional image resizing step before image inference that might be CPU heavy but other than that, CPU loads should be small.
Can you let me know:
Hi @clarenceyapp
Sorry I didn't mention this before. (1) The run I showed the command log for used exemplar-002. (2) This run took approximately 16 minutes with the default params.yml (the only change being -profile singularity,GPU). Before the GPU was successfully detected, it took about 19 minutes to complete.
I'll check how much quicker the UnMicst step runs on exemplar-001 with the GPU versus the CPU.
@ArtemSokolov
Sorry, could you explain setting singularity.runOptions to --cpus 4? Is this something that can be set using the command singularity run [run options...] <container>? Or can I set this within a custom params.yml?
Oh yeah, sorry, this goes inside a custom.config:
singularity.runOptions = '-C -H "$PWD" --nv --cpus 4'
which you can supply to the pipeline with
nextflow run labsyspharm/mcmicro --in exemplar-001 -profile singularity,GPU -c custom.config
I realize that params.yml vs. custom.config is a potential source of confusion. A good rule of thumb for Nextflow pipelines is:
params.yml controls the pipeline behavior (which modules to run, what parameters to pass to each, what versions of containers to pull, etc.)
custom.config controls how the pipeline is executed on any given infrastructure (whether or not to expose GPUs, how many CPUs/memory to allocate to each process, whether to trigger automatic restarts, etc.)
@ArtemSokolov I tried setting this in the custom.config and got the error Error for command "exec": unknown flag: --cpus followed by a list of possible arguments that doesn't include --cpus. However, I did see the option --vm-cpu.
Here is the command.log
$ nextflow run labsyspharm/mcmicro --in ./exemplar-001 -c custom.config -profile singularity,GPU
N E X T F L O W ~ version 23.10.1
Launching `https://github.com/labsyspharm/mcmicro` [fabulous_aryabhata] DSL2 - revision: 69ee2efe21 [master]
executor > local (1)
[- ] process > illumination -
[1b/cac782] process > registration:ashlar (1) [100%] 1 of 1, failed: 1 ✘
[- ] process > background:backsub -
[- ] process > dearray:coreograph -
[- ] process > dearray:roadie:runTask -
[- ] process > segmentation:roadie:runTask -
[- ] process > segmentation:worker -
[- ] process > segmentation:s3seg -
[- ] process > quantification:mcquant -
[- ] process > downstream:worker -
[- ] process > viz:autominerva -
ERROR ~ Error executing process > 'registration:ashlar (1)'
Caused by:
Process `registration:ashlar (1)` terminated with an error exit status (1)
Command executed:
ashlar 'exemplar-001-cycle-06.ome.tiff' 'exemplar-001-cycle-07.ome.tiff' 'exemplar-001-cycle-08.ome.tiff' -m 30 --ffp exemplar-001-cycle-06-ffp.tif exemplar-001-cycle-07-ffp.tif exemplar-001-cycle-08-ffp.tif --dfp exemplar-001-cycle-06-dfp.tif exemplar-001-cycle-07-dfp.tif exemplar-001-cycle-08-dfp.tif -o exemplar-001.ome.tif
Command exit status:
1
Command output:
(empty)
Command error:
--no-home do NOT mount users home directory if home
is not the current working directory
--no-init do NOT start shim process with --pid
--no-nv
--no-privs drop all privileges from root user in container)
--nohttps do NOT use HTTPS with the docker://
transport (useful for local docker
registries without a certificate)
--nonet disable VM network handling
--nv enable experimental Nvidia support
-o, --overlay strings use an overlayFS image for persistent data
storage or as read-only layer of container
--passphrase prompt for an encryption passphrase
--pem-path string enter an path to a PEM formated RSA key for
an encrypted container
-p, --pid run container in a new PID namespace
--pwd string initial working directory for payload
process inside the container
--rocm enable experimental Rocm support
-S, --scratch strings include a scratch directory within the
container that is linked to a temporary dir
(use -W to force location)
--security strings enable security features (SELinux,
Apparmor, Seccomp)
-u, --userns run container in a new user namespace,
allowing Singularity to run completely
unprivileged on recent kernels. This
disables some features of Singularity, for
example it only works with sandbox images.
--uts run container in a new UTS namespace
--vm enable VM support
--vm-cpu string number of CPU cores to allocate to Virtual
Machine (implies --vm) (default "1")
--vm-err enable attaching stderr from VM
--vm-ip string IP Address to assign for container usage.
Defaults to DHCP within bridge network.
(default "dhcp")
--vm-ram string amount of RAM in MiB to allocate to Virtual
Machine (implies --vm) (default "1024")
-W, --workdir string working directory to be used for /tmp,
/var/tmp and $HOME (if -c/--contain was
also used)
-w, --writable by default all Singularity containers are
available as read only. This option makes
the file system accessible as read/write.
--writable-tmpfs makes the file system accessible as
read-write with non persistent data (with
overlay support only)
Run 'singularity exec --help' for more detailed usage information.
You may need to chase down the best way to do it for your Singularity distribution. It seems that the --cpus option may only be available in the CE (community edition) distribution: https://docs.sylabs.io/guides/main/user-guide/cgroups.html#cpu-limits
Looks like the Apptainer distribution (which is likely what you have) uses a completely different configuration method: https://apptainer.org/user-docs/master/cgroups.html#limiting-container-resources-with-cgroups
The challenge with this method is making cgroups.toml visible inside the container. One option is to put cgroups.toml in a fixed location on your system and then mount that location to every container. So, something like this:
singularity.runOptions = '-C -H "$PWD" --nv -B /path/to/fixed/loc --apply-cgroups /path/to/fixed/loc/cgroups.toml'
--vm-cpu doesn't sound right to me, because it says the default is "1", but it sounds like your runs are using more than that? But maybe it's worth a try to see if this option works.
You can also try telling Nextflow that you only want to use 4 CPUs for each process, and see if it can figure out what to do. This is also done in the custom.config:
process.cpus = 4
singularity.runOptions = '-C -H "$PWD" --nv'
We found a workaround using cgroups. We were able to complete both exemplar-001 and exemplar-002 while also limiting the number of available CPU cores.
However, I may need to revisit this ticket in the future since there are plans to upgrade some of the server's components including its GPU. Thanks again!
My next issue is getting MCMICRO to work on my images which I'll submit a ticket for.
Great to hear you got it working. I will close the issue for now, but feel free to reopen if/when you have follow up issues.
The bioinformatics department and I are working on setting up an MCMICRO pipeline on a remote Linux server. We eventually got both tutorial images (exemplar-001 and exemplar-002) to finish the pipeline in less than the expected amount of time. However, we want to try running the pipeline using the server's GPU for the unmicst step, since we expect faster results.
We tried creating a custom.config file (as per: https://github.com/labsyspharm/mcmicro/issues/354) and passing the following command line:
nextflow run labsyspharm/mcmicro --in exemplar-001 -c custom.config -profile singularity
Although the run was successful, the command log showed that the GPU was never found and the run again fell back to using the CPU + RAM. We've also tried manually adding the --nv argument in singularity.config, but it seems to be ignored. We've also ensured that nvidia-smi can be accessed from the container.
The server is using a Tesla M40 24GB running CUDA 11.0.
Here is the command.log