robbert-harms / MDT

Microstructure Diffusion Toolbox
GNU Lesser General Public License v3.0

pyopencl.cffi_cl.RuntimeError: clBuildProgram failed: BUILD_PROGRAM_FAILURE - #53

stillill closed this issue 11 months ago

stillill commented 1 year ago

Hello,

I am running MDT out of a Singularity container and am getting a runtime error when running the mdt-model-fit command. Is there any way to figure out what the problem is?

Singularity> mdt-create-protocol 119576/V12-CVD/*.bvec 119576/V12-CVD/*.bval                                                 
Singularity> mdt-model-fit NODDI 119576/V12-CVD/sub-119576_ses-V12-CVD_dwi.nii.gz 119576/V12-CVD/sub-119576_ses-V12-CVD_dwi.prtcl 119576/V12-CVD/*brainmask.nii.gz
[2023-09-05 12:14:06,835] [INFO] [mdt.lib.processing.model_fitting] [get_model_fit] - Starting intermediate optimization for generating initialization point.
[2023-09-05 12:14:06,952] [INFO] [mdt.lib.processing.model_fitting] [fit_composite_model] - Using MDT version 1.2.6
[2023-09-05 12:14:06,953] [INFO] [mdt.lib.processing.model_fitting] [fit_composite_model] - Preparing for model BallStick_r1
[2023-09-05 12:14:07,390] [INFO] [mdt.models.composite] [_prepare_input_data] - No volume options to apply, using all 197 volumes.
[2023-09-05 12:14:07,390] [INFO] [mdt.utils] [estimate_noise_std] - Trying to estimate a noise std.
[2023-09-05 12:14:07,484] [INFO] [mdt.utils] [estimate_noise_std] - Estimated global noise std 267.0830932932669.
[2023-09-05 12:14:07,485] [INFO] [mdt.lib.processing.model_fitting] [_model_fit_logging] - Fitting BallStick_r1 model
[2023-09-05 12:14:07,485] [INFO] [mdt.lib.processing.model_fitting] [_model_fit_logging] - The 4 parameters we will fit are: ['S0.s0', 'w_stick0.w', 'Stick0.theta', 'Stick0.phi']
[2023-09-05 12:14:07,485] [INFO] [mdt.lib.processing.model_fitting] [fit_composite_model] - Saving temporary results in 119576/V12-CVD/output/sub-119576_ses-V12-CVD_brainmask/BallStick_r1/tmp_results.
[2023-09-05 12:14:07,665] [INFO] [mdt.lib.processing.processing_strategies] [_process_chunk] - Computations are at 0.00%, processing next 100000 voxels (334452 voxels in total, 0 processed). Time spent: 0:00:00:00, time left: ? (d:h:m:s).
[2023-09-05 12:14:07,666] [INFO] [mdt.lib.processing.model_fitting] [_process] - Starting optimization
[2023-09-05 12:14:07,666] [INFO] [mdt.lib.processing.model_fitting] [_process] - Using MOT version 0.11.3
[2023-09-05 12:14:07,666] [INFO] [mdt.lib.processing.model_fitting] [_process] - We will use a single precision float type for the calculations.
[2023-09-05 12:14:07,666] [INFO] [mdt.lib.processing.model_fitting] [_process] - Using device 'GPU - Tesla V100-SXM2-16GB (NVIDIA CUDA)'.
[2023-09-05 12:14:07,666] [INFO] [mdt.lib.processing.model_fitting] [_process] - Using compile flags: ('-cl-denorms-are-zero', '-cl-mad-enable', '-cl-no-signed-zeros')
[2023-09-05 12:14:07,666] [INFO] [mdt.lib.processing.model_fitting] [_process] - We will use the optimizer Powell with default settings.
Traceback (most recent call last):
  File "/usr/bin/mdt-model-fit", line 11, in <module>
    load_entry_point('mdt==1.2.6', 'console_scripts', 'mdt-model-fit')()
  File "/usr/lib/python3/dist-packages/mdt/lib/shell_utils.py", line 47, in console_script
    cls().start(sys.argv[1:])
  File "/usr/lib/python3/dist-packages/mdt/lib/shell_utils.py", line 66, in start
    self.run(args, {})
  File "/usr/lib/python3/dist-packages/mdt/cli_scripts/mdt_model_fit.py", line 161, in run
    fit_model()
  File "/usr/lib/python3/dist-packages/mdt/cli_scripts/mdt_model_fit.py", line 155, in fit_model
    use_cascaded_inits=args.use_cascaded_inits)
  File "/usr/lib/python3/dist-packages/mdt/__init__.py", line 191, in fit_model
    double_precision=double_precision)
  File "/usr/lib/python3/dist-packages/mdt/__init__.py", line 99, in get_optimization_inits
    double_precision=double_precision)
  File "/usr/lib/python3/dist-packages/mdt/lib/processing/model_fitting.py", line 177, in get_optimization_inits
    return get_init_data(model_name)
  File "/usr/lib/python3/dist-packages/mdt/lib/processing/model_fitting.py", line 115, in get_init_data
    fit_results = get_model_fit('BallStick_r1')
  File "/usr/lib/python3/dist-packages/mdt/lib/processing/model_fitting.py", line 86, in get_model_fit
    cl_device_ind=cl_device_ind, initialization_data={'inits': inits})
  File "/usr/lib/python3/dist-packages/mdt/__init__.py", line 206, in fit_model
    optimizer_options=optimizer_options)
  File "/usr/lib/python3/dist-packages/mdt/lib/processing/model_fitting.py", line 306, in fit_composite_model
    return processing_strategy.process(worker)
  File "/usr/lib/python3/dist-packages/mdt/lib/processing/processing_strategies.py", line 106, in process
    self._process_chunk(processor, chunks)
  File "/usr/lib/python3/dist-packages/mdt/lib/processing/processing_strategies.py", line 151, in _process_chunk
    process()
  File "/usr/lib/python3/dist-packages/mdt/lib/processing/processing_strategies.py", line 148, in process
    processor.process(chunk, next_indices=next_chunk)
  File "/usr/lib/python3/dist-packages/mdt/lib/processing/processing_strategies.py", line 275, in process
    self._process(roi_indices, next_indices=next_indices)
  File "/usr/lib/python3/dist-packages/mdt/lib/processing/model_fitting.py", line 371, in _process
    x0 = self._codec.encode(self._initial_params[roi_indices], kernel_data_subset)
  File "/usr/lib/python3/dist-packages/mdt/model_building/utils.py", line 220, in encode
    parameters, kernel_data, cl_runtime_info=cl_runtime_info)
  File "/usr/lib/python3/dist-packages/mdt/model_building/utils.py", line 257, in _transform_parameters
    cl_named_func.evaluate(kernel_data, parameters.shape[0], cl_runtime_info=cl_runtime_info)
  File "/usr/lib/python3/dist-packages/mot/lib/cl_function.py", line 356, in evaluate
    kernels = get_kernels(kernel_source, cl_function.get_cl_function_name())
  File "/usr/lib/python3/dist-packages/mot/lib/cl_function.py", line 350, in get_kernels
    env.context, kernel_source).build(' '.join(cl_runtime_info.compile_flags))
  File "/usr/lib/python3/dist-packages/pyopencl/__init__.py", line 462, in build
    options_bytes=options_bytes, source=self._source)
  File "/usr/lib/python3/dist-packages/pyopencl/__init__.py", line 506, in _build_and_catch_errors
    raise err
pyopencl.cffi_cl.RuntimeError: clBuildProgram failed: BUILD_PROGRAM_FAILURE -

Build on <pyopencl.Device 'Tesla V100-SXM2-16GB' on 'NVIDIA CUDA' at 0x1f815a0>:

(options: -cl-denorms-are-zero -cl-mad-enable -cl-no-signed-zeros -I /usr/lib/python3/dist-packages/pyopencl/cl)
(source saved as /tmp/tmpqg6b71u6.cl)

Thanks!

robbert-harms commented 12 months ago

Hi Stillill,

Thank you for your inquiry. The log shows that the error is a compilation problem (pyopencl.cffi_cl.RuntimeError: clBuildProgram failed: BUILD_PROGRAM_FAILURE -). Unfortunately, this is not something I can influence. NVIDIA's OpenCL drivers and compilers in particular are unstable.

You could try installing the latest NVIDIA drivers for your system, in the hope that the bug has been fixed in the latest version. On the MDT side, unfortunately, not much can be done.

Best,

Robbert

stillill commented 12 months ago

Hi Robbert,

Thanks for getting back to me about this! No worries if there isn't anything you think you can do here. I've been testing things out more and wanted to add that MDT works fine for me in a Docker container which I created using your Docker.nvidia recipe file. It just doesn't work on a GPU in a Singularity (now Apptainer) container, even when I create the Apptainer image by pulling from my working MDT Docker image. The CPU version of MDT works fine for me in Apptainer, though. I also can't run the hello world demo.py code provided by PyOpenCL on a GPU in Apptainer, but again that works fine in Docker, and I can run demo.py on the CPU using Apptainer. I ended up posting a message to the Apptainer mailing list to see if anyone had ideas as to why MDT would work fine in Docker but not Apptainer, and someone is helping me look into this. One question that came up on the Apptainer mailing list is if the MDT app attempts to write to the container. This would be a problem since Apptainer containers are not writable.

Thanks again!

robbert-harms commented 11 months ago

Hi Stillill,

About your question here: "One question that came up on the Apptainer mailing list is if the MDT app attempts to write to the container. This would be a problem since Apptainer containers are not writable."

To function correctly, MDT requires a few files in your home directory (config files and some model files). At start-up it will try to write these files to your home directory if they are missing. Perhaps this is what is causing the problem?
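A quick way to test this from inside the container is to check whether the home directory accepts new files. The snippet below is only a minimal sketch using the Python standard library. The helper name `can_write` is made up for illustration and is not part of MDT, and the exact files and subdirectory MDT writes are not stated here, so it only probes the home directory itself.

```python
import tempfile
from pathlib import Path


def can_write(directory):
    """Return True if a temporary file can be created (and removed) in `directory`.

    Hypothetical helper for illustration; it is not part of MDT.
    """
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        # Covers read-only file systems, missing directories and permission errors.
        return False


if __name__ == "__main__":
    home = Path.home()
    print(f"home directory: {home}")
    print(f"writable: {can_write(home)}")
```

If this reports that the home directory is not writable when run inside the container, the start-up file creation described above would be a plausible suspect. If it is writable, the cause is more likely elsewhere.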

Best,

Robbert

stillill commented 11 months ago

Hello,

Thanks for the info! Someone on the Apptainer mailing list finally figured out the problem with the container. It turns out the libnvidia-nvvm.so.4 library was not in the container environment. They suggested adding this library to Apptainer's nvliblist.conf file, and that resolved the error.
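For anyone hitting the same error, here is a rough sketch of how one might check for the missing library and add the entry. It makes several assumptions: the helper names `library_loads` and `ensure_listed` are invented for illustration, nvliblist.conf is treated as a plain list with one entry per line and `#` comments, and the location of that file depends on how Apptainer was installed, so the path is left to the caller. Editing the real file usually needs administrator rights on the host.

```python
import ctypes
from pathlib import Path


def library_loads(name):
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        # Raised when the library cannot be found or loaded.
        return False


def ensure_listed(conf_path, entry):
    """Append `entry` to a one-entry-per-line list file if it is not there yet.

    Returns True if the file was modified. The file format is an assumption.
    """
    path = Path(conf_path)
    text = path.read_text()
    # Ignore blank lines and comment lines when looking for existing entries.
    listed = {
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }
    if entry in listed:
        return False
    with path.open("a") as fh:
        # Start on a fresh line if the file does not end with a newline.
        if text and not text.endswith("\n"):
            fh.write("\n")
        fh.write(entry + "\n")
    return True


if __name__ == "__main__":
    # Run inside the container (started with GPU support) to see whether
    # the library from this thread is visible to the loader.
    print("libnvidia-nvvm.so.4 loads:", library_loads("libnvidia-nvvm.so.4"))
```

Running the check inside the container before and after changing nvliblist.conf should show whether the library has become visible. `ensure_listed` would be called on the host with the path to the actual nvliblist.conf of the Apptainer installation.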