gschramm / pyapetnet

a CNN for anatomy-guided deconvolution and denoising of PET images
https://gschramm.github.io/pyapetnet/
MIT License

Issue with running demo data #5

Closed cwying closed 1 year ago

cwying commented 1 year ago

Hello!

I just installed pyapetnet on our server following the official instructions and confirmed that the installation was successful. However, when I tried the demo data (both DICOM and NIFTI), I got the following error. Do you have any suggestions on how to solve this issue? Thank you!

Traceback (most recent call last):
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/bin/pyapetnet_predict_from_dicom", line 8, in <module>
    sys.exit(main())
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/pyapetnet/predict_from_dicom.py", line 168, in main
    pred = model.predict(x).squeeze()
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Detected at node 'functional_3/functional_1/batchnorm_ind_0_1/FusedBatchNormV3_2' defined at (most recent call last):
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/bin/pyapetnet_predict_from_dicom", line 8, in <module>
    sys.exit(main())
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/pyapetnet/predict_from_dicom.py", line 168, in main
    pred = model.predict(x).squeeze()
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
    return fn(*args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/engine/training.py", line 2350, in predict
    tmp_batch_outputs = self.predict_function(iterator)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/engine/training.py", line 2137, in predict_function
    return step_function(self, iterator)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/engine/training.py", line 2123, in step_function
    outputs = model.distribute_strategy.run(run_step, args=(data,))
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/engine/training.py", line 2111, in run_step
    outputs = model.predict_step(data)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/engine/training.py", line 2079, in predict_step
    return self(x, training=False)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
    return fn(*args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/engine/training.py", line 561, in __call__
    return super().__call__(*args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
    return fn(*args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1132, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
    return fn(*args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/engine/functional.py", line 511, in call
    return self._run_internal_graph(inputs, training=training, mask=mask)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/engine/functional.py", line 668, in _run_internal_graph
    outputs = node.layer(*args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
    return fn(*args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/engine/training.py", line 561, in __call__
    return super().__call__(*args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
    return fn(*args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1132, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
    return fn(*args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/engine/functional.py", line 511, in call
    return self._run_internal_graph(inputs, training=training, mask=mask)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/engine/functional.py", line 668, in _run_internal_graph
    outputs = node.layer(*args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
    return fn(*args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1132, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
    return fn(*args, **kwargs)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/layers/normalization/batch_normalization.py", line 866, in call
    outputs = self._fused_batch_norm(inputs, training=training)
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/layers/normalization/batch_normalization.py", line 659, in _fused_batch_norm
    output, mean, variance = control_flow_util.smart_cond(
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/utils/control_flow_util.py", line 108, in smart_cond
    return tf.__internal__.smart_cond.smart_cond(
  File "/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/layers/normalization/batch_normalization.py", line 648, in _fused_batch_norm_inference
    return tf.compat.v1.nn.fused_batch_norm(
Node: 'functional_3/functional_1/batchnorm_ind_0_1/FusedBatchNormV3_2'
scale must have the same number of elements as the channels of x, got 15 and 1
  [[{{node functional_3/functional_1/batchnorm_ind_0_1/FusedBatchNormV3_2}}]] [Op:__inference_predict_function_14243]

gschramm commented 1 year ago

Hi, thanks for reporting this issue. Can you report the following to debug the issue:

cwying commented 1 year ago

Hi Georg,

Thank you for your prompt response! Please see the information below:

Name                          Version      Build            Channel
_libgcc_mutex                 0.1          main
_openmp_mutex                 5.1          1_gnu
absl-py                       1.4.0        pypi_0           pypi
astunparse                    1.6.3        pypi_0           pypi
bzip2                         1.0.8        h7b6447c_0
ca-certificates               2023.01.10   h06a4308_0
cachetools                    5.3.0        pypi_0           pypi
certifi                       2022.12.7    py310h06a4308_0
charset-normalizer            3.0.1        pypi_0           pypi
contourpy                     1.0.7        pypi_0           pypi
cycler                        0.11.0       pypi_0           pypi
flatbuffers                   23.1.21      pypi_0           pypi
fonttools                     4.38.0       pypi_0           pypi
gast                          0.4.0        pypi_0           pypi
google-auth                   2.16.0       pypi_0           pypi
google-auth-oauthlib          0.4.6        pypi_0           pypi
google-pasta                  0.2.0        pypi_0           pypi
grpcio                        1.51.1       pypi_0           pypi
h5py                          3.8.0        pypi_0           pypi
idna                          3.4          pypi_0           pypi
imageio                       2.25.0       pypi_0           pypi
keras                         2.11.0       pypi_0           pypi
kiwisolver                    1.4.4        pypi_0           pypi
ld_impl_linux-64              2.38         h1181459_1
libclang                      15.0.6.1     pypi_0           pypi
libffi                        3.4.2        h6a678d5_6
libgcc-ng                     11.2.0       h1234567_1
libgomp                       11.2.0       h1234567_1
libstdcxx-ng                  11.2.0       h1234567_1
libuuid                       1.41.5       h5eee18b_0
llvmlite                      0.39.1       pypi_0           pypi
markdown                      3.4.1        pypi_0           pypi
markupsafe                    2.1.2        pypi_0           pypi
matplotlib                    3.6.3        pypi_0           pypi
ncurses                       6.3          h5eee18b_3
networkx                      3.0          pypi_0           pypi
nibabel                       5.0.0        pypi_0           pypi
numba                         0.56.4       pypi_0           pypi
numpy                         1.23.5       pypi_0           pypi
oauthlib                      3.2.2        pypi_0           pypi
openssl                       1.1.1s       h7f8727e_0
opt-einsum                    3.3.0        pypi_0           pypi
packaging                     23.0         pypi_0           pypi
pillow                        9.4.0        pypi_0           pypi
pip                           22.3.1       py310h06a4308_0
protobuf                      3.19.6       pypi_0           pypi
pyapetnet                     1.5.1        pypi_0           pypi
pyasn1                        0.4.8        pypi_0           pypi
pyasn1-modules                0.2.8        pypi_0           pypi
pydicom                       2.3.1        pypi_0           pypi
pymirc                        0.28         pypi_0           pypi
pyparsing                     3.0.9        pypi_0           pypi
python                        3.10.9       h7a1cb2a_0
python-dateutil               2.8.2        pypi_0           pypi
pywavelets                    1.4.1        pypi_0           pypi
readline                      8.2          h5eee18b_0
requests                      2.28.2       pypi_0           pypi
requests-oauthlib             1.3.1        pypi_0           pypi
rsa                           4.9          pypi_0           pypi
scikit-image                  0.19.3       pypi_0           pypi
scipy                         1.10.0       pypi_0           pypi
setuptools                    65.6.3       py310h06a4308_0
simpleitk                     2.2.1        pypi_0           pypi
six                           1.16.0       pypi_0           pypi
sqlite                        3.40.1       h5082296_0
tensorboard                   2.11.2       pypi_0           pypi
tensorboard-data-server       0.6.1        pypi_0           pypi
tensorboard-plugin-wit        1.8.1        pypi_0           pypi
tensorflow                    2.11.0       pypi_0           pypi
tensorflow-estimator          2.11.0       pypi_0           pypi
tensorflow-io-gcs-filesystem  0.30.0       pypi_0           pypi
termcolor                     2.2.0        pypi_0           pypi
tifffile                      2023.1.23.1  pypi_0           pypi
tk                            8.6.12       h1ccaba5_0
typing-extensions             4.4.0        pypi_0           pypi
tzdata                        2022g        h04d1e81_0
urllib3                       1.26.14      pypi_0           pypi
werkzeug                      2.2.2        pypi_0           pypi
wheel                         0.37.1       pyhd3eb1b0_0
wrapt                         1.14.1       pypi_0           pypi
xz                            5.2.10       h5eee18b_1
zlib                          1.2.13       h5eee18b_0
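The versions most relevant to this error (pyapetnet, tensorflow, keras) can also be read from inside the environment with the standard library. This is a minimal sketch, not part of pyapetnet: `report_versions` is a hypothetical helper name, and the default package list is only an assumption about what is worth checking.

```python
from importlib import metadata


def report_versions(packages=("pyapetnet", "tensorflow", "keras", "pymirc", "numpy")):
    """Return a dict mapping each package name to its installed version, or None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in the active environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name:12s} {version or 'not installed'}")
```

Running this inside the activated conda env prints one line per package, which is easier to paste into an issue than the full package list.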

2023-01-25 13:33:22.396931: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-25 13:33:22.523543: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-01-25 13:33:26.200905: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
2023-01-25 13:33:26.201042: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
2023-01-25 13:33:26.201060: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:tensorflow:SavedModel saved prior to TF 2.5 detected when loading Keras model. Please ensure that you are saving the model with model.save() or tf.keras.models.save_model(), NOT tf.saved_model.save(). To confirm, there should be a file named "keras_metadata.pb" in the SavedModel directory.
2023-01-25 13:33:33.319327: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
2023-01-25 13:33:33.319363: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
2023-01-25 13:33:33.319929: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/keras/layers/core/lambda_layer.py:324: UserWarning: tensorflow.python.keras.utils.multi_gpu_utils is not loaded, but a Lambda layer uses it. It may cause errors.
  function = cls._parse_function_from_config(
/bmr207/nmrgrp/nmr175/.conda/envs/pyapetnet/lib/python3.10/site-packages/pymirc/fileio/read_dicom.py:134: UserWarning: Cannot find SeriesType in first dicom header. Setting it to ('STATIC', 'IMAGE')
  warnings.warn(f'Cannot find SeriesType in first dicom header. Setting it to {fallback_series_type}')

On a separate note, is there a required or recommended CUDA version for this tool? My understanding is that the prediction part uses the CPU, so it does not matter much, but we are planning to retrain this model with additional data.

Thank you!

gschramm commented 1 year ago
  1. I just tried it myself on my local Ubuntu 20.04 system with CUDA. Installation and the test predictions both worked :/ (the same tensorflow version, 2.11, got installed).
  2. From the error messages that you posted, I think your tensorflow installation is not happy with your CUDA installation (it complains about several missing CUDA libraries). I am not sure, though, whether this is the cause.

To test if it is indeed related to problems with tensorflow + CUDA, can you:

  1. remove your pyapetnet conda env again: conda remove -n pyapetnet --all
  2. create a new empty env: conda create -n pyapetnet, then conda activate pyapetnet
  3. Install tensorflow following the official guide here: https://www.tensorflow.org/install/pip
  4. Install pyapetnet: pip install pyapetnet
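
After step 4, a short check from Python can show whether tensorflow imports at all, which CUDA/cuDNN versions it was built against, and which GPUs it sees. This is a minimal sketch: `tensorflow_report` is a hypothetical helper name, and it assumes a tensorflow release recent enough to provide `tf.sysconfig.get_build_info()`. On CPU-only builds the CUDA entries are simply reported as None.

```python
import importlib.util


def tensorflow_report():
    """Collect basic facts about the tensorflow install, or note that it is absent."""
    if importlib.util.find_spec("tensorflow") is None:
        return {"installed": False}

    import tensorflow as tf  # imported lazily so the check works without tensorflow

    build = tf.sysconfig.get_build_info()
    return {
        "installed": True,
        "version": tf.__version__,
        # These keys are only present for CUDA-enabled builds.
        "cuda_version": build.get("cuda_version"),
        "cudnn_version": build.get("cudnn_version"),
        # An empty list means tensorflow will fall back to the CPU.
        "gpus": [dev.name for dev in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    for key, value in tensorflow_report().items():
        print(f"{key}: {value}")
```

If `gpus` comes back empty while warnings about missing libraries such as libcudnn.so.8 appear, tensorflow could not load its GPU libraries and is running on the CPU only.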

In the official guide, the matching cuda-toolkit version (11.2) and cudnn version (8.1) get installed in your conda env.

You can find the required CUDA version for tensorflow here: https://www.tensorflow.org/install/source#tested_build_configurations

The install is a bit messy because, in contrast to pytorch, tensorflow did not manage to put their libs on conda-forge.

cwying commented 1 year ago

Thank you! I will test it and let you know. May I ask which CUDA driver version you are currently using (as reported by nvidia-smi)?

gschramm commented 1 year ago

In my test I used CUDA version 12.0 and driver version 525.78.01, so probably not optimal (maybe I should also stick to the official tensorflow install instructions).

cwying commented 1 year ago

Thank you so much for the information!

cwying commented 1 year ago

Hi Georg,

I tried installing tf first following the official guide, but it still did not work. I then installed tf via conda first, even though that is not recommended (conda install -c conda-forge tensorflow-gpu), and afterwards installed pyapetnet via pip, and it works! I am not sure why. Maybe conda just resolves the compatibility issues better...

Anyway, I have one more question: how long does it take you to run the prediction on one demo case? It took me more than 1 hour on a server with an A100 GPU, which does not make sense to me.

Thanks again for your help!

gschramm commented 1 year ago

(1) Glad that at least the installation works. Prediction of the demo case (even without using any GPU) should take less than one minute. If it takes more than 1 hour, then something is severely wrong (but I have no clue what). If you "only" want to do predictions, you don't need any GPU. Can you maybe try to install tensorflow without GPU support?

(2) Another option: would it be possible for you to run the prediction from a docker container? We have a Dockerfile here: https://github.com/gschramm/pyapetnet/blob/master/Dockerfile
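
For the CPU-only test suggested in (1), one way to avoid reinstalling is to hide the GPUs from the process and time the call. This is a minimal sketch under stated assumptions: `hide_gpus` and `timed` are hypothetical helper names, and the workload in the demo is a placeholder standing in for the actual prediction call (`model.predict(x)` in the traceback). Setting `CUDA_VISIBLE_DEVICES` to `-1` is a common way to make CUDA report no devices, but it only has an effect if it is set before tensorflow is imported.

```python
import os
import time


def hide_gpus():
    """Hide all CUDA devices from this process; call before importing tensorflow."""
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    hide_gpus()
    # Placeholder workload; replace with the real prediction call when testing.
    result, seconds = timed(sum, range(1_000_000))
    print(f"finished in {seconds:.3f} s")
```

Comparing the elapsed time with and without `hide_gpus()` gives a rough indication of whether the slowdown is tied to the GPU setup.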

cwying commented 1 year ago

Thank you so much for the information! We will try the options you recommended. Hopefully it will work as expected soon! I will close this issue for now.