nrbennet / dl_binder_design

Segmentation fault: pyrosetta & tensorflow #50

Closed charbj closed 10 months ago

charbj commented 10 months ago

Hi all,

You weren't kidding about the tensorflow/GPU issues... it's a pain.

I have tried many combinations. I can install tensorflow and have it recognise the GPU, but importing tensorflow and then pyrosetta causes an immediate segmentation fault. Importing pyrosetta and then tensorflow instead causes a floating point exception. I have tracked the issue down to those two imports. It happens regardless of Python version (3.8, 3.9, 3.10 and 3.11).

A minimal working example could probably be made by installing pyrosetta and tensorflow.

mamba create -n error_env
mamba install pyrosetta
python -m pip install "tensorflow[and-cuda]"
Python 3.9.18 | packaged by conda-forge | (main, Aug 30 2023, 03:49:32) 
[GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow.compat.v1 as tf
2023-11-07 14:50:01.994451: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-07 14:50:02.022545: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-07 14:50:02.022586: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-07 14:50:02.022607: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-07 14:50:02.028142: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> tf.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
>>> import pyrosetta
Segmentation fault (core dumped)

(Note: I have tried export TF_ENABLE_ONEDNN_OPTS=0, however this does nothing).

Reversing the imports:

>>> import pyrosetta
>>> import tensorflow.compat.v1 as tf
2023-11-07 14:51:10.878175: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-07 14:51:10.909619: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-07 14:51:10.909652: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-07 14:51:10.909675: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Floating point exception (core dumped)
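For anyone trying to reproduce this, here is a rough sketch of how both import orders could be probed without killing the interactive session. Each order runs in a child interpreter with the standard-library faulthandler enabled, so a crash leaves a Python-level traceback on stderr. The helper names (probe_import_order, describe) are made up for illustration, and the negative return code convention assumes a POSIX system.

```python
import signal
import subprocess
import sys


def probe_import_order(modules, timeout=300):
    """Import `modules` in the given order inside a child interpreter.

    Returns (returncode, stderr). The child runs with faulthandler
    enabled, so a fatal signal dumps the Python stack to stderr.
    """
    code = "\n".join(f"import {name}" for name in modules)
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stderr


def describe(returncode):
    """Turn a child return code into a short readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code means the child died from that signal.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"
```

Calling probe_import_order(["tensorflow.compat.v1", "pyrosetta"]) and then the reversed list, and passing each return code to describe, should show which order dies with which signal. The tail of the returned stderr holds the faulthandler traceback.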

Environment:

# packages in environment at /usr/local/programs/miniconda/envs/dl_binder_design:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
absl-py                   2.0.0              pyhd8ed1ab_0    conda-forge
astunparse                1.6.3                    pypi_0    pypi
biopython                 1.81             py39hd1e30aa_1    conda-forge
blas                      2.16                        mkl    conda-forge
bzip2                     1.0.8                hd590300_5    conda-forge
ca-certificates           2023.7.22            hbcca054_0    conda-forge
cachetools                5.3.2                    pypi_0    pypi
certifi                   2023.7.22                pypi_0    pypi
charset-normalizer        3.3.2                    pypi_0    pypi
contextlib2               21.6.0             pyhd8ed1ab_0    conda-forge
cudatoolkit               11.1.74              h6bb024c_0    nvidia
dm-haiku                  0.0.5                    pypi_0    pypi
dm-tree                   0.1.6                    pypi_0    pypi
flatbuffers               23.5.26                  pypi_0    pypi
gast                      0.5.4                    pypi_0    pypi
google-auth               2.23.4                   pypi_0    pypi
google-auth-oauthlib      1.0.0                    pypi_0    pypi
google-pasta              0.2.0                    pypi_0    pypi
idna                      3.4                      pypi_0    pypi
importlib-metadata        6.8.0                    pypi_0    pypi
intel-openmp              2023.1.0         hdb19cb5_46305  
jax                       0.2.19                   pypi_0    pypi
jaxlib                    0.1.70+cuda111           pypi_0    pypi
jmp                       0.0.4                    pypi_0    pypi
keras                     2.14.0                   pypi_0    pypi
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libblas                   3.8.0                    16_mkl    conda-forge
libcblas                  3.8.0                    16_mkl    conda-forge
libclang                  16.0.6                   pypi_0    pypi
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 13.2.0               h807b86a_2    conda-forge
libgfortran-ng            7.5.0               h14aa051_20    conda-forge
libgfortran4              7.5.0               h14aa051_20    conda-forge
libgomp                   13.2.0               h807b86a_2    conda-forge
liblapack                 3.8.0                    16_mkl    conda-forge
liblapacke                3.8.0                    16_mkl    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libsqlite                 3.44.0               h2797004_0    conda-forge
libstdcxx-ng              13.2.0               h7e041cc_2    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libuv                     1.46.0               hd590300_0    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
markdown                  3.5.1                    pypi_0    pypi
markupsafe                2.1.3                    pypi_0    pypi
mkl                       2020.2                      256  
ml-collections            0.1.1              pyhd8ed1ab_0    conda-forge
ml-dtypes                 0.2.0                    pypi_0    pypi
ncurses                   6.4                  h59595ed_2    conda-forge
ninja                     1.11.1               h924138e_0    conda-forge
numpy                     1.22.4           py39hc58783e_0    conda-forge
nvidia-cublas-cu11        11.11.3.6                pypi_0    pypi
nvidia-cuda-cupti-cu11    11.8.87                  pypi_0    pypi
nvidia-cuda-nvcc-cu11     11.8.89                  pypi_0    pypi
nvidia-cuda-runtime-cu11  11.8.89                  pypi_0    pypi
nvidia-cudnn-cu11         8.7.0.84                 pypi_0    pypi
nvidia-cufft-cu11         10.9.0.58                pypi_0    pypi
nvidia-curand-cu11        10.3.0.86                pypi_0    pypi
nvidia-cusolver-cu11      11.4.1.48                pypi_0    pypi
nvidia-cusparse-cu11      11.7.5.86                pypi_0    pypi
nvidia-nccl-cu11          2.16.5                   pypi_0    pypi
oauthlib                  3.2.2                    pypi_0    pypi
openssl                   3.1.4                hd590300_0    conda-forge
opt-einsum                3.3.0                    pypi_0    pypi
pip                       23.3.1             pyhd8ed1ab_0    conda-forge
pyasn1                    0.5.0                    pypi_0    pypi
pyasn1-modules            0.3.0                    pypi_0    pypi
pyrosetta                 2023.44+release.7762b42          py39_0    https://conda.graylab.jhu.edu
python                    3.9.18          h0755675_0_cpython    conda-forge
python_abi                3.9                      4_cp39    conda-forge
pytorch                   1.9.1           py3.9_cuda11.1_cudnn8.0.5_0    pytorch
pyyaml                    6.0.1            py39hd1e30aa_1    conda-forge
readline                  8.2                  h8228510_1    conda-forge
requests                  2.31.0                   pypi_0    pypi
requests-oauthlib         1.3.1                    pypi_0    pypi
rsa                       4.9                      pypi_0    pypi
setuptools                68.2.2             pyhd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
tabulate                  0.9.0                    pypi_0    pypi
tensorboard               2.14.1                   pypi_0    pypi
tensorboard-data-server   0.7.2                    pypi_0    pypi
tensorflow                2.14.0                   pypi_0    pypi
tensorflow-estimator      2.14.0                   pypi_0    pypi
tensorflow-io-gcs-filesystem 0.34.0                   pypi_0    pypi
tensorrt                  8.5.3.1                  pypi_0    pypi
termcolor                 2.3.0                    pypi_0    pypi
tk                        8.6.13          noxft_h4845f30_101    conda-forge
typing_extensions         4.8.0              pyha770c72_0    conda-forge
tzdata                    2023c                h71feb2d_0    conda-forge
urllib3                   2.0.7                    pypi_0    pypi
werkzeug                  3.0.1                    pypi_0    pypi
wheel                     0.41.3             pyhd8ed1ab_0    conda-forge
wrapt                     1.14.1                   pypi_0    pypi
xz                        5.2.6                h166bdaf_0    conda-forge
yaml                      0.2.5                h7f98852_2    conda-forge
zipp                      3.17.0                   pypi_0    pypi
zlib                      1.2.13               hd590300_5    conda-forge

I'm fairly confident that tensorflow is working correctly and that the CUDA versions are correct. My system is running driver version 510.39.01 with CUDA 11.6.

>>> import tensorflow.compat.v1 as tf
>>> tf.sysconfig.get_build_info()
OrderedDict([('cpu_compiler', '/usr/lib/llvm-16/bin/clang'), ('cuda_compute_capabilities', ['sm_35', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'compute_80']), ('cuda_version', '11.8'), ('cudnn_version', '8'), ('is_cuda_build', True), ('is_rocm_build', False), ('is_tensorrt_build', True)])

Conversely, if I install a CPU only version of tensorflow, both pyrosetta and tensorflow cooperate - however inference is very slow (as expected).

I have tried on a system with RTX 3090s (Driver 510.39.01, CUDA 11.6) and a system with an A6000 (Driver 535.54.03, CUDA 12.2).

The issue is similar on both. When the GPU is not detected, i.e. tf.config.list_physical_devices('GPU') reports an empty list [], pyrosetta and tensorflow cooperate. As soon as tf reports physical devices, I encounter segmentation faults. I have tried with tf-nightly, the most recent pyrosetta, and various python versions.

Any help at all would be very, very appreciated.

Kindly, Charles

nrbennet commented 10 months ago

Hi Charles, two things:

  1. I would recommend installing the envs using the new split conda ymls, which are much easier to install.
  2. For the af2 env, only JAX needs to be GPU-aware (if it passes the import test that I've included then you're golden). AF2 uses TF for some parsing but doesn't run any of the model with TF, so TF doesn't need to be GPU-aware.
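To make the second point concrete, a quick check along these lines could report what each framework sees. This is only a sketch and not the repo's af2_importtest.py. The function names are invented here, and the set of platform strings treated as "GPU" is an assumption that may differ between JAX versions.

```python
import importlib.util


def jax_sees_gpu():
    """True/False for JAX GPU visibility, or None if JAX is not installed."""
    if importlib.util.find_spec("jax") is None:
        return None
    import jax

    # Assumption: GPU devices report one of these platform names.
    return any(d.platform in ("gpu", "cuda", "rocm") for d in jax.devices())


def tensorflow_sees_gpu():
    """True/False for TF GPU visibility, or None if TF is not installed."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf

    return len(tf.config.list_physical_devices("GPU")) > 0


def af2_env_gpu_ok(jax_gpu, tf_gpu):
    """Only JAX has to see a GPU; TensorFlow's answer is ignored on purpose."""
    return jax_gpu is True
```

With this, af2_env_gpu_ok(jax_sees_gpu(), tensorflow_sees_gpu()) comes out True for an env where TF is CPU-only, as long as JAX finds the GPU.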

charbj commented 10 months ago

Hi Nathaniel,

Wicked - thanks. I have it working now. The TF issue had sent me down a long rabbit hole.

I appreciate your help, I'll close the issue.

You guys write awesome software :)

CBJ

fatimadavila commented 4 months ago

Hi, I've tried installing the envs using the new split conda ymls, and when I run af2_importtest.py I get a seg fault when importing pyrosetta. I've traced the issue to some sort of compatibility problem between jax/tensorflow and pyrosetta. Any idea how I can troubleshoot this? I've tried installing the environment in multiple ways and installing different versions of pyrosetta.

Edit: Can any of you please share what version of pyrosetta you're using?
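For comparing versions across machines, something like the sketch below reads the installed versions from package metadata without importing anything, so it still works in an env where the import itself segfaults. The distribution names in PACKAGES are assumptions. A conda package that ships no Python metadata would show up as "not found" even if it is installed, so conda list remains the fallback.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its version string, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed distribution names; adjust to match `conda list` / `pip list`.
PACKAGES = ["pyrosetta", "jax", "jaxlib", "tensorflow", "numpy"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:12s} {version or 'not found'}")
```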

Thanks! Fátima