microsoft / molecule-generation

Implementation of MoLeR: a generative model of molecular graphs which supports scaffold-constrained generation
MIT License
266 stars, 43 forks

libdevice not found during training using default conda environment on Ubuntu 22.04.2 with a RTX A4000 #61

Closed phcavelar closed 1 year ago

phcavelar commented 1 year ago

Hello, just to let you know that running `molecule-generation train` as described in the Readme.md, with the default conda environment, on Ubuntu 22.04.2 with an RTX A4000, fails because libdevice cannot be found (log below).

I've found that pinning Tensorflow to version 2.10 instead of 2.11 (the latest version, installed automatically at the time of writing), as suggested in this Stack Overflow question, fixes it.

If you wish, I can open a PR to pin the TF version to 2.10 or lower until this is fixed upstream, as pinning was also cited as a solution for #56. Otherwise, I'm at least posting this here so that other people can find this error and its solution more easily.
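As a rough illustration of the proposed pin, the sketch below checks whether a version string falls at or below 2.10. The helper names `parse_version` and `within_reported_pin` are hypothetical, and the 2.10 threshold comes only from this report, not from any upstream statement.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a string such as '2.11.1' into (2, 11, 1).

    Trailing suffixes like 'rc0' are ignored. Hypothetical helper.
    """
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def within_reported_pin(version: str, max_minor: Tuple[int, int] = (2, 10)) -> bool:
    """Return True if the major.minor part is at most max_minor.

    The default (2, 10) reflects the workaround reported in this thread.
    """
    return parse_version(version)[:2] <= max_minor
```

With Tensorflow installed, one could call `within_reported_pin(tf.__version__)` after `import tensorflow as tf` to see whether the environment matches the workaround described above.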

Error Log

```sh
Avg weighted sum. of graph losses: 291.5334
Avg weighted sum. of prop losses: 0.5965
Avg node class. loss: 71.0492
Avg first node class. loss: 40.7059
Avg edge selection loss: 1.7546
Avg edge type loss: 4.0202
Avg attachment point selection loss: 1.1500
Avg KL divergence: 6981316.0000
Property results: sa_score: MAE 10.77, MSE 3818.02 (norm MAE: 13.31) | clogp: MAE 23.54, MSE 13726.24 (norm MAE: 12.95) | mol_weight: MAE 393.53, MSE 168733.92 (norm MAE: 3.57).
(Stored model metadata and weights to ~/data/moler/saved/GNN_Edge_MLP_MoLeR__2023-06-21_09-51-05_best.pkl).
2023-06-21 09:52:54.588760: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-06-21 09:52:54.595713: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-06-21 09:52:54.612998: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-06-21 09:52:54.618529: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-06-21 09:52:54.647620: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-06-21 09:52:54.663816: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-06-21 09:52:54.683986: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-06-21 09:52:54.702780: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-06-21 09:52:54.723474: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-06-21 09:52:54.741439: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
Traceback (most recent call last):
  File "~/miniconda3/envs/moler-env/bin/molecule_generation", line 8, in <module>
    sys.exit(main())
  File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/molecule_generation/cli/cli.py", line 35, in main
    run_and_debug(lambda: commands[args.command].run_from_args(args), getattr(args, "debug", False))
  File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/dpu_utils/utils/debughelper.py", line 21, in run_and_debug
    func()
  File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/molecule_generation/cli/cli.py", line 35, in <lambda>
    run_and_debug(lambda: commands[args.command].run_from_args(args), getattr(args, "debug", False))
  File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/molecule_generation/cli/train.py", line 179, in run_from_args
    trained_model_path = train(
  File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/molecule_generation/cli/train.py", line 274, in train
    train_loss, train_speed, train_results = model.run_on_data_iterator(
  File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/molecule_generation/models/moler_base_model.py", line 244, in run_on_data_iterator
    task_metrics = self._run_step(batch_features, batch_labels, training)
  File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/tf2_gnn/models/graph_task_model.py", line 336, in _run_step
    return self._fast_run_step(batch_features_tuple, batch_labels_tuple, training)
  File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Detected at node 'cond/StatefulPartitionedCall_122' defined at (most recent call last):
    File "~/miniconda3/envs/moler-env/bin/molecule_generation", line 8, in <module>
      sys.exit(main())
    File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/molecule_generation/cli/cli.py", line 35, in main
      run_and_debug(lambda: commands[args.command].run_from_args(args), getattr(args, "debug", False))
    File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/dpu_utils/utils/debughelper.py", line 21, in run_and_debug
      func()
    File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/molecule_generation/cli/cli.py", line 35, in <lambda>
      run_and_debug(lambda: commands[args.command].run_from_args(args), getattr(args, "debug", False))
    File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/molecule_generation/cli/train.py", line 179, in run_from_args
      trained_model_path = train(
    File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/molecule_generation/cli/train.py", line 252, in train
      _, _, initial_valid_results = model.run_on_data_iterator(
    File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/molecule_generation/models/moler_base_model.py", line 244, in run_on_data_iterator
      task_metrics = self._run_step(batch_features, batch_labels, training)
    File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/tf2_gnn/models/graph_task_model.py", line 336, in _run_step
      return self._fast_run_step(batch_features_tuple, batch_labels_tuple, training)
    File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/tf2_gnn/models/graph_task_model.py", line 363, in _fast_run_step
      tf.cond(training, true_fn=_training_update, false_fn=_no_op)
    File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/tf2_gnn/models/graph_task_model.py", line 357, in _training_update
      self._apply_gradients(zip(gradients, self.trainable_variables))
    File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/tf2_gnn/models/graph_task_model.py", line 324, in _apply_gradients
      self._optimizer.apply_gradients(gradient_variable_pairs)
    File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
      return super().apply_gradients(grads_and_vars, name=name)
    File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
      iteration = self._internal_apply_gradients(grads_and_vars)
    File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
      return tf.__internal__.distribute.interim.maybe_merge_call(
    File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
      distribution.extended.update(
    File "~/miniconda3/envs/moler-env/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
      return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'cond/StatefulPartitionedCall_122'
libdevice not found at ./libdevice.10.bc
	 [[{{node cond/StatefulPartitionedCall_122}}]] [Op:__inference__fast_run_step_84892]
```
Conda Environment before pip install When I re-created the environment without the restriction this is the dependency list shown before installing `molecule-generation`: ``` # packages in environment at ~/miniconda3/envs/moler-env: # # Name Version Build Channel _libgcc_mutex 0.1 conda_forge conda-forge _openmp_mutex 4.5 2_gnu conda-forge absl-py 1.4.0 pyhd8ed1ab_0 conda-forge aiohttp 3.8.4 py310h2372a71_1 conda-forge aiosignal 1.3.1 pyhd8ed1ab_0 conda-forge astunparse 1.6.3 pyhd8ed1ab_0 conda-forge async-timeout 4.0.2 pyhd8ed1ab_0 conda-forge attrs 23.1.0 pyh71513ae_1 conda-forge blinker 1.6.2 pyhd8ed1ab_0 conda-forge boost 1.78.0 py310hc4a4660_4 conda-forge boost-cpp 1.78.0 h6582d0a_3 conda-forge brotli 1.0.9 h166bdaf_8 conda-forge brotli-bin 1.0.9 h166bdaf_8 conda-forge brotlipy 0.7.0 py310h5764c6d_1005 conda-forge bzip2 1.0.8 h7f98852_4 conda-forge c-ares 1.19.1 hd590300_0 conda-forge ca-certificates 2023.5.7 hbcca054_0 conda-forge cached-property 1.5.2 hd8ed1ab_1 conda-forge cached_property 1.5.2 pyha770c72_1 conda-forge cachetools 5.3.0 pyhd8ed1ab_0 conda-forge cairo 1.16.0 hbbf8b49_1016 conda-forge certifi 2023.5.7 pyhd8ed1ab_0 conda-forge cffi 1.15.1 py310h255011f_3 conda-forge charset-normalizer 3.1.0 pyhd8ed1ab_0 conda-forge click 8.1.3 unix_pyhd8ed1ab_2 conda-forge contourpy 1.1.0 py310hd41b1e2_0 conda-forge cryptography 41.0.1 py310h75e40e8_0 conda-forge cuda-version 11.8 h70ddcb2_2 conda-forge cudatoolkit 11.8.0 h37601d7_11 conda-forge cudnn 8.8.0.121 h0800d71_1 conda-forge cycler 0.11.0 pyhd8ed1ab_0 conda-forge expat 2.5.0 hcb278e6_1 conda-forge flatbuffers 23.3.3 hcb278e6_1 conda-forge font-ttf-dejavu-sans-mono 2.37 hab24e00_0 conda-forge font-ttf-inconsolata 3.000 h77eed37_0 conda-forge font-ttf-source-code-pro 2.038 h77eed37_0 conda-forge font-ttf-ubuntu 0.83 hab24e00_0 conda-forge fontconfig 2.14.2 h14ed4e7_0 conda-forge fonts-conda-ecosystem 1 0 conda-forge fonts-conda-forge 1 0 conda-forge fonttools 4.40.0 py310h2372a71_0 conda-forge freetype 
2.12.1 hca18f0e_1 conda-forge frozenlist 1.3.3 py310h5764c6d_0 conda-forge gast 0.4.0 pyh9f0ad1d_0 conda-forge gettext 0.21.1 h27087fc_0 conda-forge giflib 5.2.1 h0b41bf4_3 conda-forge google-auth 2.20.0 pyh1a96a4e_0 conda-forge google-auth-oauthlib 0.4.6 pyhd8ed1ab_0 conda-forge google-pasta 0.2.0 pyh8c360ce_0 conda-forge greenlet 2.0.2 py310hc6cd4ac_1 conda-forge grpcio 1.51.1 py310h4a5735c_1 conda-forge h5py 3.9.0 nompi_py310h367e799_100 conda-forge hdf5 1.14.0 nompi_hb72d44e_103 conda-forge icu 72.1 hcb278e6_0 conda-forge idna 3.4 pyhd8ed1ab_0 conda-forge importlib-metadata 6.7.0 pyha770c72_0 conda-forge keras 2.11.0 pyhd8ed1ab_0 conda-forge keras-preprocessing 1.1.2 pyhd8ed1ab_0 conda-forge keyutils 1.6.1 h166bdaf_0 conda-forge kiwisolver 1.4.4 py310hbf28c38_1 conda-forge krb5 1.20.1 h81ceb04_0 conda-forge lcms2 2.15 haa2dc70_1 conda-forge ld_impl_linux-64 2.40 h41732ed_0 conda-forge lerc 4.0.0 h27087fc_0 conda-forge libabseil 20220623.0 cxx17_h05df665_6 conda-forge libaec 1.0.6 hcb278e6_1 conda-forge libblas 3.9.0 17_linux64_openblas conda-forge libbrotlicommon 1.0.9 h166bdaf_8 conda-forge libbrotlidec 1.0.9 h166bdaf_8 conda-forge libbrotlienc 1.0.9 h166bdaf_8 conda-forge libcblas 3.9.0 17_linux64_openblas conda-forge libcurl 8.1.2 h409715c_0 conda-forge libdeflate 1.18 h0b41bf4_0 conda-forge libedit 3.1.20191231 he28a2e2_2 conda-forge libev 4.33 h516909a_1 conda-forge libexpat 2.5.0 hcb278e6_1 conda-forge libffi 3.4.2 h7f98852_5 conda-forge libgcc-ng 13.1.0 he5830b7_0 conda-forge libgfortran-ng 13.1.0 h69a702a_0 conda-forge libgfortran5 13.1.0 h15d22d2_0 conda-forge libglib 2.76.3 hebfc3b9_0 conda-forge libgomp 13.1.0 he5830b7_0 conda-forge libgrpc 1.51.1 h4fad500_1 conda-forge libiconv 1.17 h166bdaf_0 conda-forge libjpeg-turbo 2.1.5.1 h0b41bf4_0 conda-forge liblapack 3.9.0 17_linux64_openblas conda-forge libnghttp2 1.52.0 h61bc06f_0 conda-forge libnsl 2.0.0 h7f98852_0 conda-forge libopenblas 0.3.23 pthreads_h80387f5_0 conda-forge libpng 1.6.39 h753d276_0 
conda-forge libprotobuf 3.21.12 h3eb15da_0 conda-forge libsqlite 3.42.0 h2797004_0 conda-forge libssh2 1.11.0 h0841786_0 conda-forge libstdcxx-ng 13.1.0 hfd8a6a1_0 conda-forge libtiff 4.5.1 h8b53f26_0 conda-forge libuuid 2.38.1 h0b41bf4_0 conda-forge libwebp-base 1.3.0 h0b41bf4_0 conda-forge libxcb 1.15 h0b41bf4_0 conda-forge libzlib 1.2.13 hd590300_5 conda-forge markdown 3.4.3 pyhd8ed1ab_0 conda-forge markupsafe 2.1.3 py310h2372a71_0 conda-forge matplotlib-base 3.7.1 py310he60537e_0 conda-forge multidict 6.0.4 py310h1fa729e_0 conda-forge munkres 1.1.4 pyh9f0ad1d_0 conda-forge nccl 2.18.3.1 h12f7317_0 conda-forge ncurses 6.4 hcb278e6_0 conda-forge numpy 1.25.0 py310ha4c1d20_0 conda-forge oauthlib 3.2.2 pyhd8ed1ab_0 conda-forge openjpeg 2.5.0 hfec8fc6_2 conda-forge openssl 3.1.1 hd590300_1 conda-forge opt_einsum 3.3.0 pyhd8ed1ab_1 conda-forge packaging 23.1 pyhd8ed1ab_0 conda-forge pandas 2.0.2 py310h7cbd5c2_0 conda-forge pcre2 10.40 hc3806b6_0 conda-forge pillow 9.5.0 py310h582fbeb_1 conda-forge pip 23.1.2 pyhd8ed1ab_0 conda-forge pixman 0.40.0 h36c2ea0_0 conda-forge platformdirs 3.6.0 pyhd8ed1ab_0 conda-forge pooch 1.7.0 pyha770c72_3 conda-forge protobuf 4.21.12 py310heca2aa9_0 conda-forge pthread-stubs 0.4 h36c2ea0_1001 conda-forge pyasn1 0.4.8 py_0 conda-forge pyasn1-modules 0.2.7 py_0 conda-forge pycairo 1.24.0 py310hda9f760_0 conda-forge pycparser 2.21 pyhd8ed1ab_0 conda-forge pyjwt 2.7.0 pyhd8ed1ab_0 conda-forge pyopenssl 23.2.0 pyhd8ed1ab_1 conda-forge pyparsing 3.1.0 pyhd8ed1ab_0 conda-forge pysocks 1.7.1 pyha2e5f31_6 conda-forge python 3.10.11 he550d4f_0_cpython conda-forge python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge python-flatbuffers 23.5.26 pyhd8ed1ab_0 conda-forge python-tzdata 2023.3 pyhd8ed1ab_0 conda-forge python_abi 3.10 3_cp310 conda-forge pytz 2023.3 pyhd8ed1ab_0 conda-forge pyu2f 0.1.5 pyhd8ed1ab_0 conda-forge rdkit 2023.03.2 py310h399bcf7_0 conda-forge re2 2023.02.01 hcb278e6_0 conda-forge readline 8.2 h8228510_1 conda-forge reportlab 3.6.13 
py310h1a56a1c_0 conda-forge requests 2.31.0 pyhd8ed1ab_0 conda-forge requests-oauthlib 1.3.1 pyhd8ed1ab_0 conda-forge rsa 4.9 pyhd8ed1ab_0 conda-forge scipy 1.10.1 py310ha4c1d20_3 conda-forge setuptools 67.7.2 pyhd8ed1ab_0 conda-forge six 1.16.0 pyh6c4a22f_0 conda-forge snappy 1.1.10 h9fff704_0 conda-forge sqlalchemy 2.0.16 py310h2372a71_0 conda-forge tensorboard 2.11.2 pyhd8ed1ab_0 conda-forge tensorboard-data-server 0.6.1 py310h600f1e7_4 conda-forge tensorboard-plugin-wit 1.8.1 pyhd8ed1ab_0 conda-forge tensorflow 2.11.1 cuda112py310he87a039_0 conda-forge tensorflow-base 2.11.1 cuda112py310h4c92a00_0 conda-forge tensorflow-estimator 2.11.1 cuda112py310h37add04_0 conda-forge termcolor 2.3.0 pyhd8ed1ab_0 conda-forge tk 8.6.12 h27826a3_0 conda-forge typing-extensions 4.6.3 hd8ed1ab_0 conda-forge typing_extensions 4.6.3 pyha770c72_0 conda-forge tzdata 2023c h71feb2d_0 conda-forge unicodedata2 15.0.0 py310h5764c6d_0 conda-forge urllib3 1.26.15 pyhd8ed1ab_0 conda-forge werkzeug 2.3.6 pyhd8ed1ab_0 conda-forge wheel 0.40.0 pyhd8ed1ab_0 conda-forge wrapt 1.15.0 py310h1fa729e_0 conda-forge xorg-kbproto 1.0.7 h7f98852_1002 conda-forge xorg-libice 1.1.1 hd590300_0 conda-forge xorg-libsm 1.2.4 h7391055_0 conda-forge xorg-libx11 1.8.6 h8ee46fc_0 conda-forge xorg-libxau 1.0.11 hd590300_0 conda-forge xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge xorg-libxext 1.3.4 h0b41bf4_2 conda-forge xorg-libxrender 0.9.10 h7f98852_1003 conda-forge xorg-renderproto 0.11.1 h7f98852_1002 conda-forge xorg-xextproto 7.3.0 h0b41bf4_1003 conda-forge xorg-xproto 7.0.31 h7f98852_1007 conda-forge xz 5.2.6 h166bdaf_0 conda-forge yarl 1.9.2 py310h2372a71_0 conda-forge zipp 3.15.0 pyhd8ed1ab_0 conda-forge zlib 1.2.13 hd590300_5 conda-forge zstd 1.5.2 h3eb15da_6 conda-forge ```
Conda environment after pip install And after running `pip install molecule-generation`: ``` # packages in environment at ~/miniconda3/envs/moler-env: # # Name Version Build Channel _libgcc_mutex 0.1 conda_forge conda-forge _openmp_mutex 4.5 2_gnu conda-forge absl-py 1.4.0 pyhd8ed1ab_0 conda-forge aiohttp 3.8.4 py310h2372a71_1 conda-forge aiosignal 1.3.1 pyhd8ed1ab_0 conda-forge astunparse 1.6.3 pyhd8ed1ab_0 conda-forge async-timeout 4.0.2 pyhd8ed1ab_0 conda-forge attrs 23.1.0 pyh71513ae_1 conda-forge azure-core 1.27.1 pypi_0 pypi azure-identity 1.13.0 pypi_0 pypi azure-storage-blob 12.16.0 pypi_0 pypi blinker 1.6.2 pyhd8ed1ab_0 conda-forge boost 1.78.0 py310hc4a4660_4 conda-forge boost-cpp 1.78.0 h6582d0a_3 conda-forge brotli 1.0.9 h166bdaf_8 conda-forge brotli-bin 1.0.9 h166bdaf_8 conda-forge brotlipy 0.7.0 py310h5764c6d_1005 conda-forge bzip2 1.0.8 h7f98852_4 conda-forge c-ares 1.19.1 hd590300_0 conda-forge ca-certificates 2023.5.7 hbcca054_0 conda-forge cached-property 1.5.2 hd8ed1ab_1 conda-forge cached_property 1.5.2 pyha770c72_1 conda-forge cachetools 5.3.0 pyhd8ed1ab_0 conda-forge cairo 1.16.0 hbbf8b49_1016 conda-forge certifi 2023.5.7 pyhd8ed1ab_0 conda-forge cffi 1.15.1 py310h255011f_3 conda-forge charset-normalizer 3.1.0 pyhd8ed1ab_0 conda-forge click 8.1.3 unix_pyhd8ed1ab_2 conda-forge contourpy 1.1.0 py310hd41b1e2_0 conda-forge cryptography 41.0.1 py310h75e40e8_0 conda-forge cuda-version 11.8 h70ddcb2_2 conda-forge cudatoolkit 11.8.0 h37601d7_11 conda-forge cudnn 8.8.0.121 h0800d71_1 conda-forge cycler 0.11.0 pyhd8ed1ab_0 conda-forge docopt 0.6.2 pypi_0 pypi dpu-utils 0.6.1 pypi_0 pypi expat 2.5.0 hcb278e6_1 conda-forge flatbuffers 23.3.3 hcb278e6_1 conda-forge font-ttf-dejavu-sans-mono 2.37 hab24e00_0 conda-forge font-ttf-inconsolata 3.000 h77eed37_0 conda-forge font-ttf-source-code-pro 2.038 h77eed37_0 conda-forge font-ttf-ubuntu 0.83 hab24e00_0 conda-forge fontconfig 2.14.2 h14ed4e7_0 conda-forge fonts-conda-ecosystem 1 0 conda-forge 
fonts-conda-forge 1 0 conda-forge fonttools 4.40.0 py310h2372a71_0 conda-forge freetype 2.12.1 hca18f0e_1 conda-forge frozenlist 1.3.3 py310h5764c6d_0 conda-forge gast 0.4.0 pyh9f0ad1d_0 conda-forge gettext 0.21.1 h27087fc_0 conda-forge giflib 5.2.1 h0b41bf4_3 conda-forge google-auth 2.20.0 pyh1a96a4e_0 conda-forge google-auth-oauthlib 0.4.6 pyhd8ed1ab_0 conda-forge google-pasta 0.2.0 pyh8c360ce_0 conda-forge greenlet 2.0.2 py310hc6cd4ac_1 conda-forge grpcio 1.51.1 py310h4a5735c_1 conda-forge h5py 3.9.0 nompi_py310h367e799_100 conda-forge hdf5 1.14.0 nompi_hb72d44e_103 conda-forge icu 72.1 hcb278e6_0 conda-forge idna 3.4 pyhd8ed1ab_0 conda-forge importlib-metadata 6.7.0 pyha770c72_0 conda-forge isodate 0.6.1 pypi_0 pypi joblib 1.2.0 pypi_0 pypi keras 2.11.0 pyhd8ed1ab_0 conda-forge keras-preprocessing 1.1.2 pyhd8ed1ab_0 conda-forge keyutils 1.6.1 h166bdaf_0 conda-forge kiwisolver 1.4.4 py310hbf28c38_1 conda-forge krb5 1.20.1 h81ceb04_0 conda-forge lcms2 2.15 haa2dc70_1 conda-forge ld_impl_linux-64 2.40 h41732ed_0 conda-forge lerc 4.0.0 h27087fc_0 conda-forge libabseil 20220623.0 cxx17_h05df665_6 conda-forge libaec 1.0.6 hcb278e6_1 conda-forge libblas 3.9.0 17_linux64_openblas conda-forge libbrotlicommon 1.0.9 h166bdaf_8 conda-forge libbrotlidec 1.0.9 h166bdaf_8 conda-forge libbrotlienc 1.0.9 h166bdaf_8 conda-forge libcblas 3.9.0 17_linux64_openblas conda-forge libcurl 8.1.2 h409715c_0 conda-forge libdeflate 1.18 h0b41bf4_0 conda-forge libedit 3.1.20191231 he28a2e2_2 conda-forge libev 4.33 h516909a_1 conda-forge libexpat 2.5.0 hcb278e6_1 conda-forge libffi 3.4.2 h7f98852_5 conda-forge libgcc-ng 13.1.0 he5830b7_0 conda-forge libgfortran-ng 13.1.0 h69a702a_0 conda-forge libgfortran5 13.1.0 h15d22d2_0 conda-forge libglib 2.76.3 hebfc3b9_0 conda-forge libgomp 13.1.0 he5830b7_0 conda-forge libgrpc 1.51.1 h4fad500_1 conda-forge libiconv 1.17 h166bdaf_0 conda-forge libjpeg-turbo 2.1.5.1 h0b41bf4_0 conda-forge liblapack 3.9.0 17_linux64_openblas conda-forge libnghttp2 
1.52.0 h61bc06f_0 conda-forge libnsl 2.0.0 h7f98852_0 conda-forge libopenblas 0.3.23 pthreads_h80387f5_0 conda-forge libpng 1.6.39 h753d276_0 conda-forge libprotobuf 3.21.12 h3eb15da_0 conda-forge libsqlite 3.42.0 h2797004_0 conda-forge libssh2 1.11.0 h0841786_0 conda-forge libstdcxx-ng 13.1.0 hfd8a6a1_0 conda-forge libtiff 4.5.1 h8b53f26_0 conda-forge libuuid 2.38.1 h0b41bf4_0 conda-forge libwebp-base 1.3.0 h0b41bf4_0 conda-forge libxcb 1.15 h0b41bf4_0 conda-forge libzlib 1.2.13 hd590300_5 conda-forge markdown 3.4.3 pyhd8ed1ab_0 conda-forge markupsafe 2.1.3 py310h2372a71_0 conda-forge matplotlib-base 3.7.1 py310he60537e_0 conda-forge molecule-generation 0.4.0 pypi_0 pypi more-itertools 9.1.0 pypi_0 pypi msal 1.22.0 pypi_0 pypi msal-extensions 1.0.0 pypi_0 pypi multidict 6.0.4 py310h1fa729e_0 conda-forge munkres 1.1.4 pyh9f0ad1d_0 conda-forge nccl 2.18.3.1 h12f7317_0 conda-forge ncurses 6.4 hcb278e6_0 conda-forge numpy 1.25.0 py310ha4c1d20_0 conda-forge oauthlib 3.2.2 pyhd8ed1ab_0 conda-forge openjpeg 2.5.0 hfec8fc6_2 conda-forge openssl 3.1.1 hd590300_1 conda-forge opt_einsum 3.3.0 pyhd8ed1ab_1 conda-forge packaging 23.1 pyhd8ed1ab_0 conda-forge pandas 2.0.2 py310h7cbd5c2_0 conda-forge pcre2 10.40 hc3806b6_0 conda-forge pillow 9.5.0 py310h582fbeb_1 conda-forge pip 23.1.2 pyhd8ed1ab_0 conda-forge pixman 0.40.0 h36c2ea0_0 conda-forge platformdirs 3.6.0 pyhd8ed1ab_0 conda-forge pooch 1.7.0 pyha770c72_3 conda-forge portalocker 2.7.0 pypi_0 pypi protobuf 3.20.3 pypi_0 pypi pthread-stubs 0.4 h36c2ea0_1001 conda-forge pyasn1 0.4.8 py_0 conda-forge pyasn1-modules 0.2.7 py_0 conda-forge pycairo 1.24.0 py310hda9f760_0 conda-forge pycparser 2.21 pyhd8ed1ab_0 conda-forge pyjwt 2.7.0 pyhd8ed1ab_0 conda-forge pyopenssl 23.2.0 pyhd8ed1ab_1 conda-forge pyparsing 3.1.0 pyhd8ed1ab_0 conda-forge pysocks 1.7.1 pyha2e5f31_6 conda-forge python 3.10.11 he550d4f_0_cpython conda-forge python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge python-flatbuffers 23.5.26 pyhd8ed1ab_0 conda-forge 
python-tzdata 2023.3 pyhd8ed1ab_0 conda-forge python_abi 3.10 3_cp310 conda-forge pytz 2023.3 pyhd8ed1ab_0 conda-forge pyu2f 0.1.5 pyhd8ed1ab_0 conda-forge rdkit 2023.03.2 py310h399bcf7_0 conda-forge re2 2023.02.01 hcb278e6_0 conda-forge readline 8.2 h8228510_1 conda-forge regex 2023.6.3 pypi_0 pypi reportlab 3.6.13 py310h1a56a1c_0 conda-forge requests 2.31.0 pyhd8ed1ab_0 conda-forge requests-oauthlib 1.3.1 pyhd8ed1ab_0 conda-forge rsa 4.9 pyhd8ed1ab_0 conda-forge scikit-learn 1.2.2 pypi_0 pypi scipy 1.10.1 py310ha4c1d20_3 conda-forge sentencepiece 0.1.99 pypi_0 pypi setsimilaritysearch 1.0.1 pypi_0 pypi setuptools 67.7.2 pyhd8ed1ab_0 conda-forge six 1.16.0 pyh6c4a22f_0 conda-forge snappy 1.1.10 h9fff704_0 conda-forge sqlalchemy 2.0.16 py310h2372a71_0 conda-forge tensorboard 2.11.2 pyhd8ed1ab_0 conda-forge tensorboard-data-server 0.6.1 py310h600f1e7_4 conda-forge tensorboard-plugin-wit 1.8.1 pyhd8ed1ab_0 conda-forge tensorflow 2.11.1 cuda112py310he87a039_0 conda-forge tensorflow-base 2.11.1 cuda112py310h4c92a00_0 conda-forge tensorflow-estimator 2.11.1 cuda112py310h37add04_0 conda-forge termcolor 2.3.0 pyhd8ed1ab_0 conda-forge tf2-gnn 2.13.0 pypi_0 pypi threadpoolctl 3.1.0 pypi_0 pypi tk 8.6.12 h27826a3_0 conda-forge tqdm 4.65.0 pypi_0 pypi typing-extensions 4.6.3 hd8ed1ab_0 conda-forge typing_extensions 4.6.3 pyha770c72_0 conda-forge tzdata 2023c h71feb2d_0 conda-forge unicodedata2 15.0.0 py310h5764c6d_0 conda-forge urllib3 1.26.15 pyhd8ed1ab_0 conda-forge werkzeug 2.3.6 pyhd8ed1ab_0 conda-forge wheel 0.40.0 pyhd8ed1ab_0 conda-forge wrapt 1.15.0 py310h1fa729e_0 conda-forge xorg-kbproto 1.0.7 h7f98852_1002 conda-forge xorg-libice 1.1.1 hd590300_0 conda-forge xorg-libsm 1.2.4 h7391055_0 conda-forge xorg-libx11 1.8.6 h8ee46fc_0 conda-forge xorg-libxau 1.0.11 hd590300_0 conda-forge xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge xorg-libxext 1.3.4 h0b41bf4_2 conda-forge xorg-libxrender 0.9.10 h7f98852_1003 conda-forge xorg-renderproto 0.11.1 h7f98852_1002 conda-forge 
xorg-xextproto 7.3.0 h0b41bf4_1003 conda-forge xorg-xproto 7.0.31 h7f98852_1007 conda-forge xz 5.2.6 h166bdaf_0 conda-forge yarl 1.9.2 py310h2372a71_0 conda-forge zipp 3.15.0 pyhd8ed1ab_0 conda-forge zlib 1.2.13 hd590300_5 conda-forge zstd 1.5.2 h3eb15da_6 conda-forge ```
### Tasks
- [ ] Pin the TF version to be less than or equal to 2.10
kmaziarz commented 1 year ago

Does this happen only on very specific machines? Is it considered a bug in tensorflow that is expected to be fixed in a future version? I'm wondering what is the right course of action for us, given that environment.yml is also used in CI (and the versions are unpinned there to detect when newest tensorflow breaks our code)...

phcavelar commented 1 year ago

I don't know how specific this would be to Ubuntu or to the GPU I was running on, since the Stackoverflow question doesn't specify which GPU they had when that problem started.

However, I saw that the only CI job that uses TF 2.11 installs the CPU version of Tensorflow:

     + tensorflow                       2.11.1  cpu_py310hd1aba9c_0      conda-forge       31kB

Which might be the reason why the CI pipeline isn't catching the problem. I understand that it might not be practical to run a CI job on a machine with a GPU, but, if you have the resources for it, it might be something to consider, since the code in this repo is most likely going to be run with a GPU, and it is exactly my GPU setup that failed. Doing so would let you catch these nasty CUDA-related bugs, at least so that people are aware of which versions might fail in the future.
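To make the CPU-versus-GPU distinction above easier to spot in logs, here is a minimal sketch that reads a package line and classifies its build string. It assumes that builds starting with `cpu` are CPU-only and builds starting with `cuda` are GPU-enabled, a convention inferred only from the two build strings seen in this thread (`cpu_py310hd1aba9c_0` and `cuda112py310he87a039_0`). The function names are hypothetical.

```python
from typing import Tuple


def parse_conda_line(line: str) -> Tuple[str, str, str]:
    """Split a package line into (name, version, build).

    Accepts `conda list` style lines as well as solver-output lines
    that start with a '+' marker, like the CI log line quoted above.
    """
    tokens = line.split()
    if tokens and tokens[0] == "+":
        tokens = tokens[1:]
    if len(tokens) < 3:
        raise ValueError(f"not a package line: {line!r}")
    return tokens[0], tokens[1], tokens[2]


def build_variant(build: str) -> str:
    """Guess 'cpu', 'gpu', or 'unknown' from a build string.

    The prefix rule is an assumption based on this thread only.
    """
    lowered = build.lower()
    if lowered.startswith("cpu"):
        return "cpu"
    if lowered.startswith("cuda"):
        return "gpu"
    return "unknown"
```

Applied to the lines in this thread, the CI entry would be classified as `cpu` and the entry from the failing environment as `gpu`, which matches the explanation given for why CI did not catch the failure.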

For now, I think just having this thread might be enough to help anyone that stumbles upon this issue, but it'd be even better to put it as a warning on the readme as some people might not always look up past issues before opening a new one.

kmaziarz commented 1 year ago

> I saw that the only CI job that uses TF 2.11 installs the CPU version of Tensorflow

Yes, I guess that's expected; I wouldn't expect the standard CI agents to have a GPU. That being said, I noticed that the Python 3.8 build seems to be installing CUDA libraries and GPU-enabled Tensorflow... I need to take a closer look to understand what's going on here.

Coming back to your issue, the Tensorflow install guide has a section called "Ubuntu 22.04" which seems to talk about the exact problem you're having? They mention a way to fix it which does not involve downgrading Tensorflow. If this is indeed a fix, then maybe I can mention that in the README.md explicitly (we already point to the Tensorflow website for guidelines on installation, just without specific details).

kmaziarz commented 1 year ago

(For me the instructions from the Tensorflow website resolve the issue, and training under 2.13.0 seems to be working)
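For anyone checking their own environment, the sketch below looks for `libdevice.10.bc` and tests whether `XLA_FLAGS` points at a directory where XLA could plausibly find it. It assumes that XLA searches `<dir>/nvvm/libdevice/` under the directory given by `--xla_gpu_cuda_data_dir`; treat that layout as an assumption to verify against the Tensorflow install guide. The function names are hypothetical, and the code only inspects paths without changing anything.

```python
from pathlib import Path
from typing import List, Optional

LIBDEVICE = "libdevice.10.bc"
DATA_DIR_FLAG = "--xla_gpu_cuda_data_dir="


def cuda_data_dir_from_flags(xla_flags: str) -> Optional[str]:
    """Extract the value of --xla_gpu_cuda_data_dir from an XLA_FLAGS string."""
    for token in xla_flags.split():
        if token.startswith(DATA_DIR_FLAG):
            return token[len(DATA_DIR_FLAG):]
    return None


def find_libdevice(prefix: Path) -> List[Path]:
    """List every copy of libdevice.10.bc below prefix (e.g. $CONDA_PREFIX)."""
    return sorted(prefix.rglob(LIBDEVICE))


def libdevice_visible(xla_flags: str) -> bool:
    """Return True if the flagged data dir contains nvvm/libdevice/libdevice.10.bc.

    The nvvm/libdevice layout is an assumption, not confirmed in this thread.
    """
    data_dir = cuda_data_dir_from_flags(xla_flags)
    if not data_dir:
        return False
    return (Path(data_dir) / "nvvm" / "libdevice" / LIBDEVICE).is_file()
```

A typical use would be to call `libdevice_visible(os.environ.get("XLA_FLAGS", ""))` and, if it returns `False`, run `find_libdevice(Path(os.environ["CONDA_PREFIX"]))` to see where the file actually lives in the environment.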