torchmd / torchmd-net

Training neural network potentials
MIT License

bad_alloc using PyTorch 1.12 #125

Closed peastman closed 10 months ago

peastman commented 2 years ago

I have a model created with TorchMD-Net. I want to use it for running a simulation in OpenMM. That involves compiling to TorchScript, saving to a file, and loading it with the PyTorch C++ API. When I try to do that, it crashes with a bad_alloc down inside libtorch.
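For readers unfamiliar with that workflow, here is a minimal sketch of the Python side, assuming `model` is any `torch.nn.Module` that TorchScript can compile. The helper name `export_and_reload` is hypothetical and is not part of TorchMD-Net or OpenMM.

```python
def export_and_reload(model, path):
    """Compile `model` to TorchScript, save it to `path`, and load it back."""
    # Imported lazily so this snippet can be loaded without PyTorch installed.
    import torch

    scripted = torch.jit.script(model)  # compile the nn.Module to TorchScript
    scripted.save(path)                 # write the serialized module to disk
    return torch.jit.load(path)         # reload in Python as a quick sanity check
```

The saved file is what the C++ side would then open with `torch::jit::load`. Reloading it in Python first is one way to tell an export problem from a problem in the C++ environment.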

Is this expected to work? Or do some of the packages like pyg and torch-cluster not support that workflow? If it's known not to work right now, what would need to happen to make it work?

PhilippThoelke commented 2 years ago

I haven't tried using TorchMD-Net from C++, so I don't know. You could try breaking the model down and exporting submodules to narrow down the issue. You could also try a small pyg message-passing example to see whether that's the source. Apart from pyg's message-passing implementation and the custom kernels, the implementation uses fairly basic PyTorch functionality.

peastman commented 2 years ago

For what it's worth, gdb shows the error happens inside torch_cluster.

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff6e567f1 in __GI_abort () at abort.c:79
#2  0x00007fffc228e036 in __gnu_cxx::__verbose_terminate_handler ()
    at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007fffc228c524 in __cxxabiv1::__terminate (handler=<optimized out>)
    at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x00007fffc228c576 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:58
#5  0x00007fffc228c768 in __cxxabiv1::__cxa_throw (obj=0x5555597a3f50, 
    tinfo=0x7fffc2380278 <typeinfo for std::bad_alloc>, 
    dest=0x7fffc228b0e4 <std::bad_alloc::~bad_alloc()>)
    at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:98
#6  0x00007fffc228cb95 in operator new (sz=140734623278748)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1652324151713/work/build/x86_64-conda-linux-gnu/libstdc++-v3/libsupc++/new:64
#7  0x00007fffb9dc442e in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*> ()
   from /home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007fffb9dc52e6 in c10::RegisterOperators::inferSchemaFromKernels_(c10::OperatorName const&, c10::RegisterOperators::Options const&) ()
   from /home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007fffb9dc8152 in c10::RegisterOperators::checkSchemaAndRegisterOp_(c10::RegisterOperators::Options&&) ()
   from /home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007fff553a469d in std::enable_if<c10::guts::is_function_type<long ()>::value&&(!std::is_same<long (), void (c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*)>::value), c10::RegisterOperators&&>::type c10::RegisterOperators::op<long ()>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long (*)(), c10::RegisterOperators::Options&&) && ()
   from /home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch_cluster/_version_cuda.so
#11 0x00007fff553a217d in ?? ()
   from /home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch_cluster/_version_cuda.so
#12 0x00007ffff7de38d3 in call_init (env=0x555557f25be0, argv=0x7fffffffbb48, argc=2, 
    l=<optimized out>) at dl-init.c:72
#13 _dl_init (main_map=main_map@entry=0x5555598792d0, argc=2, argv=0x7fffffffbb48, 
    env=0x555557f25be0) at dl-init.c:119
#14 0x00007ffff7de839f in dl_open_worker (a=a@entry=0x7fffffff1a70) at dl-open.c:522
#15 0x00007ffff6f7d16f in __GI__dl_catch_exception (exception=0x7fffffff1a50, 
    operate=0x7ffff7de7f60 <dl_open_worker>, args=0x7fffffff1a70) at dl-error-skeleton.c:196
#16 0x00007ffff7de796a in _dl_open (
    file=0x7fff553b9440 "/home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch_cluster/_version_cuda.so", mode=-2147483646, caller_dlopen=0x7ffff7e7df3e <py_dl_open+142>, 
    nsid=<optimized out>, argc=2, argv=<optimized out>, env=0x555557f25be0) at dl-open.c:605

(continuing on up to stack frame #491)

peastman commented 2 years ago

Sorry, it looks like I misdiagnosed what the problem is. The error actually occurs as soon as I import torchmdnet, and it's caused by upgrading to PyTorch 1.12. I create an environment like this:

mamba env create -f environment.yml
conda activate torchmd-net
pip install -e .

At that point things work correctly. So now upgrade PyTorch:

mamba install -c conda-forge pytorch=1.12

and execute the command

python -c "from torchmdnet.models.model import load_model"

It fails with

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted
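
Because the abort kills the interpreter, it can help to run each suspect import in its own process when narrowing this down. The following is a minimal sketch using only the standard library; the function name `try_import` is hypothetical, and the signal detection assumes a POSIX system.

```python
import signal
import subprocess
import sys


def try_import(statement):
    """Run `statement` in a fresh interpreter and report how it exited."""
    result = subprocess.run(
        [sys.executable, "-c", statement],
        capture_output=True,
        text=True,
    )
    code = result.returncode
    if code == 0:
        return "ok"
    if code < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        return "killed by " + signal.Signals(-code).name
    return "failed with exit code %d" % code
```

Calling it with `"import torch"`, `"import torch_cluster"`, and `"from torchmdnet.models.model import load_model"` in turn shows which import is the first to fail, without taking down the calling process.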

This is on Ubuntu 20.04. Here's the complete environment.

# packages in environment at /home/peastman/miniconda3/envs/torchmd-net:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
absl-py                   1.2.0              pyhd8ed1ab_0    conda-forge
aiohttp                   3.8.3            py39hb9d737c_0    conda-forge
aiosignal                 1.2.0              pyhd8ed1ab_0    conda-forge
alsa-lib                  1.2.3.2              h166bdaf_0    conda-forge
async-timeout             4.0.2              pyhd8ed1ab_0    conda-forge
attrs                     22.1.0             pyh71513ae_1    conda-forge
blinker                   1.4                        py_1    conda-forge
brotli                    1.0.9                h166bdaf_7    conda-forge
brotli-bin                1.0.9                h166bdaf_7    conda-forge
brotlipy                  0.7.0           py39hb9d737c_1004    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.18.1               h7f98852_0    conda-forge
ca-certificates           2022.9.14            ha878542_0    conda-forge
cached-property           1.5.2                hd8ed1ab_1    conda-forge
cached_property           1.5.2              pyha770c72_1    conda-forge
cachetools                5.2.0              pyhd8ed1ab_0    conda-forge
certifi                   2022.9.14          pyhd8ed1ab_0    conda-forge
cffi                      1.15.1           py39he91dace_0    conda-forge
charset-normalizer        2.1.1              pyhd8ed1ab_0    conda-forge
click                     8.1.3            py39hf3d152e_0    conda-forge
colorama                  0.4.5              pyhd8ed1ab_0    conda-forge
contourpy                 1.0.5            py39hf939315_0    conda-forge
coverage                  6.4.4            py39hb9d737c_0    conda-forge
cryptography              37.0.1           py39h9ce1e76_0  
cudatoolkit               11.7.0              hd8887f6_10    conda-forge
cudnn                     8.4.1.50             hed8a83a_0    conda-forge
cycler                    0.11.0             pyhd8ed1ab_0    conda-forge
dbus                      1.13.18              hb2f20db_0  
expat                     2.4.9                h27087fc_0    conda-forge
flake8                    5.0.4              pyhd8ed1ab_0    conda-forge
font-ttf-dejavu-sans-mono 2.37                 hab24e00_0    conda-forge
font-ttf-inconsolata      3.000                h77eed37_0    conda-forge
font-ttf-source-code-pro  2.038                h77eed37_0    conda-forge
font-ttf-ubuntu           0.83                 hab24e00_0    conda-forge
fontconfig                2.14.0               hc2a2eb6_1    conda-forge
fonts-conda-ecosystem     1                             0    conda-forge
fonts-conda-forge         1                             0    conda-forge
fonttools                 4.37.3           py39hb9d737c_0    conda-forge
freetype                  2.12.1               hca18f0e_0    conda-forge
frozenlist                1.3.1            py39hb9d737c_0    conda-forge
fsspec                    2022.8.2           pyhd8ed1ab_0    conda-forge
future                    0.18.2           py39hf3d152e_5    conda-forge
gettext                   0.21.0               hf68c758_0  
glib                      2.72.1               h6239696_0    conda-forge
glib-tools                2.72.1               h6239696_0    conda-forge
google-auth               2.11.0             pyh6c4a22f_0    conda-forge
google-auth-oauthlib      0.4.1                      py_2    conda-forge
googledrivedownloader     0.4                pyhd3deb0d_1    conda-forge
grpc-cpp                  1.48.1               hc2bec63_1    conda-forge
grpcio                    1.48.1           py39hfaff5cf_1    conda-forge
gst-plugins-base          1.20.2               hcf0ee16_0    conda-forge
gstreamer                 1.20.3               hd4edc92_2    conda-forge
h5py                      3.7.0           nompi_py39hd51670d_101    conda-forge
hdf5                      1.12.2          nompi_h4df4325_100    conda-forge
html5lib                  1.1                pyh9f0ad1d_0    conda-forge
icu                       69.1                 h9c3ff4c_0    conda-forge
idna                      3.4                pyhd8ed1ab_0    conda-forge
importlib-metadata        4.11.4           py39hf3d152e_0    conda-forge
importlib_metadata        4.11.4               hd8ed1ab_0    conda-forge
iniconfig                 1.1.1              pyh9f0ad1d_0    conda-forge
intel-openmp              2022.1.0          h9e868ea_3769  
isodate                   0.6.1              pyhd8ed1ab_0    conda-forge
jinja2                    3.1.2              pyhd8ed1ab_1    conda-forge
joblib                    1.2.0              pyhd8ed1ab_0    conda-forge
jpeg                      9e                   h166bdaf_2    conda-forge
keyutils                  1.6.1                h166bdaf_0    conda-forge
kiwisolver                1.4.4            py39hf939315_0    conda-forge
krb5                      1.19.3               h08a2579_0    conda-forge
lark-parser               0.12.0             pyhd8ed1ab_0    conda-forge
lcms2                     2.12                 hddcbb42_0    conda-forge
ld_impl_linux-64          2.38                 h1181459_1  
lerc                      4.0.0                h27087fc_0    conda-forge
libabseil                 20220623.0      cxx17_h48a1fff_4    conda-forge
libblas                   3.9.0            16_linux64_mkl    conda-forge
libbrotlicommon           1.0.9                h166bdaf_7    conda-forge
libbrotlidec              1.0.9                h166bdaf_7    conda-forge
libbrotlienc              1.0.9                h166bdaf_7    conda-forge
libcblas                  3.9.0            16_linux64_mkl    conda-forge
libclang                  13.0.1          default_hc23dcda_0    conda-forge
libcurl                   7.83.1               h2283fc2_0    conda-forge
libdeflate                1.14                 h166bdaf_0    conda-forge
libedit                   3.1.20210910         h7f8727e_0  
libev                     4.33                 h516909a_1    conda-forge
libevent                  2.1.10               h28343ad_4    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 12.1.0              h8d9b700_16    conda-forge
libgfortran-ng            12.1.0              h69a702a_16    conda-forge
libgfortran5              12.1.0              hdcd56e2_16    conda-forge
libglib                   2.72.1               h2d90d5f_0    conda-forge
libgomp                   12.1.0              h8d9b700_16    conda-forge
libiconv                  1.16                 h516909a_0    conda-forge
liblapack                 3.9.0            16_linux64_mkl    conda-forge
libllvm13                 13.0.1               hf817b99_2    conda-forge
libnghttp2                1.47.0               hff17c54_1    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libogg                    1.3.5                h27cfd23_1  
libopus                   1.3.1                h7f98852_1    conda-forge
libpng                    1.6.38               h753d276_0    conda-forge
libpq                     14.5                 he2d8382_0    conda-forge
libprotobuf               3.20.1               h6239696_4    conda-forge
libsqlite                 3.39.3               h753d276_0    conda-forge
libssh2                   1.10.0               hf14f497_3    conda-forge
libstdcxx-ng              12.1.0              ha89aaad_16    conda-forge
libtiff                   4.4.0                h55922b4_4    conda-forge
libuuid                   2.32.1            h14c3975_1000    conda-forge
libvorbis                 1.3.7                he1b5a44_0    conda-forge
libwebp-base              1.2.4                h166bdaf_0    conda-forge
libxcb                    1.13              h7f98852_1004    conda-forge
libxkbcommon              1.0.3                he3ba5ed_0    conda-forge
libxml2                   2.9.12               h885dcf4_1    conda-forge
libzlib                   1.2.12               h166bdaf_3    conda-forge
magma                     2.5.4                h6103c52_2    conda-forge
markdown                  3.4.1              pyhd8ed1ab_0    conda-forge
markupsafe                2.1.1            py39hb9d737c_1    conda-forge
matplotlib                3.6.0            py39hf3d152e_0    conda-forge
matplotlib-base           3.6.0            py39hf9fd14e_0    conda-forge
mccabe                    0.7.0              pyhd8ed1ab_0    conda-forge
mkl                       2022.1.0           hc2b9512_224  
multidict                 6.0.2            py39hb9d737c_1    conda-forge
munkres                   1.1.4              pyh9f0ad1d_0    conda-forge
mysql-common              8.0.30               h26416b9_1    conda-forge
mysql-libs                8.0.30               hbc51c84_1    conda-forge
nccl                      2.14.3.1             h0800d71_0    conda-forge
ncurses                   6.3                  h27087fc_1    conda-forge
networkx                  2.8.6              pyhd8ed1ab_0    conda-forge
ninja                     1.11.0               h924138e_0    conda-forge
nnpops                    0.2             cuda112py39hcdac82f_5    conda-forge
nspr                      4.33                 h295c915_0  
nss                       3.78                 h2350873_0    conda-forge
numpy                     1.23.3           py39hba7629e_0    conda-forge
oauthlib                  3.2.1              pyhd8ed1ab_0    conda-forge
openjpeg                  2.5.0                h7d73246_1    conda-forge
openssl                   3.0.5                h166bdaf_2    conda-forge
packaging                 21.3               pyhd8ed1ab_0    conda-forge
pandas                    1.5.0            py39h4661b88_0    conda-forge
pcre                      8.45                 h9c3ff4c_0    conda-forge
pillow                    9.2.0            py39hd5dbb17_2    conda-forge
pip                       22.2.2             pyhd8ed1ab_0    conda-forge
pluggy                    1.0.0            py39hf3d152e_3    conda-forge
protobuf                  3.20.1           py39h5a03fae_0    conda-forge
psutil                    5.9.2            py39hb9d737c_0    conda-forge
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
py                        1.11.0             pyh6c4a22f_0    conda-forge
pyasn1                    0.4.8                      py_0    conda-forge
pyasn1-modules            0.2.8                      py_0  
pycodestyle               2.9.1              pyhd8ed1ab_0    conda-forge
pycparser                 2.21               pyhd8ed1ab_0    conda-forge
pydeprecate               0.3.2              pyhd8ed1ab_0    conda-forge
pyflakes                  2.5.0              pyhd8ed1ab_0    conda-forge
pyjwt                     2.5.0              pyhd8ed1ab_0    conda-forge
pyopenssl                 22.0.0             pyhd8ed1ab_1    conda-forge
pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
pyqt                      5.12.3           py39hf3d152e_8    conda-forge
pyqt-impl                 5.12.3           py39hde8b62d_8    conda-forge
pyqt5-sip                 4.19.18          py39he80948d_8    conda-forge
pyqtchart                 5.12             py39h0fcd23e_8    conda-forge
pyqtwebengine             5.12.1           py39h0fcd23e_8    conda-forge
pysocks                   1.7.1            py39hf3d152e_5    conda-forge
pytest                    7.1.3            py39hf3d152e_0    conda-forge
pytest-cov                3.0.0              pyhd8ed1ab_0    conda-forge
python                    3.9.13          h2660328_0_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python-louvain            0.15               pyhd8ed1ab_1    conda-forge
python_abi                3.9                      2_cp39    conda-forge
pytorch                   1.12.1          cuda112py39ha0cca9b_200    conda-forge
pytorch-gpu               1.12.1          cuda112py39h1894f8f_200    conda-forge
pytorch-lightning         1.6.3              pyhd8ed1ab_0    conda-forge
pytorch_cluster           1.5.9            py39hbba90f3_0    conda-forge
pytorch_geometric         2.0.3              pyhd8ed1ab_0    conda-forge
pytorch_scatter           2.0.9           cuda112py39h83a068c_0    conda-forge
pytorch_sparse            0.6.15           py39h83a068c_0    conda-forge
pytz                      2022.2.1           pyhd8ed1ab_0    conda-forge
pyu2f                     0.1.5              pyhd8ed1ab_0    conda-forge
pyyaml                    6.0              py39hb9d737c_4    conda-forge
qt                        5.12.9               h1304e3e_6    conda-forge
rdflib                    6.2.0              pyhd8ed1ab_0    conda-forge
re2                       2022.06.01           h27087fc_0    conda-forge
readline                  8.1.2                h0f457ee_0    conda-forge
requests                  2.28.1             pyhd8ed1ab_1    conda-forge
requests-oauthlib         1.3.1              pyhd8ed1ab_0    conda-forge
rsa                       4.9                pyhd8ed1ab_0    conda-forge
scikit-learn              1.1.2            py39he5e8d7e_0    conda-forge
scipy                     1.9.1            py39h8ba3f38_0    conda-forge
setuptools                59.5.0           py39hf3d152e_0    conda-forge
setuptools-scm            6.3.2              pyhd8ed1ab_0    conda-forge
setuptools_scm            6.3.2                hd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sleef                     3.5.1                h28343ad_2    conda-forge
sqlite                    3.39.3               h4ff8645_0    conda-forge
tensorboard               2.6.0                      py_0  
tensorboard-plugin-wit    1.8.1              pyhd8ed1ab_0    conda-forge
threadpoolctl             3.1.0              pyh8a188c0_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
toml                      0.10.2             pyhd8ed1ab_0    conda-forge
tomli                     2.0.1              pyhd8ed1ab_0    conda-forge
torchani                  2.2.2           cuda112py39h527ec63_6    conda-forge
torchmd-net               0.2.4                     dev_0    <develop>
torchmetrics              0.8.2              pyhd8ed1ab_0    conda-forge
tornado                   6.2              py39hb9d737c_0    conda-forge
tqdm                      4.64.1             pyhd8ed1ab_0    conda-forge
typing-extensions         4.3.0                hd8ed1ab_0    conda-forge
typing_extensions         4.3.0              pyha770c72_0    conda-forge
tzdata                    2022c                h191b570_0    conda-forge
unicodedata2              14.0.0           py39hb9d737c_1    conda-forge
urllib3                   1.26.11            pyhd8ed1ab_0    conda-forge
webencodings              0.5.1                      py_1    conda-forge
werkzeug                  2.2.2              pyhd8ed1ab_0    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xorg-libxau               1.0.9                h14c3975_0    conda-forge
xorg-libxdmcp             1.1.3                h516909a_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
yaml                      0.2.5                h7f98852_2    conda-forge
yarl                      1.8.1            py39h5eee18b_0  
zipp                      3.8.1              pyhd8ed1ab_0    conda-forge
zlib                      1.2.12               h166bdaf_3    conda-forge
zstd                      1.5.2                h6239696_4    conda-forge

peastman commented 2 years ago

Torch-cluster is indeed the source of the problem. I replaced the conda-forge build with one from PyPI with

pip install --force-reinstall torch_cluster

and the crash went away.

raimis commented 2 years ago

Did you install the same version of torch_cluster with conda and pip?

peastman commented 2 years ago

They're slightly different versions. The PyPI version is 1.6.0, but the most recent version on conda-forge is 1.5.9.
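
A quick way to compare what is installed before and after such a swap is to read the package metadata, which works even when importing the package aborts. This is a sketch using the standard library; the function name `installed_versions` is hypothetical, and the distribution names passed in are assumptions that may differ between conda and pip builds.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

For example, `installed_versions(["torch", "torch_cluster"])` returns a dictionary that can be printed in each environment and compared side by side.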

RaulPPelaez commented 10 months ago

Closing this since torch_cluster is no longer a dependency.