gnina / libmolgrid

Comprehensive library for fast, GPU accelerated molecular gridding for deep learning workflows
https://gnina.github.io/libmolgrid/
Apache License 2.0

Segmentation fault #32

Closed koshkabb closed 4 years ago

koshkabb commented 4 years ago

Hello again!

I have a problem in which the next() method of ExampleProvider is causing a Segmentation Fault. After digging a bit into it, I found out that this happens if matplotlib (or something else that matplotlib is importing) is imported before molgrid:

>>> import matplotlib
>>> import molgrid
>>>
>>> e = molgrid.ExampleProvider()
>>> e.populate('single.types')
>>>
>>> e.next()
Segmentation fault

This doesn't happen if the order of the imports changes:

>>> import molgrid
>>> import matplotlib
>>> 
>>> e = molgrid.ExampleProvider()
>>> e.populate('single.types')
>>>
>>> e.next()
<molgrid.molgrid.Example object at 0x7fee343d69b0>
>>> 

I was just wondering if you have come across this issue or a similar one before. Thank you!

koshkabb commented 4 years ago

If it helps, I executed the problematic code with gdb python and got the following:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007f15212e6a54 in std::ctype<wchar_t>::do_scan_is (this=0x7ffcea58daf0, __m=16672, __lo=0x20 <error: Cannot access memory at address 0x20>, 
    __hi=0x46 <error: Cannot access memory at address 0x46>) at ctype_members.cc:188
188     ctype_members.cc: No such file or directory.

dkoes commented 4 years ago

I can't reproduce this. What version of python and matplotlib are you using? Make sure you don't have multiple copies of molgrid installed.

In [1]: import matplotlib

In [2]: import molgrid

In [3]: e = molgrid.ExampleProvider()

In [4]: e.populate('single.types')

In [5]: e.next()
Out[5]: <molgrid.molgrid.Example at 0x7fb0e91685b0>

koshkabb commented 4 years ago

I am using Python 3.7.3 and matplotlib 3.2.1. (I was previously on matplotlib 3.1.0 and updated in case that was the issue, but it wasn't.) There is only one molgrid installed. :S
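
For anyone who wants to double-check the "only one copy installed" part, here is a rough sketch rather than an official molgrid tool. It walks the import search path and reports every place a package or module with a given name turns up. The helper name `find_package_copies` is made up for illustration, and it only looks at top-level entries on `sys.path`.

```python
import os
import sys


def find_package_copies(name, search_paths=None):
    """Return the real paths of every top-level package/module called `name`.

    `search_paths` defaults to sys.path; pass a list to check other locations.
    This is an illustrative helper, not part of molgrid.
    """
    paths = sys.path if search_paths is None else search_paths
    hits = []
    for entry in paths:
        base = entry or os.getcwd()  # an empty sys.path entry means the cwd
        if not os.path.isdir(base):
            continue
        try:
            items = os.listdir(base)
        except OSError:
            continue  # unreadable directory: skip it
        for item in items:
            full = os.path.join(base, item)
            is_pkg = os.path.isdir(full) and item == name
            is_mod = (
                os.path.isfile(full)
                and item.split(".")[0] == name
                and item.endswith((".py", ".so", ".pyd"))
            )
            if is_pkg or is_mod:
                real = os.path.realpath(full)
                if real not in hits:  # the same directory can be listed twice
                    hits.append(real)
    return hits


if __name__ == "__main__":
    for path in find_package_copies("molgrid"):
        print(path)
```

More than one printed path would suggest that several copies are visible to the interpreter; a single path matches the "only one molgrid" situation described here.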

dkoes commented 4 years ago

Can you provide the full stack trace of the segfault?

koshkabb commented 4 years ago

Hello!

Yes, here it is. Forgot to mention I am running it in a docker container with Ubuntu 18.04.

#0  0x00007f15212e6a54 in std::ctype<wchar_t>::do_scan_is (this=0x7ffcea58daf0, __m=16672, __lo=0x20 <error: Cannot access memory at address 0x20>, 
    __hi=0x46 <error: Cannot access memory at address 0x46>) at ctype_members.cc:188
#1  0x00007f1521321a27 in std::__cxx11::numpunct<char>::grouping (this=0x7f1521c24120 <(anonymous namespace)::ctype_w>)
    at /tmp/gcc-6.2.0_build/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/locale_facets.h:1777
#2  std::__facet_shims::__numpunct_fill_cache<char> (f=f@entry=0x7f1521c24120 <(anonymous namespace)::ctype_w>, c=c@entry=0x55ce1e41bec0)
    at ../../../../../gcc-6.2.0/libstdc++-v3/src/c++11/cxx11-shim_facets.cc:514
#3  0x00007f15212c3343 in std::__facet_shims::(anonymous namespace)::numpunct_shim<char>::numpunct_shim (c=0x55ce1e41bec0, f=0x7f1521c24120 <(anonymous namespace)::ctype_w>, this=0x55ce1f3718f0)
    at ../../../../../gcc-6.2.0/libstdc++-v3/src/c++11/cxx11-shim_facets.cc:238
#4  std::locale::facet::_M_cow_shim (this=this@entry=0x7f1521c24120 <(anonymous namespace)::ctype_w>, which=<optimized out>)
    at ../../../../../gcc-6.2.0/libstdc++-v3/src/c++11/cxx11-shim_facets.cc:797
#5  0x00007f15212b52fc in std::locale::_Impl::_M_install_facet (this=this@entry=0x55ce1f39d4d0, __idp=<optimized out>, __fp=0x7f1521c24120 <(anonymous namespace)::ctype_w>)
    at ../../../../../gcc-6.2.0/libstdc++-v3/src/c++98/locale.cc:384
#6  0x00007f15212b546e in std::locale::_Impl::_M_replace_facet (this=this@entry=0x55ce1f39d4d0, __imp=__imp@entry=0x7f1521c24da0 <(anonymous namespace)::c_locale_impl>, __idp=<optimized out>)
    at ../../../../../gcc-6.2.0/libstdc++-v3/src/c++98/locale.cc:308
#7  0x00007f15212b54a7 in std::locale::_Impl::_M_replace_category (this=0x55ce1f39d4d0, __imp=0x7f1521c24da0 <(anonymous namespace)::c_locale_impl>, 
    __idpp=0x7f1521b5a730 <std::locale::_Impl::_S_id_numeric+16>) at ../../../../../gcc-6.2.0/libstdc++-v3/src/c++98/locale.cc:297
#8  0x00007f15212b3cec in std::locale::_Impl::_M_replace_categories (this=this@entry=0x55ce1f39d4d0, __imp=0x7f1521c24da0 <(anonymous namespace)::c_locale_impl>, __cat=__cat@entry=2)
    at ../../../../../gcc-6.2.0/libstdc++-v3/src/c++98/localename.cc:336
#9  0x00007f15212b3e9b in std::locale::_M_coalesce (this=this@entry=0x7ffcea58dcd0, __base=..., __add=..., __cat=__cat@entry=2) at ../../../../../gcc-6.2.0/libstdc++-v3/src/c++98/localename.cc:166
#10 0x00007f15212b3f59 in std::locale::locale (this=0x7ffcea58dcd0, __base=..., __s=<optimized out>, __cat=2) at ../../../../../gcc-6.2.0/libstdc++-v3/src/c++98/localename.cc:151

dkoes commented 4 years ago

What is your locale?

koshkabb commented 4 years ago

The output of locale is:

LANG=en_US.UTF-8
LANGUAGE=en_US.en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

Actually, I don't know if it's related or if it helps, but I am having the same problem as the issue described in https://github.com/gnina/libmolgrid/issues/29 when single.types contains 0 protein.pdb ligand.sdf (with a label):

>>> import matplotlib
>>> import molgrid
>>> e = molgrid.ExampleProvider()
>>> e.populate('single.types')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Missing molecular data in line: 0 protein.pdb ligand.sdf

If single.types contains just protein.pdb ligand.sdf (no label), then there is no error in e.populate, but I get the segmentation fault in e.next().

dkoes commented 4 years ago

Can you provide a container that reproduces this problem?

koshkabb commented 4 years ago

Hi, yes, I've created a very simple dockerfile (can't upload it here, so I'm pasting the text):

FROM nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3-pip \
        wget

RUN pip3 install molgrid
RUN pip3 install matplotlib
RUN pip3 install torch

RUN wget https://files.rcsb.org/download/6HJ2.pdb
RUN wget http://files.rcsb.org/ligands/view/P06_ideal.sdf

RUN echo "6HJ2.pdb P06_ideal.sdf" > single.types

When you access its terminal, just run python3 and:

import matplotlib
import molgrid
e = molgrid.ExampleProvider()
e.populate('single.types')
e.next()

koshkabb commented 4 years ago

I was just able to narrow it down even more to kiwisolver (matplotlib imports kiwisolver), which doesn't import anything else:

>>> import kiwisolver
>>> import molgrid
>>> e = molgrid.ExampleProvider()
>>> e.populate('single.types')
>>> example = e.next()
Segmentation fault

koshkabb commented 4 years ago

Hello, I hope you had a nice weekend. Just wanted to let you know that I've found a "silly" way of fixing the issue. If I do from torch import cuda, the segmentation fault doesn't happen:

>>> import kiwisolver
>>> from torch import cuda
>>> import molgrid
>>> fname = 'single.types'
>>> e = molgrid.ExampleProvider()
>>> e.populate(fname)
>>> e.next()
<molgrid.molgrid.Example object at 0x7ff59b1c0cb0>

Maybe this gives you a clue as to why the segmentation fault happened :S

dkoes commented 4 years ago

My assumption is that this is due to how the pip package is built: we bundle older libraries (including libstdc++) to maximize portability. There is probably some interaction between the bundled libstdc++ and the system libstdc++ regarding locales, but I haven't been able to figure it out yet.

Do you still have the problem if you build from source?

koshkabb commented 4 years ago

I've tried building from source but I got stuck: it can't find the openbabel header files.

dkoes commented 4 years ago

Okay, I think I may have figured it out. We need to statically link libstdc++ to have a manylinux-compliant image, since manylinux doesn't let you use a C++14-compatible libstdc++. However, the static copy ends up conflicting with the dynamically loaded libstdc++ that kiwisolver pulls in, so it is necessary to hide the externally exported symbols from the static libstdc++.
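
To see which libstdc++ shared objects have actually been loaded into a Python process, something like the following can help. This is only a sketch: it assumes Linux (it reads /proc/self/maps), and `shared_objects_matching` is a name invented for this example. Note that a statically linked libstdc++ lives inside the extension module itself, so it will not show up here under its own file name; only dynamically loaded copies are listed.

```python
import os


def shared_objects_matching(needle, maps_text=None):
    """Return sorted, de-duplicated paths of mapped files whose name contains `needle`.

    By default the current process's /proc/self/maps is read (Linux only);
    pass `maps_text` to analyse a saved copy instead.
    """
    if maps_text is None:
        with open("/proc/self/maps") as handle:
            maps_text = handle.read()
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address, perms, offset, dev, inode, optional pathname
        fields = line.split(None, 5)
        if len(fields) == 6 and needle in fields[5].rsplit("/", 1)[-1]:
            paths.add(fields[5].strip())
    return sorted(paths)


if __name__ == "__main__" and os.path.exists("/proc/self/maps"):
    # Import the modules of interest before this line to see what they load.
    for path in shared_objects_matching("libstdc++"):
        print(path)
```

Running it after import kiwisolver and import molgrid, in either order, shows whether a system libstdc++ is loaded alongside the molgrid extension.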

Can you try testing out this version: https://test.pypi.org/project/molgrid/0.1.2/

koshkabb commented 4 years ago

Thanks a lot for the explanation! I feel like now I understand it a bit more :)

That version fixes it! Amazing! I'll keep an eye out for when it goes live.

dkoes commented 4 years ago

I've pushed a release; it should be available through regular pip now. This release also adds an iterator interface for ExampleProvider (but be careful: it will iterate forever. In the next release I'll add a concept of epochs).
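
Until epochs exist, one way to keep a loop over an endless iterator from running forever is to cap it with itertools.islice. The sketch below assumes the provider behaves like an ordinary Python iterator, as described above. `endless_examples` is a hypothetical stand-in used so the snippet runs without molgrid, and `take` is a helper name chosen for this example.

```python
from itertools import islice


def take(provider, n_examples):
    """Return at most `n_examples` items from a possibly endless iterator."""
    return list(islice(provider, n_examples))


def endless_examples():
    """Hypothetical stand-in for a provider that never stops yielding."""
    index = 0
    while True:
        yield "example-%d" % index
        index += 1


if __name__ == "__main__":
    # With molgrid this would be take(e, 3) for an ExampleProvider e (assumption).
    print(take(endless_examples(), 3))
```

The cap plays the role of an epoch length chosen by the caller, for instance the number of lines in the types file.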

koshkabb commented 4 years ago

Awesome, thank you so much !!