pyscf / gpu4pyscf

A plugin to use NVIDIA GPUs with the PySCF package
GNU General Public License v3.0

Segfault libxc.so #188

Closed markperri closed 3 months ago

markperri commented 4 months ago

I installed pyscf into my environment in a Jupyter Notebook Docker container running Ubuntu 22.04 and Python 3.11:

pip install pyscf gpu4pyscf-cuda12x cutensor-cu12

When I test with the provided example, I get a segfault:

import pyscf
from gpu4pyscf.dft import rks

atom ='''
O       0.0000000000    -0.0000000000     0.1174000000
H      -0.7570000000    -0.0000000000    -0.4696000000
H       0.7570000000     0.0000000000    -0.4696000000
'''

mol = pyscf.M(atom=atom, basis='def2-tzvpp')
mf = rks.RKS(mol, xc='LDA').density_fit()

e_dft = mf.kernel()  # compute total energy

Segmentation fault

kernel: python[394296]: segfault at 0 ip 00007f506ab6fcff sp 00007ffdbc440778 error 6 in libxc.so[7f506ab65000+1e0000]
kernel: Code: a9 1a 00 48 8b 44 24 08 48 83 c4 18 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 05 a9 73 1d 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38

It looks like I have two copies of libxc.so:

/opt/conda/lib/python3.11/site-packages/gpu4pyscf/lib/deps/lib/libxc.so
/opt/conda/lib/python3.11/site-packages/pyscf/lib/deps/lib/libxc.so

pip freeze | grep scf
gpu4pyscf-cuda12x==1.0
gpu4pyscf-libxc-cuda12x==0.4
pyscf==2.6.2
pyscf-dispersion==1.0.2
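
For reference, a quick way to confirm which copies actually get mapped into the Python process is to read /proc/self/maps after the imports (a rough, Linux-only sketch):

import pyscf                   # pulls in PySCF's CPU libxc
from gpu4pyscf.dft import rks  # pulls in gpu4pyscf's libxc build

# list every libxc shared object currently mapped into this process
with open('/proc/self/maps') as f:
    paths = {line.split()[-1] for line in f if 'libxc' in line}
print('\n'.join(sorted(paths)))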

Do you have any thoughts on how to fix the segfault?

wxj6000 commented 4 months ago

The following things can be helpful to identify the issue:

1. Run the following code to see whether the issue is related to the libxc.so in PySCF or the libxc.so (CUDA version) in gpu4pyscf:

import pyscf
from pyscf.dft import rks

atom = '''
O       0.0000000000    -0.0000000000     0.1174000000
H      -0.7570000000    -0.0000000000    -0.4696000000
H       0.7570000000     0.0000000000    -0.4696000000
'''

mol = pyscf.M(atom=atom, basis='def2-tzvpp')
mf = rks.RKS(mol, xc='LDA').density_fit()

e_dft = mf.kernel()  # compute total energy

2. What is your GPU type? (See the snippet after this list for a way to print it from Python.)
3. What is the message printed before the segmentation fault?
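
For item 2, the device properties can be printed directly from Python (a small sketch, assuming CuPy, which gpu4pyscf depends on, is installed):

import cupy

# ask the CUDA runtime for the properties of device 0
props = cupy.cuda.runtime.getDeviceProperties(0)
print(props['name'].decode(), f"compute capability {props['major']}.{props['minor']}")
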
markperri commented 4 months ago

Thanks for the quick response.

  1. That code runs fine:

converged SCF energy = -75.2427927513195

  2. I am using an A100-40 on Jetstream2. It is sliced to 1/5 of a GPU by the hypervisor on this VM size. I also tried a g3.xl VM size, which uses the entire GPU, and got the same error.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100X-8C                  On  | 00000000:04:00.0 Off |                    0 |
| N/A   N/A    P0              N/A /  N/A |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
  3. There are no messages before that line; it's just Segmentation fault. I have to look in /var/log/messages to see the details. I'm not sure if that's due to running it in a Docker container.
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>> e_dft = mf.kernel()  # compute total energy
Segmentation fault

/var/log/messages:

kernel: python[445393]: segfault at 0 ip 00007f8cb2b6fcff sp 00007ffc13e36838 error 6 in libxc.so[7f8cb2b65000+1e0000]
kernel: Code: a9 1a 00 48 8b 44 24 08 48 83 c4 18 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 05 a9 73 1d 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38
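
If it would help, I can also rerun with faulthandler enabled so the crash prints a Python-level traceback instead of failing silently (a minimal sketch, standard library only):

import faulthandler
faulthandler.enable()  # dump the Python stack to stderr on SIGSEGV

# ...then run the same failing example as above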

Thanks, Mark

wxj6000 commented 4 months ago

@markperri Thanks for the info. I tried to create a similar environment, but I was not able to reproduce the issue. If possible, could you please share your Dockerfile?

You have probably tried this already, but sometimes reinstalling, or creating a fresh conda environment, helps avoid possible package conflicts.

markperri commented 4 months ago

@wxj6000 Here is a minimal Dockerfile that gives the same error. I wonder if there's something about the way this system is set up. Tomorrow I'll see if I can find another CUDA application to test the installation in general. Thanks, Mark

FROM nvidia/cuda:12.2.0-devel-ubuntu22.04

RUN apt-get update -y && \
    apt-get install -y --no-install-recommends \
    python3-dev \
    python3-pip \
    python3-wheel \
    python3-setuptools && \
    rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*

ENV CUDA_HOME="/usr/local/cuda" LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"
RUN echo "export PATH=${CUDA_HOME}/bin:\$PATH" >> /etc/bash.bashrc
RUN echo "export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:\$LD_LIBRARY_PATH" >> /etc/bash.bashrc

RUN pip3 install pyscf gpu4pyscf-cuda12x cutensor-cu12

root@23aed08bf45d:/# python3
Python 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>>
>>> e_dft = mf.kernel()  # compute total energy
Segmentation fault (core dumped)

/var/log/messages:


python3[506069]: segfault at 0 ip 00007fa842b6fcff sp 00007ffc591d6028 error 6 in libxc.so[7fa842b65000+1e0000]
kernel: Code: a9 1a 00 48 8b 44 24 08 48 83 c4 18 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 05 a9 73 1d 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38
markperri commented 4 months ago

@wxj6000 I ran a NAMD container from NVIDIA NGC and it runs fine on the GPU, so at least we know the Docker/GPU setup is working. I'm not sure what else to test.

Fri Jul 26 14:40:34 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100X-40C                 On  | 00000000:04:00.0 Off |                    0 |
| N/A   N/A    P0              N/A /  N/A |    672MiB / 40960MiB |     61%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     11413      C   namd2                                       671MiB |
+---------------------------------------------------------------------------------------+
wxj6000 commented 4 months ago

@markperri I tried the Dockerfile you provided; the container works fine on my side. Let me check whether there is a memory leak in the modules.

Python 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>> 
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>> 
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>> 
>>> e_dft = mf.kernel()  # compute total energy
/usr/local/lib/python3.10/dist-packages/cupy/cuda/compiler.py:233: PerformanceWarning: Jitify is performing a one-time only warm-up to populate the persistent cache, this may take a few seconds and will be improved in a future release...
  jitify._init_module()
converged SCF energy = -75.2427927513248
>>> print(f"total energy = {e_dft}")
total energy = -75.24279275132476
>>> 
markperri commented 4 months ago

@wxj6000 I compiled gpu4pyscf from source and it still gives the same error. I'll contact the Jetstream2 staff and see if they have any ideas.

Thanks, Mark

wxj6000 commented 4 months ago

@markperri I went through the code related to libxc and improved the interface for memory allocation in libxc, but I am not sure whether it helps on your side. https://github.com/pyscf/gpu4pyscf/actions/runs/10133763490/job/28019283314?pr=189

markperri commented 4 months ago

Thanks for trying. I compiled from source at commit 8fdfaa8, but I get the same segfault:

kernel: python[43743]: segfault at 0 ip 00007f3fad76ddf3 sp 00007ffeda1ba2c8 error 6 in libxc.so.15[7f3fad763000+224000]
kernel: Code: 00 00 00 75 05 48 83 c4 18 c3 e8 58 68 ff ff 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 8b 05 b5 b2 21 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38
wxj6000 commented 3 months ago

@markperri Could you please check whether this PR resolves the issue? https://github.com/pyscf/gpu4pyscf/pull/180

markperri commented 3 months ago

Thanks. Is that the libxc_overhead branch? I installed it, but it doesn't seem to help:

pip install git+https://github.com/pyscf/gpu4pyscf.git@libxc_overhead
pip install cutensor-cu12

(base) jovyan@d67ddf22943d:/tmp$ python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>>
>>> e_dft = mf.kernel()  # compute total energy
Segmentation fault
wxj6000 commented 3 months ago

Right, it is the libxc_overhead branch. Just to confirm: did you remove any previously installed copy of the package?

Also, I registered an account on ChemCompute, but I don't have access to the JupyterHub since I no longer have an academic email. Is there any chance of getting a development environment for debugging?

markperri commented 3 months ago

Yes, this is without any gpu4pyscf installed.

markperri commented 3 months ago

Oh, and @wxj6000, you should have Jupyter Notebook access now. Thanks, Mark

wxj6000 commented 3 months ago

@markperri Thank you for giving me access for debugging. It seems that unified (managed) memory, which the CUDA libxc.so requires, is disabled on this device. See the managedMemory field in the device dict below, and the CUDA documentation for the unified-memory requirements: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements

We can fall back to libxc on the CPU when unified memory is not supported on the device. We will let you know the progress.

{'name': b'GRID A100X-8C', 'totalGlobalMem': 8585609216, 'sharedMemPerBlock': 49152, 'regsPerBlock': 65536, 'warpSize': 32, 'maxThreadsPerBlock': 1024, 'maxThreadsDim': (1024, 1024, 64), 'maxGridSize': (2147483647, 65535, 65535), 'clockRate': 1410000, 'totalConstMem': 65536, 'major': 8, 'minor': 0, 'textureAlignment': 512, 'texturePitchAlignment': 32, 'multiProcessorCount': 108, 'kernelExecTimeoutEnabled': 0, 'integrated': 0, 'canMapHostMemory': 1, 'computeMode': 0, 'maxTexture1D': 131072, 'maxTexture2D': (131072, 65536), 'maxTexture3D': (16384, 16384, 16384), 'concurrentKernels': 1, 'ECCEnabled': 1, 'pciBusID': 4, 'pciDeviceID': 0, 'pciDomainID': 0, 'tccDriver': 0, 'memoryClockRate': 1215000, 'memoryBusWidth': 5120, 'l2CacheSize': 41943040, 'maxThreadsPerMultiProcessor': 2048, 'isMultiGpuBoard': 0, 'cooperativeLaunch': 1, 'cooperativeMultiDeviceLaunch': 1, 'deviceOverlap': 1, 'maxTexture1DMipmap': 32768, 'maxTexture1DLinear': 268435456, 'maxTexture1DLayered': (32768, 2048), 'maxTexture2DMipmap': (32768, 32768), 'maxTexture2DLinear': (131072, 65000, 2097120), 'maxTexture2DLayered': (32768, 32768, 2048), 'maxTexture2DGather': (32768, 32768), 'maxTexture3DAlt': (8192, 8192, 32768), 'maxTextureCubemap': 32768, 'maxTextureCubemapLayered': (32768, 2046), 'maxSurface1D': 32768, 'maxSurface1DLayered': (32768, 2048), 'maxSurface2D': (131072, 65536), 'maxSurface2DLayered': (32768, 32768, 2048), 'maxSurface3D': (16384, 16384, 16384), 'maxSurfaceCubemap': 32768, 'maxSurfaceCubemapLayered': (32768, 2046), 'surfaceAlignment': 512, 'asyncEngineCount': 5, 'unifiedAddressing': 1, 'streamPrioritiesSupported': 1, 'globalL1CacheSupported': 1, 'localL1CacheSupported': 1, 'sharedMemPerMultiprocessor': 167936, 'regsPerMultiprocessor': 65536, 'managedMemory': 0, 'multiGpuBoardGroupID': 0, 'hostNativeAtomicSupported': 0, 'singleToDoublePrecisionPerfRatio': 2, 'pageableMemoryAccess': 0, 'concurrentManagedAccess': 0, 'computePreemptionSupported': 1, 'canUseHostPointerForRegisteredMem': 0, 'sharedMemPerBlockOptin': 166912, 'pageableMemoryAccessUsesHostPageTables': 0, 'directManagedMemAccessFromHost': 0, 'uuid': b'_:\x16\x9f_\xd6\x11\xef\xbex\x9d\x11\x11\x8e+\xa9', 'luid': b'', 'luidDeviceNodeMask': 0, 'persistingL2CacheMaxSize': 26214400, 'maxBlocksPerMultiProcessor': 32, 'accessPolicyMaxWindowSize': 134213632, 'reservedSharedMemPerBlock': 1024}
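
A sketch of the kind of guard we could add (the helper below is hypothetical; the actual fallback will live inside gpu4pyscf):

import cupy

def unified_memory_supported(device_id=0):
    # managedMemory == 0 means cudaMallocManaged is unavailable,
    # as on this vGPU slice
    props = cupy.cuda.runtime.getDeviceProperties(device_id)
    return bool(props['managedMemory'])

# hypothetical switch: use the CUDA libxc only when unified memory works
use_gpu_libxc = unified_memory_supported()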
markperri commented 3 months ago

Oh, I see. The way their hypervisor provisions vGPUs doesn't allow unified memory, so it looks like this package won't be compatible with their system. Thanks, Mark

wxj6000 commented 3 months ago

@markperri The issue has been fixed in v1.0.1, and most tasks can now be executed on ChemCompute. However, due to the limited memory of a GPU slice, some tasks, such as Hessian calculations, may still raise an out-of-memory error.
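
If a task does run out of memory, the free device memory can be checked from Python before launching it (a small sketch using CuPy):

import cupy

# free and total device memory, in bytes
free, total = cupy.cuda.runtime.memGetInfo()
print(f'{free / 1e9:.1f} GB free of {total / 1e9:.1f} GB total')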

Thank you for your feedback and your cooperation!

markperri commented 3 months ago

Thanks! It works great now. I increased the instance size to use the entire GPU, and the out-of-memory problems are gone. But I had to install it from GitHub; there is something wrong with the package on PyPI. It just downloads every version and then gives up.

(base) jovyan@7db95487cf10:/tmp$ pip install gpu4pyscf
Collecting gpu4pyscf
  Downloading gpu4pyscf-1.0.1.tar.gz (206 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 206.8/206.8 kB 6.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  WARNING: Generating metadata for package gpu4pyscf produced metadata for project name gpu4pyscf-cuda12x. Fix your #egg=gpu4pyscf fragments.
Discarding https://files.pythonhosted.org/packages/3d/68/07452d97f874c77d622e42969fb54c265a734d4f7be86f18944400625bb2/gpu4pyscf-1.0.1.tar.gz (from https://pypi.org/simple/gpu4pyscf/): Requested gpu4pyscf-cuda12x from https://files.pythonhosted.org/packages/3d/68/07452d97f874c77d622e42969fb54c265a734d4f7be86f18944400625bb2/gpu4pyscf-1.0.1.tar.gz has inconsistent name: expected 'gpu4pyscf', but metadata has 'gpu4pyscf-cuda12x'
  Downloading gpu4pyscf-1.0.tar.gz (204 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 205.0/205.0 kB 19.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  WARNING: Generating metadata for package gpu4pyscf produced metadata for project name gpu4pyscf-cuda12x. Fix your #egg=gpu4pyscf fragments.
Discarding https://files.pythonhosted.org/packages/17/00/a9bfefd38206230cd4542106b13cc1d08dcdc6b76f0be112bb4be5fb23f4/gpu4pyscf-1.0.tar.gz (from https://pypi.org/simple/gpu4pyscf/): Requested gpu4pyscf-cuda12x from https://files.pythonhosted.org/packages/17/00/a9bfefd38206230cd4542106b13cc1d08dcdc6b76f0be112bb4be5fb23f4/gpu4pyscf-1.0.tar.gz has inconsistent name: expected 'gpu4pyscf', but metadata has 'gpu4pyscf-cuda12x'
  Downloading gpu4pyscf-0.8.2.tar.gz (204 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 204.9/204.9 kB 13.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  WARNING: Generating metadata for package gpu4pyscf produced metadata for project name gpu4pyscf-cuda12x. Fix your #egg=gpu4pyscf fragments.
Discarding https://files.pythonhosted.org/packages/bb/dc/b33d96a33a406758cf9cd0ea14e5654c3d1310ee9ae7ff466ed6567816ae/gpu4pyscf-0.8.2.tar.gz (from https://pypi.org/simple/gpu4pyscf/): Requested gpu4pyscf-cuda12x from https://files.pythonhosted.org/packages/bb/dc/b33d96a33a406758cf9cd0ea14e5654c3d1310ee9ae7ff466ed6567816ae/gpu4pyscf-0.8.2.tar.gz has inconsistent name: expected 'gpu4pyscf', but metadata has 'gpu4pyscf-cuda12x'

It continues to download older versions of gpu4pyscf and then errors out.

wxj6000 commented 3 months ago

@markperri pip3 install gpu4pyscf-cuda12x will resolve the issue. As the warnings in your log show, the sdist published on PyPI under the name gpu4pyscf generates metadata for the project name gpu4pyscf-cuda12x, which is why pip discards every version it downloads.

markperri commented 3 months ago

Oh yes, sorry. Forgot that part!