Closed: markperri closed this issue 3 months ago
The following things can be helpful to identify the issue:
1. Does the same calculation run correctly on CPU with plain PySCF?

import pyscf
from pyscf.dft import rks

atom = '''
O 0.0000000000 -0.0000000000 0.1174000000
H -0.7570000000 -0.0000000000 -0.4696000000
H 0.7570000000 0.0000000000 -0.4696000000
'''

mol = pyscf.M(atom=atom, basis='def2-tzvpp')
mf = rks.RKS(mol, xc='LDA').density_fit()
e_dft = mf.kernel()  # compute total energy

2. What is your GPU type?
3. What is the message before the segmentation fault?
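To answer the GPU-type question programmatically, a helper along these lines can dump the relevant device properties. This is a sketch, not part of the thread: it assumes CuPy (a gpu4pyscf dependency) and returns None when CuPy is missing or no CUDA device is visible.

```python
def gpu_info(device_id=0):
    """Return a few CUDA device properties, or None if unavailable."""
    try:
        from cupy.cuda import runtime
        props = runtime.getDeviceProperties(device_id)
    except Exception:
        return None  # CuPy missing or no CUDA device visible
    return {
        'name': props['name'].decode(),          # e.g. 'GRID A100X-8C'
        'totalGlobalMem': props['totalGlobalMem'],
        'managedMemory': props['managedMemory'],  # unified-memory support flag
    }
```

The managedMemory flag turns out to matter later in this thread.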
Thanks for the quick response.
converged SCF energy = -75.2427927513195
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 GRID A100X-8C On | 00000000:04:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 8192MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O 0.0000000000 -0.0000000000 0.1174000000
... H -0.7570000000 -0.0000000000 -0.4696000000
... H 0.7570000000 0.0000000000 -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>> e_dft = mf.kernel() # compute total energy
Segmentation fault
/var/log/messages:
kernel: python[445393]: segfault at 0 ip 00007f8cb2b6fcff sp 00007ffc13e36838 error 6 in libxc.so[7f8cb2b65000+1e0000]
kernel: Code: a9 1a 00 48 8b 44 24 08 48 83 c4 18 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 05 a9 73 1d 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38
Thanks, Mark
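As an aside, that kernel line already narrows things down: the fault address is 0 (a NULL-pointer write) and error code 6 means a user-mode write to an unmapped page, inside libxc.so. A small hypothetical helper (not part of the thread) to pull those fields out of such a line:

```python
import re

def parse_segfault(line):
    """Extract fault address, error code, and module from a kernel segfault line."""
    m = re.search(
        r'segfault at ([0-9a-f]+) ip [0-9a-f]+ .* error (\d+) in (\S+)\[', line)
    if not m:
        return None
    addr, err, module = m.groups()
    return {'addr': int(addr, 16), 'error': int(err), 'module': module}
```

Applied to the log line above, it yields address 0, error 6, module libxc.so.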
@markperri Thanks for the info. I tried to create a similar environment, but I was not able to reproduce the issue. If possible, could you please share your docker file?
And you have probably tried this already, but sometimes it helps to reinstall or create a fresh conda environment to rule out package conflicts.
@wxj6000 Here is a minimal dockerfile that gives the same error. I wonder if there's something about the way this system is set up. I'll see if I can find another CUDA application tomorrow to test the installation in general. Thanks, Mark
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
RUN apt-get update -y && \
apt-get install -y --no-install-recommends \
python3-dev \
python3-pip \
python3-wheel \
python3-setuptools && \
rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*
ENV CUDA_HOME="/usr/local/cuda" LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"
RUN echo "export PATH=${CUDA_HOME}/bin:\$PATH" >> /etc/bash.bashrc
RUN echo "export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:\$LD_LIBRARY_PATH" >> /etc/bash.bashrc
RUN pip3 install pyscf gpu4pyscf-cuda12x cutensor-cu12
root@23aed08bf45d:/# python3
Python 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O 0.0000000000 -0.0000000000 0.1174000000
... H -0.7570000000 -0.0000000000 -0.4696000000
... H 0.7570000000 0.0000000000 -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>>
>>> e_dft = mf.kernel() # compute total energy
Segmentation fault (core dumped)
/var/log/messages:
python3[506069]: segfault at 0 ip 00007fa842b6fcff sp 00007ffc591d6028 error 6 in libxc.so[7fa842b65000+1e0000]
kernel: Code: a9 1a 00 48 8b 44 24 08 48 83 c4 18 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 05 a9 73 1d 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38
@wxj6000 I ran a NAMD container from NVIDIA NGC and it runs fine on the GPU, so at least we know the docker / GPU setup is working. I'm not sure what else to test.
Fri Jul 26 14:40:34 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 GRID A100X-40C On | 00000000:04:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 672MiB / 40960MiB | 61% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 11413 C namd2 671MiB |
+---------------------------------------------------------------------------------------+
@markperri I tried the docker file you provided. The docker container works fine on my side. Let me check if there is a memory leak in the modules.
Python 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O 0.0000000000 -0.0000000000 0.1174000000
... H -0.7570000000 -0.0000000000 -0.4696000000
... H 0.7570000000 0.0000000000 -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>>
>>> e_dft = mf.kernel() # compute total energy
/usr/local/lib/python3.10/dist-packages/cupy/cuda/compiler.py:233: PerformanceWarning: Jitify is performing a one-time only warm-up to populate the persistent cache, this may take a few seconds and will be improved in a future release...
jitify._init_module()
converged SCF energy = -75.2427927513248
>>> print(f"total energy = {e_dft}")
total energy = -75.24279275132476
>>>
@wxj6000 I compiled gpu4pyscf from source and it still gives the same error. I'll contact the Jetstream2 staff and see if they have any ideas.
Thanks, Mark
@markperri I went through the libxc-related code and improved its memory-allocation interface. But I am not sure whether it helps on your side. https://github.com/pyscf/gpu4pyscf/actions/runs/10133763490/job/28019283314?pr=189
Thanks for trying. I compiled from source with 8fdfaa8, but I get the same segfault:
kernel: python[43743]: segfault at 0 ip 00007f3fad76ddf3 sp 00007ffeda1ba2c8 error 6 in libxc.so.15[7f3fad763000+224000]
kernel: Code: 00 00 00 75 05 48 83 c4 18 c3 e8 58 68 ff ff 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 8b 05 b5 b2 21 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38
@markperri Can you check if this PR resolves the issue please? https://github.com/pyscf/gpu4pyscf/pull/180
Thanks, is that the libxc_overhead branch? I installed it, but it doesn't seem to help:
pip install git+https://github.com/pyscf/gpu4pyscf.git@libxc_overhead
pip install cutensor-cu12
(base) jovyan@d67ddf22943d:/tmp$ python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O 0.0000000000 -0.0000000000 0.1174000000
... H -0.7570000000 -0.0000000000 -0.4696000000
... H 0.7570000000 0.0000000000 -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>>
>>> e_dft = mf.kernel() # compute total energy
Segmentation fault
Right. It is the libxc_overhead branch. Just to confirm, have you removed the existing package if it was installed?
And I registered an account on ChemCompute, but I don't have access to JupyterHub since I no longer have an academic email. Is there any chance of getting a development environment for debugging?
Yes, this is without any gpu4pyscf installed.
Oh and @wxj6000 you should have Jupyter Notebook access now. Thanks, Mark
@markperri Thank you for giving me permission to debug. It seems that unified memory, which is required by libxc.so, is disabled on this device. Please check the managedMemory entry in the device-properties dict below, and see the CUDA documentation for details: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
We can switch to libxc on the CPU if unified memory is not supported on the device. We will let you know the progress.
{'name': b'GRID A100X-8C', 'totalGlobalMem': 8585609216, 'sharedMemPerBlock': 49152, 'regsPerBlock': 65536, 'warpSize': 32, 'maxThreadsPerBlock': 1024, 'maxThreadsDim': (1024, 1024, 64), 'maxGridSize': (2147483647, 65535, 65535), 'clockRate': 1410000, 'totalConstMem': 65536, 'major': 8, 'minor': 0, 'textureAlignment': 512, 'texturePitchAlignment': 32, 'multiProcessorCount': 108, 'kernelExecTimeoutEnabled': 0, 'integrated': 0, 'canMapHostMemory': 1, 'computeMode': 0, 'maxTexture1D': 131072, 'maxTexture2D': (131072, 65536), 'maxTexture3D': (16384, 16384, 16384), 'concurrentKernels': 1, 'ECCEnabled': 1, 'pciBusID': 4, 'pciDeviceID': 0, 'pciDomainID': 0, 'tccDriver': 0, 'memoryClockRate': 1215000, 'memoryBusWidth': 5120, 'l2CacheSize': 41943040, 'maxThreadsPerMultiProcessor': 2048, 'isMultiGpuBoard': 0, 'cooperativeLaunch': 1, 'cooperativeMultiDeviceLaunch': 1, 'deviceOverlap': 1, 'maxTexture1DMipmap': 32768, 'maxTexture1DLinear': 268435456, 'maxTexture1DLayered': (32768, 2048), 'maxTexture2DMipmap': (32768, 32768), 'maxTexture2DLinear': (131072, 65000, 2097120), 'maxTexture2DLayered': (32768,32768, 2048), 'maxTexture2DGather': (32768, 32768), 'maxTexture3DAlt': (8192, 8192, 32768), 'maxTextureCubemap': 32768, 'maxTextureCubemapLayered': (32768, 2046), 'maxSurface1D': 32768, 'maxSurface1DLayered': (32768, 2048), 'maxSurface2D': (131072, 65536), 'maxSurface2DLayered': (32768, 32768, 2048), 'maxSurface3D': (16384, 16384, 16384), 'maxSurfaceCubemap': 32768,'maxSurfaceCubemapLayered': (32768, 2046), 'surfaceAlignment': 512, 'asyncEngineCount': 5, 'unifiedAddressing': 1, 'streamPrioritiesSupported': 1, 'globalL1CacheSupported': 1, 'localL1CacheSupported': 1, 'sharedMemPerMultiprocessor': 167936, 'regsPerMultiprocessor': 65536, 'managedMemory': 0, 'multiGpuBoardGroupID': 0, 'hostNativeAtomicSupported': 0, 'singleToDoublePrecisionPerfRatio': 2, 'pageableMemoryAccess': 0, 'concurrentManagedAccess': 0, 'computePreemptionSupported': 1, 'canUseHostPointerForRegisteredMem': 0, 
'sharedMemPerBlockOptin': 166912, 'pageableMemoryAccessUsesHostPageTables': 0, 'directManagedMemAccessFromHost': 0, 'uuid': b'_:\x16\x9f_\xd6\x11\xef\xbex\x9d\x11\x11\x8e+\xa9', 'luid': b'', 'luidDeviceNodeMask': 0,'persistingL2CacheMaxSize': 26214400, 'maxBlocksPerMultiProcessor': 32, 'accessPolicyMaxWindowSize': 134213632, 'reservedSharedMemPerBlock': 1024}
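The decisive field in that dump is 'managedMemory': 0. A minimal sketch of the kind of guard described above (hypothetical function name; the actual fix landed in gpu4pyscf, not in this exact form), assuming props is the dict returned by cupy.cuda.runtime.getDeviceProperties(0):

```python
def use_gpu_libxc(props):
    """Use the GPU libxc path only when the device supports managed (unified) memory."""
    return bool(props.get('managedMemory', 0))

# On the GRID A100X-8C vGPU above, managedMemory is 0, so this returns
# False and the calculation should fall back to evaluating libxc on the CPU.
```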
Oh I see. The way their hypervisor works with vGPUs doesn't allow unified memory. Looks like this package won't be compatible with their system then. Thanks, Mark
@markperri The issue has been fixed in v1.0.1. Most tasks can be executed on ChemCompute now. However, due to the limited memory of a GPU slice, some tasks such as Hessian calculations may raise an out-of-memory error.
Thank you for your feedback and your cooperation!
Thanks! It works great now. I increased the instance size to use the entire GPU and the out-of-memory problems are fixed. However, I had to install it from GitHub; there is something wrong with the package on PyPI. It just downloads all versions and then gives up.
(base) jovyan@7db95487cf10:/tmp$ pip install gpu4pyscf
Collecting gpu4pyscf
Downloading gpu4pyscf-1.0.1.tar.gz (206 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 206.8/206.8 kB 6.1 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
WARNING: Generating metadata for package gpu4pyscf produced metadata for project name gpu4pyscf-cuda12x. Fix your #egg=gpu4pyscf fragments.
Discarding https://files.pythonhosted.org/packages/3d/68/07452d97f874c77d622e42969fb54c265a734d4f7be86f18944400625bb2/gpu4pyscf-1.0.1.tar.gz (from https://pypi.org/simple/gpu4pyscf/): Requested gpu4pyscf-cuda12x from https://files.pythonhosted.org/packages/3d/68/07452d97f874c77d622e42969fb54c265a734d4f7be86f18944400625bb2/gpu4pyscf-1.0.1.tar.gz has inconsistent name: expected 'gpu4pyscf', but metadata has 'gpu4pyscf-cuda12x'
Downloading gpu4pyscf-1.0.tar.gz (204 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 205.0/205.0 kB 19.0 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
WARNING: Generating metadata for package gpu4pyscf produced metadata for project name gpu4pyscf-cuda12x. Fix your #egg=gpu4pyscf fragments.
Discarding https://files.pythonhosted.org/packages/17/00/a9bfefd38206230cd4542106b13cc1d08dcdc6b76f0be112bb4be5fb23f4/gpu4pyscf-1.0.tar.gz (from https://pypi.org/simple/gpu4pyscf/): Requested gpu4pyscf-cuda12x from https://files.pythonhosted.org/packages/17/00/a9bfefd38206230cd4542106b13cc1d08dcdc6b76f0be112bb4be5fb23f4/gpu4pyscf-1.0.tar.gz has inconsistent name: expected 'gpu4pyscf', but metadata has 'gpu4pyscf-cuda12x'
Downloading gpu4pyscf-0.8.2.tar.gz (204 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 204.9/204.9 kB 13.6 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
WARNING: Generating metadata for package gpu4pyscf produced metadata for project name gpu4pyscf-cuda12x. Fix your #egg=gpu4pyscf fragments.
Discarding https://files.pythonhosted.org/packages/bb/dc/b33d96a33a406758cf9cd0ea14e5654c3d1310ee9ae7ff466ed6567816ae/gpu4pyscf-0.8.2.tar.gz (from https://pypi.org/simple/gpu4pyscf/): Requested gpu4pyscf-cuda12x from https://files.pythonhosted.org/packages/bb/dc/b33d96a33a406758cf9cd0ea14e5654c3d1310ee9ae7ff466ed6567816ae/gpu4pyscf-0.8.2.tar.gz has inconsistent name: expected 'gpu4pyscf', but metadata has 'gpu4pyscf-cuda12x'
It continues to download older versions of gpu4pyscf and then errors out.
@markperri pip3 install gpu4pyscf-cuda12x will resolve the issue.
Oh yes, sorry. Forgot that part!
I installed pyscf into my environment in a Jupyter Notebook Docker container running Ubuntu 22.04 and Python 3.11:
pip install pyscf gpu4pyscf-cuda12x cutensor-cu12
When I test with the given example, I get a segfault:
Segmentation fault
It looks like I have two libxc.so:
Do you have any thoughts on how to fix the segfault?
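One way to confirm which libxc copies are on the import path is to scan site-packages for bundled shared libraries. This is a sketch, not from the original thread; it assumes pyscf and gpu4pyscf each ship their own libxc under site-packages, as the "two libxc.so" remark above suggests.

```python
import glob
import os
import sysconfig

def find_libxc():
    """List every bundled libxc shared library under site-packages."""
    site = sysconfig.get_paths()['purelib']
    pattern = os.path.join(site, '**', '*libxc*.so*')
    return sorted(glob.glob(pattern, recursive=True))

# Two or more hits would mean both packages loaded their own copy,
# which is worth checking when a segfault points into libxc.so.
```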