glotzerlab / hoomd-blue

Molecular dynamics and Monte Carlo soft matter simulation on GPUs.
http://glotzerlab.engin.umich.edu/hoomd-blue
BSD 3-Clause "New" or "Revised" License

Segfault in HPMC unit tests on Crusher. #1479

Closed by joaander 11 months ago

joaander commented 1 year ago

Description

HOOMD segfaults when running the HPMC unit tests on Crusher. To reproduce, build HOOMD in RelWithDebInfo mode using the Crusher environment (https://github.com/glotzerlab/software/pull/263). The segfault does not occur in Debug mode.

Script

$ gdb python
(gdb) run -u -m pytest -v -x -ra hoomd/hpmc/pytest
(gdb) bt
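
To narrow down which tests trigger the crash without attaching a debugger, one option is to run pytest in a child process and check whether it was killed by a signal. The following is a minimal sketch, not part of HOOMD-blue: `describe_exit` and `run_pytest` are hypothetical helper names, and it assumes a POSIX system, where `subprocess` reports death by signal as a negative return code.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Return a readable description of a subprocess return code."""
    if returncode < 0:
        # On POSIX, a negative return code is the negated signal number.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


def run_pytest(target, extra_args=()):
    """Run pytest on one target in a fresh interpreter (hypothetical helper).

    A crash in the child process does not take down the calling script,
    so several targets can be checked in a loop.
    """
    cmd = [sys.executable, "-u", "-m", "pytest", "-x", "-ra", target, *extra_args]
    completed = subprocess.run(cmd)
    return describe_exit(completed.returncode)
```

For example, `run_pytest("hoomd/hpmc/pytest/test_clusters.py", ["-k", "FacetedEllipsoidUnion"])` would restrict the run to the tests whose names contain `FacetedEllipsoidUnion`. If those tests segfault as in the output below, the expected result is the string `killed by SIGSEGV`.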

Input files

No response

Output

hoomd/hpmc/pytest/test_clusters.py::test_valid_construction_and_attach[CPU-FacetedEllipsoid-3-constructor_args1] PASSED                                                     [  5%]
hoomd/hpmc/pytest/test_clusters.py::test_valid_construction_and_attach[CPU-FacetedEllipsoid-3-constructor_args2] PASSED                                                     [  5%]
hoomd/hpmc/pytest/test_clusters.py::test_valid_construction_and_attach[CPU-FacetedEllipsoid-3-constructor_args3] PASSED                                                     [  5%]
hoomd/hpmc/pytest/test_clusters.py::test_valid_construction_and_attach[CPU-FacetedEllipsoid-4-constructor_args0] PASSED                                                     [  5%]
hoomd/hpmc/pytest/test_clusters.py::test_valid_construction_and_attach[CPU-FacetedEllipsoid-4-constructor_args1] PASSED                                                     [  5%]
hoomd/hpmc/pytest/test_clusters.py::test_valid_construction_and_attach[CPU-FacetedEllipsoid-4-constructor_args2] PASSED                                                     [  5%]
hoomd/hpmc/pytest/test_clusters.py::test_valid_construction_and_attach[CPU-FacetedEllipsoid-4-constructor_args3] PASSED                                                     [  5%]
hoomd/hpmc/pytest/test_clusters.py::test_valid_construction_and_attach[CPU-FacetedEllipsoidUnion-1-constructor_args0] 
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffdc24e2629 in hoomd::ManagedArray<float>::ManagedArray (this=<optimized out>) at /ccs/home/anderjo/devel/hoomd/hoomd/ManagedArray.h:34
34          : data(nullptr), ptr(nullptr), N(0), managed(0), align(0), allocation_ptr(nullptr),
Missing separate debuginfos, use: zypper install krb5-debuginfo-1.19.2-150300.8.3.2.x86_64 libbz2-1-debuginfo-1.0.6-5.9.1.x86_64 libcom_err2-debuginfo-1.43.8-150000.4.33.1.x86_64 libcrypt1-debuginfo-4.4.15-150300.4.4.3.x86_64 libdrm2-debuginfo-2.4.104-1.12.x86_64 libdrm_amdgpu1-debuginfo-2.4.104-1.12.x86_64 libelf1-debuginfo-0.168-4.5.3.x86_64 libffi7-debuginfo-3.2.1.git259-10.8.x86_64 libidn2-0-debuginfo-2.2.0-3.6.1.x86_64 libkeyutils1-debuginfo-1.6.3-5.6.1.x86_64 libldap-2_4-2-debuginfo-2.4.46-150200.14.8.1.x86_64 liblzma5-debuginfo-5.2.3-150000.4.7.1.x86_64 libncurses6-debuginfo-6.1-5.9.1.x86_64 libnghttp2-14-debuginfo-1.40.0-6.1.x86_64 libnuma1-debuginfo-2.0.14.20.g4ee5e0c-10.1.x86_64 libopenssl1_1-debuginfo-1.1.1d-150200.11.51.1.x86_64 libpcre1-debuginfo-8.45-150000.20.13.1.x86_64 libselinux1-debuginfo-3.0-1.31.x86_64 libsqlite3-0-debuginfo-3.36.0-3.12.1.x86_64 libssh4-debuginfo-0.8.7-10.12.1.x86_64 libunistring2-debuginfo-0.9.10-1.1.x86_64 libyaml-0-2-debuginfo-0.1.7-1.17.x86_64 libz1-debuginfo-1.2.11-150000.3.30.1.x86_64

(gdb) bt
#0  0x00007ffdc24e2629 in hoomd::ManagedArray<float>::ManagedArray (this=<optimized out>) at /ccs/home/anderjo/devel/hoomd/hoomd/ManagedArray.h:34
#1  hoomd::hpmc::detail::PolyhedronVertices::PolyhedronVertices (this=<optimized out>) at /ccs/home/anderjo/devel/hoomd/hoomd/hpmc/ShapeConvexPolyhedron.h:47
#2  hoomd::hpmc::detail::FacetedEllipsoidParams::FacetedEllipsoidParams (this=<optimized out>) at /ccs/home/anderjo/devel/hoomd/hoomd/hpmc/ShapeFacetedEllipsoid.h:44
#3  hoomd::detail::managed_allocator<hoomd::hpmc::detail::FacetedEllipsoidParams>::allocate_construct_aligned (n=<optimized out>, use_device=<optimized out>, 
    align_size=<optimized out>, allocation_bytes=@0x7ffffffef528: 1472, allocation_ptr=@0x7ffffffef520: 0x3de44b0) at /ccs/home/anderjo/devel/hoomd/hoomd/managed_allocator.h:159
#4  0x00007ffdc24dec0e in hoomd::ManagedArray<hoomd::hpmc::detail::FacetedEllipsoidParams>::allocate (this=0x7ffffffef500)
    at /ccs/home/anderjo/devel/hoomd/hoomd/ManagedArray.h:291
#5  hoomd::ManagedArray<hoomd::hpmc::detail::FacetedEllipsoidParams>::operator= (this=0x7ffffffef500, other=...) at /ccs/home/anderjo/devel/hoomd/hoomd/ManagedArray.h:150
#6  hoomd::hpmc::detail::ShapeUnionParams<hoomd::hpmc::ShapeFacetedEllipsoid>::ShapeUnionParams (this=0x7ffffffef280, v=..., managed=false)
    at /ccs/home/anderjo/devel/hoomd/hoomd/hpmc/ShapeUnion.h:154
#7  0x00007ffdc29ab638 in hoomd::hpmc::IntegratorHPMCMono<hoomd::hpmc::ShapeUnion<hoomd::hpmc::ShapeFacetedEllipsoid> >::setShape (this=0x3e7d1c0, typ=..., v=...)
    at /ccs/home/anderjo/devel/hoomd/hoomd/hpmc/IntegratorHPMCMono.h:1348

Expected output

All tests pass.

Platform

GPU, Linux

Installation method

Compiled from source

HOOMD-blue version

3.8.1

Python version

3.9.13

joaander commented 1 year ago

I can fix this segfault by switching to the GNU programming environment:

module load PrgEnv-gnu

A memory issue remains in HPMC, however. Running the HPMC pytest suite ends with a GPU memory fault:

pytest/test_clusters.py::test_valid_setattr_attached[GPU-Sphinx-8-pivot_move_probability-0.2] PASSED
*Warning*: Falling back on CPU. No GPU implementation for shape.
pytest/test_clusters.py::test_valid_setattr_attached[GPU-Sphinx-8-pivot_move_probability-0.5] PASSED
*Warning*: Falling back on CPU. No GPU implementation for shape.
pytest/test_clusters.py::test_valid_setattr_attached[GPU-Sphinx-8-pivot_move_probability-0.8] PASSED
:0:rocdevice.cpp            :2614: 528277422298 us: 5289 : [tid:0x7fff2828f700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b
Fatal Python error: Aborted