glotzerlab / hoomd-blue

Molecular dynamics and Monte Carlo soft matter simulation on GPUs.
http://glotzerlab.engin.umich.edu/hoomd-blue
BSD 3-Clause "New" or "Revised" License

Segmentation fault for MPI operations on large data sizes #1895

Closed: mphoward closed this issue 2 days ago

mphoward commented 2 months ago

Description

MPCD segfaults for large system sizes.

Our working guess is that HOOMD's MPI helper methods, such as scatter_v and gather_v, overflow the MPI count (a signed int) when the data is serialized to bytes.
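As a rough sanity check of this guess, the arithmetic can be sketched with the values from the reproducer script (L = 128, density = 50). The 24 bytes per particle is an assumption (three double-precision components and no serialization overhead); the actual serialized size in HOOMD may differ.

```python
import ctypes

INT_MAX = 2**31 - 1  # largest value a signed 32-bit MPI count can hold

# Values taken from the reproducer script
L = 128
density = 50
n_particles = density * L**3

# Assumption: each particle vector is 3 doubles (24 bytes), with no
# extra serialization overhead
bytes_per_vector = 3 * 8
n_bytes = n_particles * bytes_per_vector

# What a signed 32-bit int would hold if the byte count were stored in it
wrapped = ctypes.c_int32(n_bytes).value

print(f"particles: {n_particles} (fits in int: {n_particles <= INT_MAX})")
print(f"bytes:     {n_bytes} (fits in int: {n_bytes <= INT_MAX})")
print(f"byte count wrapped to int32: {wrapped}")
```

Under these assumptions, the particle count itself fits in a signed int, but the byte count does not. That is consistent with the overflow appearing only once the data is counted in bytes.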

I propose registering MPI_Datatypes for the common data types that we pass to these methods and that don't need to be serialized (like Scalar3). I would register these types in the MPIConfiguration and provide getters to access them. Callers could then either invoke the MPI operations they want directly, or use helper methods that we provide for these types (likely in the MPIConfiguration class as well).

We should also, at minimum, add a check in the MPI helper methods and throw an exception if the serialized data is expected to overflow a signed int.
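The logic of such a check might look like the minimal sketch below. It is written in Python only to illustrate the idea. The real helpers are C++, and the name `checked_mpi_count` is hypothetical, not an existing HOOMD function.

```python
INT_MAX = 2**31 - 1  # maximum value of the signed int used for MPI counts


def checked_mpi_count(n_bytes):
    """Return n_bytes if it is a valid MPI count, otherwise raise.

    Hypothetical helper: illustrates the proposed guard, not HOOMD's API.
    """
    if n_bytes < 0:
        raise ValueError(f"Negative byte count: {n_bytes}")
    if n_bytes > INT_MAX:
        raise OverflowError(
            f"Serialized size of {n_bytes} bytes exceeds the maximum "
            f"MPI count of {INT_MAX}"
        )
    return n_bytes


# A small message passes through unchanged
print(checked_mpi_count(1024))

# A message the size of the reproducer's data (assuming 24 bytes per
# particle) is rejected with an exception instead of overflowing silently
try:
    checked_mpi_count(50 * 128**3 * 24)
except OverflowError as error:
    print(f"Rejected: {error}")
```

Raising before the MPI call turns the segmentation fault into an error message that tells the user the message is too large.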

A user reported that the script below segfaults when run on 48 CPUs.

Script

import hoomd
import numpy

device = hoomd.device.CPU()
simulation = hoomd.Simulation(device=device, seed=1)

snapshot = hoomd.Snapshot()
L = 128
density = 50
kT = 1

if snapshot.communicator.rank == 0:
    rng = numpy.random.default_rng(seed=42)
    snapshot.configuration.box = [L,L,L,0,0,0]
    snapshot.mpcd.types = ['A']
    snapshot.mpcd.N = int(density * L * L * L)
    snapshot.mpcd.position[:] = rng.uniform(low=-0.5*L,high=0.5*L,size=(snapshot.mpcd.N,3))

    velocity = rng.normal(0.0, numpy.sqrt(kT), (snapshot.mpcd.N, 3))
    velocity -= numpy.mean(velocity, axis=0)
    snapshot.mpcd.velocity[:] = velocity

simulation.create_state_from_snapshot(snapshot)

integrator = hoomd.mpcd.Integrator(dt=0.02)
integrator.collision_method = hoomd.mpcd.collide.StochasticRotationDynamics(
    period=1, angle=130, kT=kT
)

integrator.streaming_method = hoomd.mpcd.stream.Bulk(
    period=integrator.collision_method.period
)

integrator.mpcd_particle_sorter = hoomd.mpcd.tune.ParticleSorter(trigger=20)
simulation.operations.integrator = integrator

simulation.run(100)
device.notice(f'{simulation.tps}')

Input files

No response

Output

Segmentation fault

Expected output

No response

Platform

CPU, GPU, Linux

Installation method

Compiled from source

HOOMD-blue version

4.8.2

Python version

3.12

joaander commented 2 months ago

Thanks for thinking through this. I look forward to the pull request.