glotzerlab / hoomd-blue

Molecular dynamics and Monte Carlo soft matter simulation on GPUs.
http://glotzerlab.engin.umich.edu/hoomd-blue
BSD 3-Clause "New" or "Revised" License

CUDA memory errors in FIRE #588

Closed joaander closed 3 years ago

joaander commented 4 years ago

Description

The FIRE energy minimizer results in a CUDA illegal memory access error.

Script

# Include a minimal script that reproduces the problem
import hoomd
from hoomd import md
import numpy as np

hoomd.context.initialize("--mode=gpu --gpu=0 --gpu_error_checking")

system = hoomd.init.create_lattice(unitcell=hoomd.lattice.sc(a=0.9), n=40)  # a=2.0 does not give the error

dt = 0.0025
NeighborsListLJ = md.nlist.cell()
lj = hoomd.md.pair.lj(r_cut=2.5, nlist=NeighborsListLJ)
lj.pair_coeff.set('A', 'A', epsilon=1.0, sigma=1.0)
md.integrate.mode_standard(dt=dt)
integrator_nve = md.integrate.nve(group=hoomd.group.all())
for batch in range(100):
    # run NVE
    hoomd.run(1000)
    snap = system.take_snapshot()
    # run FIRE
    fire = hoomd.md.integrate.mode_minimize_fire(dt=dt, ftol=1e-5, Etol=1e-10)
    hoomd.run(100, quiet=True)  # we don't need to fully minimize to get the error
    system.restore_snapshot(snap)
    md.integrate.mode_standard(dt=dt)

Output

**ERROR**: an illegal memory access was encountered before /hoomd/md/FIREEnergyMinimizerGPU.cc:178
Traceback (most recent call last):                                       
  File "test.py", line 21, in <module>

Expected output

The script should run without producing an error.

Configuration

collins, Titan V (gpu 0)

Versions:

Developer

Any ideas, @jglaser ?

joaander commented 4 years ago

This script, reported on the mailing list (https://mail.google.com/mail/u/0/#label/hoomd-users/FMfcgxwJXxmLdMSQfBnXzNgGMjQnlzld), also triggers the error.

import hoomd
from hoomd import md
import numpy as np
import gc
import sys

seed = 0 
np.random.seed(seed)

def initialize_model():
    # Well-studied modified Kob-Andersen interactions. (The model is NOT important for the error).
    epsilonAA = 1.0; sigmaAA = 1.0; r_cutAA = 1.5
    epsilonAB = 1.5; sigmaAB = 0.8; r_cutAB = 2.0
    epsilonBB = 0.5; sigmaBB = 0.88; r_cutBB = 1.5

    nl = md.nlist.cell()
    lj = md.pair.force_shifted_lj(nlist=nl, r_cut=r_cutAA)
    lj.pair_coeff.set('A', 'A', sigma=sigmaAA, epsilon=epsilonAA, r_cut=r_cutAA)
    lj.pair_coeff.set('A', 'B', sigma=sigmaAB, epsilon=epsilonAB, r_cut=r_cutAB)
    lj.pair_coeff.set('B', 'B', sigma=sigmaBB, epsilon=epsilonBB, r_cut=r_cutBB)

# NVT parameters.
kT = 0.05
dt = 0.005
time_steps_nvt = 1   # Curiously, the error is sensitive to this value in combination with dt.
tau_thermostat = 50 * dt

# FIRE parameters.
dt_fire = 0.1 * dt

# For creating an FCC lattice.
# Density ends up being 1.2, a well studied value for this model.
ncells = 20     # Does NOT happen for smaller systems (ncells = 10).
lattice_constant = 1.415030791462591

for serial in range(100):
    print("starting with %d" % serial)
    hoomd.context.initialize("--mode=gpu --notice-level=1 --gpu_error_checking")
    # HOOMD-blue is running on the following GPU(s):
    # [0]          Quadro P2000   8 SM_6.1 @ 1.48 GHz, 5044 MiB DRAM, DIS, MNG
    # 
    # But it also happens on
    #
    # HOOMD-blue is running on the following GPU(s):
    # [0]    GeForce GTX 780 Ti  15 SM_3.5 @ 0.928 GHz, 3021 MiB DRAM

    system = hoomd.init.create_lattice(unitcell=hoomd.lattice.fcc(a=lattice_constant), n=[ncells,ncells,ncells])

    initialize_model()

    # Create a concentration of vacancies.
    num_to_remove = 4800
    N_orig = 4*ncells**3
    tags = np.arange(N_orig)
    np.random.shuffle(tags)
    tags_to_remove = tags[:num_to_remove]
    for i in range(num_to_remove):
        system.particles.remove(tags_to_remove[i])

    # NVT.
    md.integrate.mode_standard(dt=dt)
    nvt = hoomd.md.integrate.nvt(group=hoomd.group.all(), kT=kT, tau=tau_thermostat)
    hoomd.run(time_steps_nvt, quiet=False)
    nvt.disable()

    # Dump the state just before the quench.
    state_file_before_quench = "before_quench.gsd"
    dump = hoomd.dump.gsd(state_file_before_quench, period=None, group=hoomd.group.all(), overwrite=True)
    # Reinitialize.
    gc.collect()
    hoomd.context.initialize("--mode=gpu --notice-level=1 --gpu_error_checking")
    snapshot = hoomd.data.gsd_snapshot(state_file_before_quench, frame=0)
    system = hoomd.init.read_snapshot(snapshot)
    initialize_model()

    # FIRE.
    fire = hoomd.md.integrate.mode_minimize_fire(dt=dt_fire, ftol=1e-3, Etol=1)
    nve = hoomd.md.integrate.nve(group=hoomd.group.all())
    hoomd.run(1, quiet=False) # Time steps don't matter for the error.

When run in cuda-gdb, I get this:

[Switching focus to CUDA kernel 99, grid 2133, block (0,0,0), thread (64,0,0), device 0, sm 0, warp 0, lane 0]
0x0000000004f586d0 in gpu_fire_reduce_partial_sum_kernel<<<(1,1,1),(256,1,1)>>> ()

joaander commented 4 years ago

There are several errors in the FIRE code:

1. FIRE does not account for a changing number of particles.
2. FIRE determines the size of m_partial_sum1 at construction, before any integration methods are defined.
3. gpu_fire_reduce_partial_sum_kernel<<< grid, threads should be gpu_fire_reduce_partial_sum_kernel<<< dim3(1,1,1), threads.

(2) is the root cause of these memory errors. It appears that this code has not been functioning correctly.
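For illustration only, here is a minimal Python analogy of bug (2), not the actual C++ implementation. A reduction helper that sizes its partial-sum buffer once at construction overflows (here, raises an error; on the GPU, an illegal memory access) as soon as the number of values to reduce grows past the original allocation, while a helper that resizes the buffer before each reduction stays correct. The block size of 256 mirrors the (256,1,1) thread block visible in the cuda-gdb output above; the class and function names are hypothetical.

```python
import numpy as np

BLOCK = 256  # threads per block, mirroring the GPU kernel's launch configuration

def num_blocks(n):
    # one partial sum per block of BLOCK values
    return (n + BLOCK - 1) // BLOCK

class BrokenReducer:
    """Sizes the partial-sum buffer once, at construction (the bug)."""
    def __init__(self, n_values):
        self.partial = np.zeros(num_blocks(n_values))

    def reduce(self, values):
        blocks = num_blocks(len(values))
        if blocks > len(self.partial):
            # on the GPU this out-of-bounds write is the illegal memory access
            raise IndexError("partial-sum buffer too small")
        for b in range(blocks):
            self.partial[b] = values[b * BLOCK:(b + 1) * BLOCK].sum()
        return self.partial[:blocks].sum()

class FixedReducer:
    """Resizes the buffer to the current value count before each reduction."""
    def __init__(self):
        self.partial = np.zeros(0)

    def reduce(self, values):
        blocks = num_blocks(len(values))
        if blocks > len(self.partial):
            self.partial = np.zeros(blocks)
        for b in range(blocks):
            self.partial[b] = values[b * BLOCK:(b + 1) * BLOCK].sum()
        return self.partial[:blocks].sum()
```

Constructing BrokenReducer for 256 values and then reducing 512 of them fails, which matches the symptom: the reproducers only crash once the system is large enough (or particles are added) that the reduction needs more blocks than the buffer sized at construction can hold.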

joaander commented 4 years ago

These issues should be fixed in the first v3.0 beta release that includes FIRE.