Closed: joaander closed this issue 3 years ago
The following script, reported on the hoomd-users mailing list (https://mail.google.com/mail/u/0/#label/hoomd-users/FMfcgxwJXxmLdMSQfBnXzNgGMjQnlzld), also triggers the error.
```python
import hoomd
from hoomd import md
import numpy as np
import gc
import sys

seed = 0
np.random.seed(seed)

def initialize_model():
    # Well-studied modified Kob-Andersen interactions. (The model is NOT important for the error.)
    epsilonAA = 1.0; sigmaAA = 1.0;  r_cutAA = 1.5
    epsilonAB = 1.5; sigmaAB = 0.8;  r_cutAB = 2.0
    epsilonBB = 0.5; sigmaBB = 0.88; r_cutBB = 1.5
    nl = md.nlist.cell()
    lj = md.pair.force_shifted_lj(nlist=nl, r_cut=r_cutAA)
    lj.pair_coeff.set('A', 'A', sigma=sigmaAA, epsilon=epsilonAA, r_cut=r_cutAA)
    lj.pair_coeff.set('A', 'B', sigma=sigmaAB, epsilon=epsilonAB, r_cut=r_cutAB)
    lj.pair_coeff.set('B', 'B', sigma=sigmaBB, epsilon=epsilonBB, r_cut=r_cutBB)

# NVT parameters.
kT = 0.05
dt = 0.005
time_steps_nvt = 1  # The error is sensitive to this value, in combination with dt, very curiously.
tau_thermostat = 50 * dt

# FIRE parameters.
dt_fire = 0.1 * dt

# For creating an FCC lattice.
# Density ends up being 1.2, a well-studied value for this model.
ncells = 20  # Does NOT happen for smaller systems (ncells = 10).
lattice_constant = 1.415030791462591

for serial in range(100):
    print("starting with %d" % serial)
    hoomd.context.initialize("--mode=gpu --notice-level=1 --gpu_error_checking")
    # HOOMD-blue is running on the following GPU(s):
    # [0] Quadro P2000 8 SM_6.1 @ 1.48 GHz, 5044 MiB DRAM, DIS, MNG
    #
    # But it also happens on
    #
    # HOOMD-blue is running on the following GPU(s):
    # [0] GeForce GTX 780 Ti 15 SM_3.5 @ 0.928 GHz, 3021 MiB DRAM
    system = hoomd.init.create_lattice(unitcell=hoomd.lattice.fcc(a=lattice_constant),
                                       n=[ncells, ncells, ncells])
    initialize_model()

    # Create a concentration of vacancies.
    num_to_remove = 4800
    N_orig = 4 * ncells**3
    tags = np.arange(N_orig)
    np.random.shuffle(tags)
    tags_to_remove = tags[:num_to_remove]
    for i in range(num_to_remove):
        system.particles.remove(tags_to_remove[i])

    # NVT.
    md.integrate.mode_standard(dt=dt)
    nvt = hoomd.md.integrate.nvt(group=hoomd.group.all(), kT=kT, tau=tau_thermostat)
    hoomd.run(time_steps_nvt, quiet=False)
    nvt.disable()

    # Dump state just before quench.
    state_file_before_quench = "before_quench.gsd"
    dump = hoomd.dump.gsd(state_file_before_quench, period=None,
                          group=hoomd.group.all(), overwrite=True)

    # Reinitialize.
    gc.collect()
    hoomd.context.initialize("--mode=gpu --notice-level=1 --gpu_error_checking")
    snapshot = hoomd.data.gsd_snapshot(state_file_before_quench, frame=0)
    system = hoomd.init.read_snapshot(snapshot)
    initialize_model()

    # FIRE.
    fire = hoomd.md.integrate.mode_minimize_fire(dt=dt_fire, ftol=1e-3, Etol=1)
    nve = hoomd.md.integrate.nve(group=hoomd.group.all())
    hoomd.run(1, quiet=False)  # Time steps don't matter for the error.
```
When run in cuda-gdb, I get this:

```
[Switching focus to CUDA kernel 99, grid 2133, block (0,0,0), thread (64,0,0), device 0, sm 0, warp 0, lane 0]
0x0000000004f586d0 in gpu_fire_reduce_partial_sum_kernel<<<(1,1,1),(256,1,1)>>> ()
```
There are several errors in the FIRE code:

1) FIRE does not account for a changing number of particles.
2) FIRE determines the size of `m_partial_sum1` at construction time, before any integration methods are defined.
3) `gpu_fire_reduce_partial_sum_kernel<<< grid, threads` should be `gpu_fire_reduce_partial_sum_kernel<<< dim3(1,1,1), threads`.

(2) is the root cause of these memory errors. It appears that this code path has never functioned correctly.
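The failure mode of (2) can be sketched in miniature: if a partial-sum buffer is sized once, from whatever particle count exists at construction, then any later reduction over a larger group writes past the end of the buffer. The class and names below are a simplified illustration, not HOOMD's actual implementation; on the GPU the equivalent out-of-bounds write surfaces as an illegal memory access.

```python
import numpy as np

BLOCK_SIZE = 256  # threads per block in the reduction kernel (illustrative)

class FireSketch:
    """Toy model of a block-wise partial-sum reduction with a fixed buffer."""

    def __init__(self, n_particles):
        # Buffer sized ONCE at construction -- the analogue of m_partial_sum1
        # being allocated before the integration group is known.
        n_blocks = (n_particles + BLOCK_SIZE - 1) // BLOCK_SIZE
        self.partial_sums = np.zeros(n_blocks)

    def reduce(self, values):
        # Each "block" writes one partial sum. If the particle count grew
        # after construction, the block index runs past the buffer end.
        n_blocks = (len(values) + BLOCK_SIZE - 1) // BLOCK_SIZE
        for b in range(n_blocks):
            self.partial_sums[b] = values[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE].sum()
        return self.partial_sums[:n_blocks].sum()

fire = FireSketch(n_particles=256)   # buffer holds exactly 1 partial sum
ok = fire.reduce(np.ones(256))       # 1 block: fits, returns 256.0
try:
    fire.reduce(np.ones(1024))       # 4 blocks: write to index 1 overflows
except IndexError as exc:
    print("out-of-bounds write:", exc)
```

The fix implied by the analysis is to (re)size the buffer from the current group size whenever the integration methods or particle count change, rather than once at construction.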
These issues should be fixed in the first v3.0 beta release that includes FIRE.
Description
The FIRE energy minimizer results in a CUDA illegal memory access error.
Script
Output
Expected output
The script should run without producing an error.
Configuration
collins, Titan V (gpu 0)
Versions:
Developer
Any ideas, @jglaser?