cudaErrorMemoryAllocation with KOKKOS after long time

We just got a machine with P100 cards, and I was planning to run some long silicon carbon simulations (1 μs+) over the summer with vashishta/kk. I run every simulation on a single P100 with the 20June17 version of LAMMPS, gcc 5.4.0 and CUDA 8.0.

After a long time (more than 1 million timesteps, though the exact number varies between runs), the simulations crash with this error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaCreateTextureObject( & tex_obj , & resDesc, & texDesc, NULL ) error( cudaErrorMemoryAllocation): out of memory ../../lib/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:296
Traceback functionality not available

[bigfacet:07706] *** Process received signal ***
[bigfacet:07706] Signal: Aborted (6)
[bigfacet:07706] Signal code:  (-6)
[bigfacet:07706] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f18b685f390]
[bigfacet:07706] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f18b59f6428]
[bigfacet:07706] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f18b59f802a]
[bigfacet:07706] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x16d)[0x7f18b655b84d]
[bigfacet:07706] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d6b6)[0x7f18b65596b6]
[bigfacet:07706] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d701)[0x7f18b6559701]
[bigfacet:07706] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d919)[0x7f18b6559919]
[bigfacet:07706] [ 7] lmp_kokkos_cuda_openmpi[0xf791f3]
[bigfacet:07706] [ 8] lmp_kokkos_cuda_openmpi[0xf81750]
[bigfacet:07706] [ 9] lmp_kokkos_cuda_openmpi[0xf7bf9b]
[bigfacet:07706] [10] lmp_kokkos_cuda_openmpi[0x4e3d5c]
[bigfacet:07706] [11] lmp_kokkos_cuda_openmpi[0xf23356]
[bigfacet:07706] [12] lmp_kokkos_cuda_openmpi[0xf03c9d]
[bigfacet:07706] [13] lmp_kokkos_cuda_openmpi[0x929761]
[bigfacet:07706] [14] lmp_kokkos_cuda_openmpi[0x57ce79]
[bigfacet:07706] [15] lmp_kokkos_cuda_openmpi[0x57b13f]
[bigfacet:07706] [16] lmp_kokkos_cuda_openmpi[0x57bca7]
[bigfacet:07706] [17] lmp_kokkos_cuda_openmpi[0x422106]
[bigfacet:07706] [18] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f18b59e1830]
[bigfacet:07706] [19] lmp_kokkos_cuda_openmpi[0x425b69]
[bigfacet:07706] *** End of error message ***

I suspected the vashishta implementation, so I also ran the sw benchmark (modified version attached), which crashed as well, after 21 million timesteps.

I'm not sure how to debug this. Since the crash happens in cudaCreateTextureObject, I suspect it occurs while allocating the short neighbor list (or possibly the regular neighbor list), which uses textures because of its random access pattern.

I then noticed that we reallocate the short neighbor list on every timestep where nlocal+nghosts changes (which is quite often; after every neighbor list build, I suppose). That probably isn't needed. I'm currently running a simulation where I reallocate the short neighbor list only when it is smaller than required. If that run does not crash, I'll be closer to understanding the problem.
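
To make the "only reallocate when too small" idea concrete, here is a rough sketch in plain C++. It is not the actual LAMMPS/Kokkos code: `ShortNeighList`, `ensure_capacity` and `realloc_count` are hypothetical names, and a `std::vector` stands in for the device view. It also assumes the list is fully rebuilt before use, so old contents do not need to survive a reallocation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the short neighbor list storage.
// A std::vector models the device allocation; this is not the Kokkos type.
struct ShortNeighList {
    std::vector<int> data;        // flat storage: rows * max_neighs entries
    std::size_t rows = 0;         // allocated row capacity (atoms)
    std::size_t max_neighs = 0;   // allocated column capacity (neighbors per atom)
    int realloc_count = 0;        // number of real reallocations performed

    // Guarantee room for at least needed_rows x needed_neighs entries.
    // A new buffer is allocated only if the current one is too small in
    // either dimension; a shrinking or unchanged request is a no-op.
    void ensure_capacity(std::size_t needed_rows, std::size_t needed_neighs) {
        if (needed_rows <= rows && needed_neighs <= max_neighs) {
            return;  // existing buffer is already large enough, reuse it
        }
        // Never shrink a dimension that is already large enough.
        if (needed_rows > rows) rows = needed_rows;
        if (needed_neighs > max_neighs) max_neighs = needed_neighs;

        // Assumption: contents are rebuilt by the caller, so they are not copied.
        data = std::vector<int>(rows * max_neighs);
        ++realloc_count;

        assert(data.size() == rows * max_neighs);
    }
};
```

With this policy, a sequence of requests such as 100, 90, 100, 120 rows triggers two allocations instead of four, because only the first request and the growth to 120 exceed the current capacity.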

nvidia-smi does not show any noticeable increase in memory usage during the simulation. Could it be a CUDA bug where heavy reallocation somehow fragments memory?

Any ideas?

kokkos_memory_bug.zip
