lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

device_pinned_malloc fails consistently on Jureca #527

Closed kostrzewa closed 7 years ago

kostrzewa commented 7 years ago

On Jureca, for some reason, I see device_pinned_malloc failing consistently like so:

# QUDA: ERROR: Failed to allocate device memory (cuda_color_spinor_field.cu:605 in allocateGhostBuffer())
# QUDA: ERROR: Aborting (rank 0, host jrc0029, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x24,aux=vol=165888,stride=165888,precision=8,TwistFlavour=-1,inver

I've increased the number of devices used to make sure that this is not just an "out of memory" issue, but a 48^3 lattice should fit without problems on 16 K80s... Have you observed anything similar elsewhere?

kostrzewa commented 7 years ago

The issue persists for Wilson quarks:

# QUDA: Creating a DiracWilsonPC operator
# QUDA: Creating a DiracWilsonPC operator
# QUDA: Creating a DiracWilsonPC operator
# QUDA: Tuned block=(32,1,1), grid=(1,1,1), shared=0,  giving 0.00 Gflop/s, 7.68 GB/s for cudaMemcpyHostToDevice with loadSpinorField,cuda_color_spinor_field.cu:530
# QUDA: Tuned block=(32,1,1), shared=0,  giving 0.00 Gflop/s, 106.95 GB/s for N4quda10PackSpinorIddLi4ELi3ENS_11colorspinor11FloatNOrderIdLi4ELi3ELi2EEENS1_21SpaceSpinorColorOrderIdLi4ELi3EEENS_19ChiralToNonRelBasisIddLi4ELi3EEEEE with out_stride=663552,in_stride=663552
# QUDA: Tuned block=(1024,1,1), grid=(26,2,1), shared=0,  giving 44.01 Gflop/s, 176.03 GB/s for N4quda4blas5Norm2Id7double2S2_EE with vol=1327104,stride=663552,precision=8,TwistFlavour=-1
# QUDA: Source: CPU = 2.5478e+08, CUDA copy = 2.5478e+08
# QUDA: Solution: CPU = 0, CUDA copy = 0
# QUDA: ERROR: Failed to allocate device memory (cuda_color_spinor_field.cu:605 in allocateGhostBuffer())
# QUDA: ERROR: Aborting (rank 0, host jrc0002, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: ERROR: Aborting (rank 4, host jrc0003, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda4blas5Norm2Id7double2S2_EE,volume=24x48x48x24,aux=vol=1327104,stride=663552,precision=8,TwistFlavour=-1)
# QUDA: ERROR: Aborting (rank 2, host jrc0002, malloc.cpp:195 in device_pinned_malloc_())
mathiaswagner commented 7 years ago

Can you share the QUDA, nvcc, and CUDA driver versions for reference? Can this be reproduced with one of the QUDA internal tests?

kostrzewa commented 7 years ago

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sun_Sep__4_22:14:01_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44
Mon Nov 28 16:34:50 2016       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:06:00.0     Off |                    0 |
| N/A   26C    P8    26W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:07:00.0     Off |                    0 |
| N/A   31C    P8    28W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 0000:86:00.0     Off |                    0 |
| N/A   31C    P8    26W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 0000:87:00.0     Off |                    0 |
| N/A   26C    P8    29W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

QUDA develop branch, latest commit:

commit 5dc81648f268a8c9c400619780dab82ac90d7c3d
Merge: a04212e e564324
Author: maddyscientist <mclark@nvidia.com>
Date:   Mon Nov 21 14:52:04 2016 -0700

    Merge pull request #525 from lattice/hotfix/p2p_init

    remove spurious check for compute mode in peer2peer initialization

I'll try to get the QUDA test to run...

kostrzewa commented 7 years ago

Using invert_test without loading a gauge configuration, I get a non-converging solver with the following parameters, but I can't reproduce the crash:

--prec double --prec-sloppy single --recon 18 --recon-sloppy 18 --inv-type cg --sdim 48 --tdim 24 --tgridsize 4 --tune true --niter 5000 --tolhq 0 --dslash-type twisted-mass --flavor plus --mass -0.802271666 --tol 1e-6

However, there is a difference between the inversions: when run through tmLQCD there is an additional basis change (N4quda10PackSpinorIddLi4ELi3ENS_11colorspinor11FloatNOrderIdLi4ELi3ELi2EEENS1_21SpaceSpinorColorOrderIdLi4ELi3EEENS_19ChiralToNonRelBasisIddLi4ELi3EEEEE), since we use QUDA_CHIRAL_GAMMA_BASIS. Not sure if that could have something to do with it...

maddyscientist commented 7 years ago

@kostrzewa can you try the branch feature/memory-pool? I have fixed an uninitialized variable that has gone unnoticed for too long (remnant code from experiments in overlapping domains that I hadn't appreciated before) and that could be the cause of this.

kostrzewa commented 7 years ago

@maddyscientist The job is in the queue. I found it a bit surprising, though, that 4 nodes (16 GPUs with 12 GB of memory each) are seemingly insufficient for a 96*48^3 twisted-clover lattice...

# QUDA: time spent in reorder_spinor_toQuda: 0.087253 secs
# QUDA: ERROR: Failed to allocate device memory of size 9289728 (cuda_color_spinor_field.cu:605 in allocateGhostBuffer())
# QUDA: ERROR: Aborting (rank 0, host jrc0020, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 2, host jrc0020, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 3, host jrc0020, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 1, host jrc0020, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 9, host jrc0030, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 15, host jrc0032, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 5, host jrc0028, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 13, host jrc0032, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 10, host jrc0030, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 12, host jrc0032, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 11, host jrc0030, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 14, host jrc0032, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 8, host jrc0030, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 4, host jrc0028, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 6, host jrc0028, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 7, host jrc0028, malloc.cpp:195 in device_pinned_malloc_())
# QUDA:        last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)

I'm trying on 8 nodes now, but the machine is very busy and it will take a while until the job runs.
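
For reference, the back-of-envelope estimate behind that surprise, as a rough sketch only: it assumes double precision throughout and standard twisted-clover field sizes, and it ignores QUDA's ghost buffers, temporaries and any memory-pool overhead, so the true footprint will be somewhat larger.

#include <cstdio>

// Rough per-GPU memory estimate for a 96 x 48^3 twisted-clover CG solve on 16
// GPUs, double precision only. Ghost zones, temporaries and the memory-pool
// overhead are deliberately ignored, so the real footprint is somewhat larger.
int main() {
  const long long sites = 96LL * 48 * 48 * 48 / 16;    // lattice sites per GPU
  const double GiB = 1024.0 * 1024.0 * 1024.0;

  const double spinor = sites * 24 * 8 / GiB;     // 4 spin x 3 colour complex doubles
  const double gauge  = sites * 4 * 18 * 8 / GiB; // 4 links of 3x3 complex doubles
  const double clover = sites * 72 * 8 / GiB;     // two 6x6 Hermitian chiral blocks

  // Clover term plus its inverse, plus roughly ten full spinors for the CG
  // work vectors and their sloppy-precision copies (a generous guess).
  const double total = gauge + 2 * clover + 10 * spinor;
  std::printf("gauge %.2f GiB  clover %.2f GiB  spinor %.2f GiB  total ~%.2f GiB per GPU\n",
              gauge, clover, spinor, total);
  return 0;
}

Even with a generous count of CG work vectors this comes to only a few GiB per GPU, well below the ~12 GiB of a K80 die, which is why the allocation failure surprised me.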

maddyscientist commented 7 years ago

One side effect of using the pool memory manager is that memory consumption can increase a bit; it's possible you're running afoul of this. If you set the environment variable QUDA_ENABLE_DEVICE_MEMORY_POOL=0 you can disable the device pool to see whether this is the cause.

kostrzewa commented 7 years ago

In the first test, I set all the memory pool environment variables that you mentioned in the pull request to 0 and disabled tuning.

maddyscientist commented 7 years ago

Well there goes that theory......

kostrzewa commented 7 years ago

I've been able to bisect the commit log of the develop branch and have found a point in history at which all the twisted mass issues seem to go away. Essentially, anything later than 017870e48532a3281f33516af209c5cf2515aad6 makes it impossible for us to use QUDA on Jureca right now. In particular, the merge of lattice/hotfix/qdpjit-p2p (bd33550) seems to break the code on this machine.

My first thought was that the process pinning done by SLURM is to blame, but since invert_test with a random gauge field works, I doubt that SLURM is the culprit. (I haven't tested with a real gauge field because I don't have QUDA compiled with QIO and all that jazz.) We diverge from how invert_test calls the solver in that we use QUDA_CHIRAL_GAMMA_BASIS and we have to remap the comms grid because of our TXYZ ordering rather than ZYXT; maybe these two things explain what's going on...
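
For reference, the remap is essentially a rank-from-coordinates callback handed to QUDA at initialisation. Below is a minimal sketch of the idea only: the x,y,z,t ordering of coords/dims, the fdata layout and the particular lexicographic convention are assumptions for illustration, not the actual tmLQCD code.

// Illustrative sketch of a comms-grid remap for a host code whose process-grid
// ordering differs from QUDA's default. Assumes QUDA's QudaCommsMap callback
// convention from quda.h; the concrete ordering below (t slowest, x fastest)
// is only an example and may not match what tmLQCD actually needs.
#include <quda.h>

static int rank_from_coords(const int *coords, void *fdata) {
  const int *dims = static_cast<const int *>(fdata);  // process grid, x,y,z,t order
  int rank = coords[3];                 // t
  rank = rank * dims[2] + coords[2];    // z
  rank = rank * dims[1] + coords[1];    // y
  rank = rank * dims[0] + coords[0];    // x
  return rank;
}

void setup_comms(int gridsize[4]) {     // gridsize in x,y,z,t order (assumption)
  initCommsGridQuda(4, gridsize, rank_from_coords, gridsize);
}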

kostrzewa commented 7 years ago

I haven't yet had the chance to test on another machine; hopefully I'll get around to it soon, but we have some compiler updates to do before I can get the latest commits compiled.

kostrzewa commented 7 years ago

I may have rejoiced a bit too soon, because the rate of convergence seems to depend quite strongly on the number of MPI processes with the develop branch at 017870e48532a3281f33516af209c5cf2515aad6. Going from 4 to 8 nodes (16 to 32 K80 GPUs), I see an increase of about 10% in the number of iterations.

kostrzewa commented 7 years ago

The effect might be independent of the parallelisation, however, because I also see strong fluctuations in the number of iterations to convergence when inverting on a number of different stochastic time-slice (wall) sources (on the same configuration).

# QUDA: CG: Convergence at 16112 iterations, L2 relative residual: iterated = 3.153934e-10, true = 3.153934e-10
# QUDA: CG: Convergence at 16301 iterations, L2 relative residual: iterated = 3.143852e-10, true = 3.143852e-10
# QUDA: CG: Convergence at 16213 iterations, L2 relative residual: iterated = 3.145130e-10, true = 3.145130e-10
# QUDA: CG: Convergence at 16063 iterations, L2 relative residual: iterated = 3.161826e-10, true = 3.161826e-10
# QUDA: CG: Convergence at 16033 iterations, L2 relative residual: iterated = 3.161993e-10, true = 3.161993e-10
# QUDA: CG: Convergence at 21126 iterations, L2 relative residual: iterated = 3.151072e-10, true = 3.151072e-10
# QUDA: CG: Convergence at 16264 iterations, L2 relative residual: iterated = 3.160960e-10, true = 3.160960e-10
# QUDA: CG: Convergence at 16270 iterations, L2 relative residual: iterated = 3.161078e-10, true = 3.161078e-10
# QUDA: CG: Convergence at 16289 iterations, L2 relative residual: iterated = 3.159118e-10, true = 3.159118e-10
# QUDA: CG: Convergence at 16170 iterations, L2 relative residual: iterated = 3.161810e-10, true = 3.161810e-10
# QUDA: CG: Convergence at 17296 iterations, L2 relative residual: iterated = 3.152252e-10, true = 3.152252e-10
# QUDA: CG: Convergence at 16437 iterations, L2 relative residual: iterated = 3.161335e-10, true = 3.161335e-10
maddyscientist commented 7 years ago

@kostrzewa I don't think those are large fluctuations (one comment up); they are to be expected with a solver like conjugate gradient (it would be much worse with BiCGStab).

On the other hand, to test whether the fluctuation you are seeing when going from 4 to 8 nodes is significant, you should keep the process count constant and rerun the test with a fresh tune cache. There are necessarily fluctuations arising from the auto-tuner picking different block sizes in the reduction steps, which leads to changes in iteration count when the process topology changes. Alternatively, you could test at 4 and 8 nodes (for example) but disable the autotuner for both runs. This will not make the solver reproducible when changing the process count (e.g., due to variations in the order of the dslash summation at each site on the boundary), but it would reveal whether the auto-tuning of the block size is the main driver of the fluctuations.

Long term, one thing I'd like to include as an option is the ability to do infinite-precision reductions, which would remove this reduction block-size variability completely (e.g., http://dx.doi.org/10.1016/j.parco.2015.09.001). I note that QUDA already has an option to do the reduction in emulated quad precision, which damps this issue (though that code needs a bit of a cleanup).
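
As a toy illustration of the block-size effect (not QUDA code): merely changing how a double-precision sum is blocked changes the rounding, which is effectively what a retuned reduction block size does inside the solver.

// Toy example: the same data summed with two different "block sizes" gives
// slightly different double-precision results. A retuned reduction block size
// shifts the rounding in the same way, which nudges the CG iteration count.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

double blocked_sum(const std::vector<double> &v, std::size_t block) {
  double total = 0.0;
  for (std::size_t i = 0; i < v.size(); i += block) {
    double partial = 0.0;
    for (std::size_t j = i; j < std::min(i + block, v.size()); j++) partial += v[j];
    total += partial;   // accumulate block-wise partial sums
  }
  return total;
}

int main() {
  std::vector<double> v(1 << 20);
  for (std::size_t i = 0; i < v.size(); i++) v[i] = 1.0 + 1e-3 * std::sin(0.1 * i);
  std::printf("block  32: %.17g\n", blocked_sum(v, 32));
  std::printf("block 256: %.17g\n", blocked_sum(v, 256));
  return 0;
}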

kostrzewa commented 7 years ago

@maddyscientist

I don't think those are large fluctuations (one comment up), and are to be expected with a solver like conjugate gradient (would be much worse with BiCGStab).

They are significantly larger than I'm used to with twisted clover fermions. Fluctuations of O(5-10%) do occur, but the 21126 iterations seen in one instance are somewhat surprising. I understand that the auto-tuner can induce very different rounding behaviour between block sizes.

I will try to investigate this a bit more if I get a chance; for now it seems to be working with the aforementioned commit from before pinned_malloc was added for QDPJIT.

Thanks a lot for the reference. In tmLQCD, we make use of Kahan summations in many places. It might be worth replacing these with superaccumulators...
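
For concreteness, the kind of compensated summation I mean is sketched below: a generic Kahan sum, not the actual tmLQCD code.

// Generic Kahan (compensated) summation: the compensation term c recovers the
// low-order bits lost in each addition. A superaccumulator would go further
// and accumulate into a wide fixed-point buffer, making the result exactly
// independent of the summation order.
#include <vector>

double kahan_sum(const std::vector<double> &v) {
  double sum = 0.0, c = 0.0;
  for (double x : v) {
    double y = x - c;         // apply the stored correction
    double t = sum + y;
    c = (t - sum) - y;        // what was just lost to rounding
    sum = t;
  }
  return sum;
}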

kostrzewa commented 7 years ago

Just a comment before the hackathon: this may have relevance to the weird behaviour we observed on Jureca.

*          ### Important information for users with multi-GPU jobs ###         *
*                                                                              *
* An error was found in the implementation of the mechanism that regulates     *
* access to GPU devices. Users of multiple GPUs per node may faces problems.   *
* A bug fix is in the making. In the meantime a workaround is available:       *
* Please use "srun [srun arguments] fix-gpu-jail [app.] [app. args]" instead   *
* of "srun [srun arguments] [app.] [app. args]" to start your application.     *
* We apologize for the inconvenience.                                          *
*                                                                              *
*                                                                   2017-02-24 *
mathiaswagner commented 7 years ago

Thanks for sharing. Should also be posted to Slack.

maddyscientist commented 7 years ago

Let's hope that solves all Jureca issues!

kostrzewa commented 7 years ago

During the hackathon I was able to find a setup which seems to work. I've documented the build process here: https://github.com/etmc/tmLQCD/wiki/tmLQCD---QUDA

I still need to check the reliability issues that I observed with a particular workload for sequential propagators with twisted boundary conditions, but it looks as though I'll be able to close this soon.

maddyscientist commented 7 years ago

Closing this issue, since this now seems to be resolved.