Closed kostrzewa closed 7 years ago
The issue persists for Wilson quarks:
# QUDA: Creating a DiracWilsonPC operator
# QUDA: Creating a DiracWilsonPC operator
# QUDA: Creating a DiracWilsonPC operator
# QUDA: Tuned block=(32,1,1), grid=(1,1,1), shared=0, giving 0.00 Gflop/s, 7.68 GB/s for cudaMemcpyHostToDevice with loadSpinorField,cuda_color_spinor_field.cu:530
# QUDA: Tuned block=(32,1,1), shared=0, giving 0.00 Gflop/s, 106.95 GB/s for N4quda10PackSpinorIddLi4ELi3ENS_11colorspinor11FloatNOrderIdLi4ELi3ELi2EEENS1_21SpaceSpinorColorOrderIdLi4ELi3EEENS_19ChiralToNonRelBasisIddLi4ELi3EEEEE with out_stride=663552,in_stride=663552
# QUDA: Tuned block=(1024,1,1), grid=(26,2,1), shared=0, giving 44.01 Gflop/s, 176.03 GB/s for N4quda4blas5Norm2Id7double2S2_EE with vol=1327104,stride=663552,precision=8,TwistFlavour=-1
# QUDA: Source: CPU = 2.5478e+08, CUDA copy = 2.5478e+08
# QUDA: Solution: CPU = 0, CUDA copy = 0
# QUDA: ERROR: Failed to allocate device memory (cuda_color_spinor_field.cu:605 in allocateGhostBuffer())
# QUDA: ERROR: Aborting (rank 0, host jrc0002, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: ERROR: Aborting (rank 4, host jrc0003, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda4blas5Norm2Id7double2S2_EE,volume=24x48x48x24,aux=vol=1327104,stride=663552,precision=8,TwistFlavour=-1)
# QUDA: ERROR: Aborting (rank 2, host jrc0002, malloc.cpp:195 in device_pinned_malloc_())
Can you share the QUDA, nvcc, and CUDA driver versions for reference? Can this be reproduced with one of the QUDA internal tests?
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sun_Sep__4_22:14:01_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44
Mon Nov 28 16:34:50 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48 Driver Version: 367.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:06:00.0 Off | 0 |
| N/A 26C P8 26W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 0000:07:00.0 Off | 0 |
| N/A 31C P8 28W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 On | 0000:86:00.0 Off | 0 |
| N/A 31C P8 26W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 On | 0000:87:00.0 Off | 0 |
| N/A 26C P8 29W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
QUDA develop branch, latest commit:
commit 5dc81648f268a8c9c400619780dab82ac90d7c3d
Merge: a04212e e564324
Author: maddyscientist <mclark@nvidia.com>
Date: Mon Nov 21 14:52:04 2016 -0700
Merge pull request #525 from lattice/hotfix/p2p_init
remove spurious check for compute mode in peer2peer initialization
I'll try to get the QUDA test to run...
Using invert_test without loading a gauge configuration, I get a non-converging solver with the following parameters, but I can't reproduce the crash:
--prec double --prec-sloppy single --recon 18 --recon-sloppy 18 --inv-type cg --sdim 48 --tdim 24 --tgridsize 4 --tune true --niter 5000 --tolhq 0 --dslash-type twisted-mass --flavor plus --mass -0.802271666 --tol 1e-6
However, there is a difference between the inversions: when run through tmLQCD, there is an additional basis change (N4quda10PackSpinorIddLi4ELi3ENS_11colorspinor11FloatNOrderIdLi4ELi3ELi2EEENS1_21SpaceSpinorColorOrderIdLi4ELi3EEENS_19ChiralToNonRelBasisIddLi4ELi3EEEEE), since we use QUDA_CHIRAL_GAMMA_BASIS. Not sure if that could have something to do with it...
@kostrzewa can you try branch feature/memory-pool? I have fixed an uninitialized variable that had gone unnoticed for too long (remnant code from experiments in overlapping domains that I hadn't appreciated before) and that could be the cause of this.
@maddyscientist The job is in the queue. I found it a bit surprising though that 4 nodes (16 GPUs with 12 GB of memory each) are seemingly insufficient for a 96*48^3 lattice... (twisted+clover)
# QUDA: time spent in reorder_spinor_toQuda: 0.087253 secs
# QUDA: ERROR: Failed to allocate device memory of size 9289728 (cuda_color_spinor_field.cu:605 in allocateGhostBuffer())
# QUDA: ERROR: Aborting (rank 0, host jrc0020, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 2, host jrc0020, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 3, host jrc0020, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 1, host jrc0020, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 9, host jrc0030, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 15, host jrc0032, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 5, host jrc0028, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 13, host jrc0032, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 10, host jrc0030, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 12, host jrc0032, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 11, host jrc0030, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 14, host jrc0032, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 8, host jrc0030, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 4, host jrc0028, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 6, host jrc0028, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
# QUDA: ERROR: Aborting (rank 7, host jrc0028, malloc.cpp:195 in device_pinned_malloc_())
# QUDA: last kernel called was (name=N4quda21TwistCloverGamma5CudaI7double2S1_EE,volume=12x24x24x48,aux=vol=331776,stride=331776,precision=8,TwistFlavour=-1,inverse)
I'm trying on 8 nodes now, but the machine is very busy and it will take a while until the job runs.
One side effect of using the pool memory manager is that memory consumption can increase a bit; it's possible you're running afoul of this. If you set the environment variable QUDA_ENABLE_DEVICE_MEMORY_POOL=0, you can disable the pool to see whether this is the cause.
In the first test, I set all the memory pool environment variables that you mentioned in the pull request to 0 and disabled tuning.
Well there goes that theory......
I've been able to bisect the commit log for the develop branch and have found a point in history which seems to fix all the twisted mass issues. Essentially, anything later than 017870e48532a3281f33516af209c5cf2515aad6 makes it impossible for us to use QUDA on Jureca right now. In particular, the merging of lattice/hotfix/qdpjit-p2p (bd33550) seems to break the code on this machine.
My first thought was that the process pinning done by SLURM is to blame, but since invert_test with a random gauge field works, I doubt that SLURM is the culprit. (I haven't tested with a real gauge field, because I don't have QUDA compiled with QIO and all that jazz.) We diverge from how invert_test calls the solver in that we use QUDA_CHIRAL_GAMMA_BASIS and we have to remap the comms grid because of our TXYZ ordering rather than ZYXT. Maybe these two things explain what's going on...
I haven't yet had the chance to test on another machine, hopefully I'll get around to it soon, but we have some compiler updates to do before I can get the latest commits compiled.
I may have rejoiced a bit too soon: the rate of convergence seems to depend quite strongly on the number of MPI processes with the develop branch at 017870e48532a3281f33516af209c5cf2515aad6. Going from 4 to 8 nodes (16 to 32 K80 GPUs), I see an increase of about 10% in the number of iterations.
The effect might be independent of the parallelisation, however, because I also see strong fluctuations in the number of iterations to convergence in the inversion with a number of different stochastic time-slice (wall) sources (on the same configuration).
# QUDA: CG: Convergence at 16112 iterations, L2 relative residual: iterated = 3.153934e-10, true = 3.153934e-10
# QUDA: CG: Convergence at 16301 iterations, L2 relative residual: iterated = 3.143852e-10, true = 3.143852e-10
# QUDA: CG: Convergence at 16213 iterations, L2 relative residual: iterated = 3.145130e-10, true = 3.145130e-10
# QUDA: CG: Convergence at 16063 iterations, L2 relative residual: iterated = 3.161826e-10, true = 3.161826e-10
# QUDA: CG: Convergence at 16033 iterations, L2 relative residual: iterated = 3.161993e-10, true = 3.161993e-10
# QUDA: CG: Convergence at 21126 iterations, L2 relative residual: iterated = 3.151072e-10, true = 3.151072e-10
# QUDA: CG: Convergence at 16264 iterations, L2 relative residual: iterated = 3.160960e-10, true = 3.160960e-10
# QUDA: CG: Convergence at 16270 iterations, L2 relative residual: iterated = 3.161078e-10, true = 3.161078e-10
# QUDA: CG: Convergence at 16289 iterations, L2 relative residual: iterated = 3.159118e-10, true = 3.159118e-10
# QUDA: CG: Convergence at 16170 iterations, L2 relative residual: iterated = 3.161810e-10, true = 3.161810e-10
# QUDA: CG: Convergence at 17296 iterations, L2 relative residual: iterated = 3.152252e-10, true = 3.152252e-10
# QUDA: CG: Convergence at 16437 iterations, L2 relative residual: iterated = 3.161335e-10, true = 3.161335e-10
@kostrzewa I don't think those are large fluctuations (one comment up); they are to be expected with a solver like conjugate gradient (it would be much worse with BiCGStab).
On the other hand, to test whether the fluctuations you are seeing from 4 to 8 nodes are significant, you should keep the process count constant and rerun any test with a fresh tune cache. There are necessarily fluctuations arising from the autotuner picking different block sizes in the reduction steps, which leads to changes in iteration count when changing the process topology. Alternatively, you could test at 4 and 8 nodes (for example) but disable the autotuner for both runs. This will not make the solver reproducible when changing the process count (e.g., due to variations in the order of the dslash summation at each site on the boundary), but it would reveal whether the autotuning of the block size is the main driver of the fluctuations.
Long term, one thing I'd like to include as an option is the ability to do exact (infinite-precision) reductions, which would remove this reduction block-size variability completely (e.g., http://dx.doi.org/10.1016/j.parco.2015.09.001). I note that QUDA already has an option to do the reduction in emulated quad precision, which damps this issue (though that code needs a bit of a cleanup).
@maddyscientist
I don't think those are large fluctuations (one comment up), and are to be expected with a solver like conjugate gradient (would be much worse with BiCGStab).
They are significantly larger than I'm used to with twisted clover fermions. Fluctuations of O(5-10%) do occur, but the 21126 iterations seen in one instance are somewhat surprising. I understand that the auto-tuner can induce very different rounding behaviour between block sizes.
I will try to investigate this a bit more if I get a chance; for now, it seems to be working with the aforementioned commit before pinned_malloc was added for QDPJIT.
Thanks a lot for the reference. In tmLQCD, we make use of Kahan summations in many places. It might be worth replacing these with superaccumulators...
Just a comment before the hackathon, this may have relevance to the weird behaviour we observed on Jureca.
* ### Important information for users with multi-GPU jobs ### *
* *
* An error was found in the implementation of the mechanism that regulates *
* access to GPU devices. Users of multiple GPUs per node may faces problems. *
* A bug fix is in the making. In the meantime a workaround is available: *
* Please use "srun [srun arguments] fix-gpu-jail [app.] [app. args]" instead *
* of "srun [srun arguments] [app.] [app. args]" to start your application. *
* We apologize for the inconvenience. *
* *
* 2017-02-24 *
Thanks for sharing. Should also be posted to Slack.
Let's hope that solves all Jureca issues!
During the hackathon, I was able to find a setup which seems to work. I've documented the build process here: https://github.com/etmc/tmLQCD/wiki/tmLQCD---QUDA
I still need to check the reliability issues that I observed with a particular workload for sequential propagators with twisted boundary conditions, but it looks as though I'll be able to close this soon.
Closing this issue, since this now seems to be resolved.
On Jureca, for some reason, I see device_pinned_malloc failing consistently, as in the logs above. I've increased the number of devices used to make sure that this is not just an out-of-memory issue, but a 48^3 lattice should fit without problems on 16 K80s... Have you observed anything similar elsewhere?