ERROR: Failed to register pinned memory

dwdoerf commented 3 years ago

I'm using QUDA with MILC on the Cori GPU cluster at NERSC. I'm able to successfully run using a lattice of size 32^4. In this case, QUDA reports that the total device memory used is ~8.5 GB:

Device memory used = 8550.8 MB
Pinned device memory used = 0.0 MB
Managed memory used = 1776.0 MB
Page-locked host memory used = 6398.8 MB
Total host memory used >= 9380.8 MB

I'm targeting an NVIDIA A100 GPU with 40 GB of memory, so I believe I have plenty of memory for a 32x32x32x64 lattice, i.e. twice the size. However, when I try this larger lattice I get "ERROR: Failed to register pinned memory" (see below). Is more temporary space being used than I imagine, so that there truly is not enough memory? Or am I running into some other issue with CUDA or my setup?

*
*
*
ON EACH NODE (RANK) 32 x 32 x 32 x 64
Mallocing 3288.3 MBytes per node for lattice
Disabling GPU-Direct RDMA access
Enabling peer-to-peer copy engine and direct load/store access
QUDA 1.0.0 (git v0.9.0-3434-g2275f54f4-sm_80)
CUDA Driver version = 11000
CUDA Runtime version = 11000
Found device 0: A100-SXM4-40GB
Using device 0: A100-SXM4-40GB
WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
Loaded 315 sets of cached parameters from /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/milc_qcd/milc-benchmark/study3/qudatune.dgx/tunecache.tsv
ERROR: Failed to register pinned memory of size 3288334336 (/global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/milc_qcd/quda/lib/milc_interface.cpp:161 in qudaAllocatePinned())
 (rank 0, host cgpu19, /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/milc_qcd/quda/lib/targets/cuda/malloc.cpp:299 in pinned_malloc_())
       last kernel called was (name=,volume=,aux=)
Saving 315 sets of cached parameters to /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/milc_qcd/milc-benchmark/study3/qudatune.dgx/tunecache_error.tsv
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 1083447.2 ON cgpu19 CANCELLED AT 2020-10-14T13:30:09 ***
srun: error: cgpu19: task 0: Killed
srun: Terminating job step 1083447.2
./run_milc-dgx.sh: END TIME: Wed Oct 14 13:30:09 PDT 2020

weinbe2 commented 3 years ago

The issue here is that you're running out of pinnable host memory at this problem size, not GPU memory (nor necessarily host memory), which is something we need to adapt to on our end. For context, MILC+QUDA uses pinned host memory to get faster host/device copies. Can you try two separate experiments for me, both requiring modifications to generic_ks/fn_links_milc.c:

(1) Line 30, add:
```
printf("create_G_special allocating %lu bytes\n", sites_on_node*4*sizeof(su3_matrix)); fflush(stdout);
```
This is to track the number of allocations of gauge fields, which should be the dominant cost. MILC tends to allocate a few more of these than you'd expect...

(2) Change to non-pinned allocations, hackishly modifying lines 15 and 16 to malloc and free:

[...]
#include "../include/generic_quda.h"
#define special_alloc malloc
#define special_free free
#else
[...]

You can leave the change from (1) in, but make sure you at least run with change (1) without change (2) first.

Please let me know what you find! This will inform the types of changes we'll need to make to avoid this problem.
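
For a rough sense of what the printf in change (1) should report, here is a minimal sketch of the arithmetic. It assumes a double-precision build (the run output in this thread lists "Generic double precision") and uses a hypothetical stand-in type, `su3_matrix_d`, holding 3x3 complex doubles; MILC's real `su3_matrix` is defined in its own headers and may differ in detail. The function names are illustrative only.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for MILC's su3_matrix: 3x3 complex entries,
   each stored as two doubles (assumes a double-precision build). */
typedef struct { double real, imag; } complex_d;
typedef struct { complex_d e[3][3]; } su3_matrix_d;

/* Number of lattice sites held by one rank. */
size_t sites_on_rank(size_t nx, size_t ny, size_t nz, size_t nt) {
  return nx * ny * nz * nt;
}

/* Bytes requested for one gauge field: four link matrices per site. */
size_t gauge_field_bytes(size_t sites) {
  return sites * 4 * sizeof(su3_matrix_d);
}

/* Print in the same format as the diagnostic printf above;
   %zu is the portable conversion for a size_t argument. */
void report_gauge_field(size_t sites) {
  printf("create_G_special allocating %zu bytes\n", gauge_field_bytes(sites));
}
```

Under these assumptions, a 32 x 32 x 32 x 64 local lattice has 2,097,152 sites, and each gauge field allocation would be 1,207,959,552 bytes (about 1.2 GB).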

dwdoerf commented 3 years ago

Neither of these mods seems to have any effect; the output is the same. Here are the mods I made, followed of course by a recompile:

diff --git a/generic_ks/fn_links_milc.c b/generic_ks/fn_links_milc.c
index b225da2f..9e298059 100644
--- a/generic_ks/fn_links_milc.c
+++ b/generic_ks/fn_links_milc.c
@@ -12,8 +12,8 @@
 #define special_free qfree
 #elif defined(USE_FL_GPU)
 #include "../include/generic_quda.h"
-#define special_alloc qudaAllocatePinned
-#define special_free qudaFreePinned
+#define special_alloc malloc
+#define special_free free
 #else
 #define special_alloc malloc
 #define special_free free
@@ -28,6 +28,8 @@ create_G_special(void){
   char myname[] = "create_G_special";
   su3_matrix *m;

+  printf("create_G_special allocating %lu bytes\n", sites_on_node*4*sizeof(su3_matrix)); fflush(stdout);
+
   m = (su3_matrix *)special_alloc(sites_on_node*4*sizeof(su3_matrix));
   if(m==NULL){
     printf("%s: no room\n",myname);

dwdoerf commented 3 years ago

I also scanned the output for the extra printf and found nothing.

detar commented 3 years ago

Hi Doug,

Could you point me to the full output file?

Thanks,

Carleton


dwdoerf commented 3 years ago

Hi Carleton, here you go:

cori10:study3$ cat slurm-1083969.out
./run_milc-dgx.sh: USING milc_in_brief.sh
./run_milc-dgx.sh: Running MILC with lattice of dimension 32x32x32x64
./run_milc-dgx.sh: OMP_PLACES=threads; OMP_PROC_BIND=spread; OMP_NUM_THREADS=8; srun -n 1 -c 16 --cpu_bind=cores  ./su3_rhmd_hisq.dgx
./run_milc-dgx.sh: BEGIN TIME: Wed Oct 14 16:20:00 PDT 2020
SU3 with improved KS action
Microcanonical simulation with refreshing
Rational function hybrid Monte Carlo algorithm
MIMD version 7.8.1
Machine = MPI (portable), with 1 nodes
Host(0) = cgpu19
Username = dwdoerf
start: Wed Oct 14 16:20:02 2020

Options selected...
Generic double precision
C_GLOBAL_INLINE
FEWSUMS
KS_MULTICG=HYBRID
KS_MULTIFF=FNMAT
VECLENGTH=4
INT_ALG=INT_3G1F
HISQ_REUNIT_ALLOW_SVD
HISQ_REUNIT_SVD_REL_ERROR = 1e-08
HISQ_REUNIT_SVD_ABS_ERROR = 1e-08
HISQ_FORCE_FILTER = 5e-05
HISQ_FF_MULTI_WRAPPER is ON
type 0 for no prompts, 1 for prompts, or 2 for proofreading
nx 32
ny 32
nz 32
nt 64
iseed 4563421
n_pseudo 5
load_rhmc_params rationals_m001m05m5.test1
beta 6.3
n_dyn_masses 3
dyn_mass 0.001 0.05 0.5 
dyn_flavors 2 1 1 
u0 0.89
n_pseudo 5
Loading rational function parameters for phi field 0
naik_term_epsilon 0
# New rational function
# Approximation bounds are [1.000000e-15,9.000000e+01]
# Precision of arithmetic is 75
# Degree of the approximation is (9,9)
# Approximating the function (x+4*0.001000^2)^(2/4) (x+4*0.050000^2)^(1/4) (x+4*0.200000^2)^(-3/4) (x+4*99.900000^2)^(0/4)
# Converged at 2795 iterations, error = 2.934506e-06
# Rational function for MD
y_MD -2 -1 3 0 
z_MD 4 4 4 4 
m_MD 0.001 0.05 0.2 99.9 
order_MD 9
Loading order 9 rational function approximation for MD:
f(x) = (x+4*0.001000^2)^(-2/4) (x+4*0.050000^2)^(-1/4)
       (x+4*0.200000^2)^(3/4) (x+4*99.900000^2)^(0/4)
res_MD 1
res_MD 0.000755643
res_MD 0.00113988
res_MD 0.00210416
res_MD 0.00414249
res_MD 0.00832302
res_MD 0.0170731
res_MD 0.0376747
res_MD 0.0297019
res_MD 0.0165647
pole_MD 99.9
pole_MD 4.50838e-06
pole_MD 1.02574e-05
pole_MD 3.49278e-05
pole_MD 0.000134989
pole_MD 0.00053851
pole_MD 0.00214814
pole_MD 0.00804655
pole_MD 0.0279606
pole_MD 0.0900058
# CHECK: f(1.000000e-15) = 3.999988e+02 = 4.000000e+02?
# New rational function
# Approximation bounds are [1.000000e-15,9.000000e+01]
# Precision of arithmetic is 75
# Degree of the approximation is (11,11)
# Approximating the function (x+4*0.001000^2)^(2/8) (x+4*0.050000^2)^(1/8) (x+4*0.200000^2)^(-3/8) (x+4*99.900000^2)^(0/8)
# Converged at 3398 iterations, error = 1.285453e-07
# Rational function for GR
y_GR 2 1 -3 0 
z_GR 8 8 8 8 
m_GR 0.001 0.05 0.2 99.9 
order_GR 11
Loading order 11 rational function approximation for GR:
f(x) = (x+4*0.001000^2)^(2/8) (x+4*0.050000^2)^(1/8)
       (x+4*0.200000^2)^(-3/8) (x+4*99.900000^2)^(0/8)
res_GR 1
res_GR -2.35291e-08
res_GR -1.34601e-07
res_GR -6.10251e-07
res_GR -2.65892e-06
res_GR -1.14554e-05
res_GR -4.8921e-05
res_GR -0.00020428
res_GR -0.000941406
res_GR -0.00506833
res_GR -0.0181194
res_GR -0.0343505
pole_GR 99.9
pole_GR 5.16931e-06
pole_GR 1.11104e-05
pole_GR 3.09265e-05
pole_GR 9.4835e-05
pole_GR 0.000300244
pole_GR 0.000959821
pole_GR 0.00307556
pole_GR 0.0102991
pole_GR 0.0303258
pole_GR 0.077846
pole_GR 0.143892
# CHECK: f(1.000000e-15) = 5.000001e-02 = 5.000000e-02?
# Rational function for FA
y_FA -2 -1 3 0 
z_FA 8 8 8 8 
m_FA 0.001 0.05 0.2 99.9 
order_FA 11
Loading order 11 rational function approximation for FA:
f(x) = (x+4*0.001000^2)^(-2/8) (x+4*0.050000^2)^(-1/8)
       (x+4*0.200000^2)^(3/8) (x+4*99.900000^2)^(0/8)
res_FA 1
res_FA 1.36701e-05
res_FA 3.28201e-05
res_FA 7.60154e-05
res_FA 0.000179933
res_FA 0.000430396
res_FA 0.00103607
res_FA 0.00253036
res_FA 0.00710655
res_FA 0.0143325
res_FA 0.0196813
res_FA 0.0133282
pole_FA 99.9
pole_FA 4.58349e-06
pole_FA 8.87395e-06
pole_FA 2.3628e-05
pole_FA 7.13459e-05
pole_FA 0.00022477
pole_FA 0.000717546
pole_FA 0.00229847
pole_FA 0.00741595
pole_FA 0.0204582
pole_FA 0.0560548
pole_FA 0.120814
Loading rational function parameters for phi field 1
# CHECK: f(1.000000e-15) = 2.000000e+01 = 2.000000e+01?
naik_term_epsilon 0
# New rational function
# Approximation bounds are [1.000000e-15,9.000000e+01]
# Precision of arithmetic is 75
# Degree of the approximation is (7,7)
# Approximating the function (x+4*0.200000^2)^(1/4) (x+4*99.900000^2)^(0/4) (x+4*99.900000^2)^(0/4) (x+4*99.900000^2)^(0/4)
# Converged at 327 iterations, error = 2.398230e-07
# Rational function for MD
y_MD -1 0 0 0 
z_MD 4 4 4 4 
m_MD 0.2 99.9 99.9 99.9 
order_MD 7
Loading order 7 rational function approximation for MD:
f(x) = (x+4*0.200000^2)^(-1/4) (x+4*99.900000^2)^(0/4)
       (x+4*99.900000^2)^(0/4) (x+4*99.900000^2)^(0/4)
res_MD 0.14923
res_MD 0.046061
res_MD 0.1138
res_MD 0.274536
res_MD 0.687619
res_MD 1.83201
res_MD 5.87481
res_MD 38.0862
pole_MD 99.9
pole_MD 0.185283
pole_MD 0.375399
pole_MD 1.05812
pole_MD 3.40313
pole_MD 11.7405
pole_MD 45.73
pole_MD 283.916
# CHECK: f(1.000000e-15) = 1.581138e+00 = 1.581139e+00?
# New rational function
# Approximation bounds are [1.000000e-15,9.000000e+01]
# Precision of arithmetic is 75
# Degree of the approximation is (9,9)
# Approximating the function (x+4*0.200000^2)^(1/8) (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8)
# Converged at 422 iterations, error = 1.700228e-09
# Rational function for GR
y_GR 1 0 0 0 
z_GR 8 8 8 8 
m_GR 0.2 99.9 99.9 99.9 
order_GR 9
Loading order 9 rational function approximation for GR:
f(x) = (x+4*0.200000^2)^(1/8) (x+4*99.900000^2)^(0/8)
       (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8)
res_GR 2.73277
res_GR -0.00512886
res_GR -0.0204789
res_GR -0.0639514
res_GR -0.192686
res_GR -0.585195
res_GR -1.85803
res_GR -6.68946
res_GR -34.4017
res_GR -617.898
pole_GR 99.9
pole_GR 0.186425
pole_GR 0.315826
pole_GR 0.679106
pole_GR 1.64209
pole_GR 4.20254
pole_GR 11.2153
pole_GR 32.0724
pole_GR 110.326
pole_GR 764.415
# CHECK: f(1.000000e-15) = 7.952707e-01 = 7.952707e-01?
# Rational function for FA
y_FA -1 0 0 0 
z_FA 8 8 8 8 
m_FA 0.2 99.9 99.9 99.9 
order_FA 9
Loading order 9 rational function approximation for FA:
f(x) = (x+4*0.200000^2)^(-1/8) (x+4*99.900000^2)^(0/8)
       (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8)
res_FA 0.365929
res_FA 0.0109316
res_FA 0.0292903
res_FA 0.0678756
res_FA 0.157093
res_FA 0.370027
res_FA 0.906451
res_FA 2.45234
res_FA 8.71503
res_FA 75.8971
pole_FA 99.9
pole_FA 0.178875
pole_FA 0.290944
pole_FA 0.612037
pole_FA 1.46486
pole_FA 3.72845
pole_FA 9.8933
pole_FA 27.9493
pole_FA 92.735
pole_FA 546.062
Loading rational function parameters for phi field 2
# CHECK: f(1.000000e-15) = 1.257433e+00 = 1.257433e+00?
naik_term_epsilon 0
# New rational function
# Approximation bounds are [1.000000e-15,9.000000e+01]
# Precision of arithmetic is 75
# Degree of the approximation is (7,7)
# Approximating the function (x+4*0.200000^2)^(1/4) (x+4*99.900000^2)^(0/4) (x+4*99.900000^2)^(0/4) (x+4*99.900000^2)^(0/4)
# Converged at 327 iterations, error = 2.398230e-07
# Rational function for MD
y_MD -1 0 0 0 
z_MD 4 4 4 4 
m_MD 0.2 99.9 99.9 99.9 
order_MD 7
Loading order 7 rational function approximation for MD:
f(x) = (x+4*0.200000^2)^(-1/4) (x+4*99.900000^2)^(0/4)
       (x+4*99.900000^2)^(0/4) (x+4*99.900000^2)^(0/4)
res_MD 0.14923
res_MD 0.046061
res_MD 0.1138
res_MD 0.274536
res_MD 0.687619
res_MD 1.83201
res_MD 5.87481
res_MD 38.0862
pole_MD 99.9
pole_MD 0.185283
pole_MD 0.375399
pole_MD 1.05812
pole_MD 3.40313
pole_MD 11.7405
pole_MD 45.73
pole_MD 283.916
# CHECK: f(1.000000e-15) = 1.581138e+00 = 1.581139e+00?
# New rational function
# Approximation bounds are [1.000000e-15,9.000000e+01]
# Precision of arithmetic is 75
# Degree of the approximation is (9,9)
# Approximating the function (x+4*0.200000^2)^(1/8) (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8)
# Converged at 422 iterations, error = 1.700228e-09
# Rational function for GR
y_GR 1 0 0 0 
z_GR 8 8 8 8 
m_GR 0.2 99.9 99.9 99.9 
order_GR 9
Loading order 9 rational function approximation for GR:
f(x) = (x+4*0.200000^2)^(1/8) (x+4*99.900000^2)^(0/8)
       (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8)
res_GR 2.73277
res_GR -0.00512886
res_GR -0.0204789
res_GR -0.0639514
res_GR -0.192686
res_GR -0.585195
res_GR -1.85803
res_GR -6.68946
res_GR -34.4017
res_GR -617.898
pole_GR 99.9
pole_GR 0.186425
pole_GR 0.315826
pole_GR 0.679106
pole_GR 1.64209
pole_GR 4.20254
pole_GR 11.2153
pole_GR 32.0724
pole_GR 110.326
pole_GR 764.415
# CHECK: f(1.000000e-15) = 7.952707e-01 = 7.952707e-01?
# Rational function for FA
y_FA -1 0 0 0 
z_FA 8 8 8 8 
m_FA 0.2 99.9 99.9 99.9 
order_FA 9
Loading order 9 rational function approximation for FA:
f(x) = (x+4*0.200000^2)^(-1/8) (x+4*99.900000^2)^(0/8)
       (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8)
res_FA 0.365929
res_FA 0.0109316
res_FA 0.0292903
res_FA 0.0678756
res_FA 0.157093
res_FA 0.370027
res_FA 0.906451
res_FA 2.45234
res_FA 8.71503
res_FA 75.8971
pole_FA 99.9
pole_FA 0.178875
pole_FA 0.290944
pole_FA 0.612037
pole_FA 1.46486
pole_FA 3.72845
pole_FA 9.8933
pole_FA 27.9493
pole_FA 92.735
pole_FA 546.062
Loading rational function parameters for phi field 3
# CHECK: f(1.000000e-15) = 1.257433e+00 = 1.257433e+00?
naik_term_epsilon 0
# New rational function
# Approximation bounds are [1.000000e-15,9.000000e+01]
# Precision of arithmetic is 75
# Degree of the approximation is (7,7)
# Approximating the function (x+4*0.200000^2)^(1/4) (x+4*99.900000^2)^(0/4) (x+4*99.900000^2)^(0/4) (x+4*99.900000^2)^(0/4)
# Converged at 327 iterations, error = 2.398230e-07
# Rational function for MD
y_MD -1 0 0 0 
z_MD 4 4 4 4 
m_MD 0.2 99.9 99.9 99.9 
order_MD 7
Loading order 7 rational function approximation for MD:
f(x) = (x+4*0.200000^2)^(-1/4) (x+4*99.900000^2)^(0/4)
       (x+4*99.900000^2)^(0/4) (x+4*99.900000^2)^(0/4)
res_MD 0.14923
res_MD 0.046061
res_MD 0.1138
res_MD 0.274536
res_MD 0.687619
res_MD 1.83201
res_MD 5.87481
res_MD 38.0862
pole_MD 99.9
pole_MD 0.185283
pole_MD 0.375399
pole_MD 1.05812
pole_MD 3.40313
pole_MD 11.7405
pole_MD 45.73
pole_MD 283.916
# CHECK: f(1.000000e-15) = 1.581138e+00 = 1.581139e+00?
# New rational function
# Approximation bounds are [1.000000e-15,9.000000e+01]
# Precision of arithmetic is 75
# Degree of the approximation is (9,9)
# Approximating the function (x+4*0.200000^2)^(1/8) (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8)
# Converged at 422 iterations, error = 1.700228e-09
# Rational function for GR
y_GR 1 0 0 0 
z_GR 8 8 8 8 
m_GR 0.2 99.9 99.9 99.9 
order_GR 9
Loading order 9 rational function approximation for GR:
f(x) = (x+4*0.200000^2)^(1/8) (x+4*99.900000^2)^(0/8)
       (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8)
res_GR 2.73277
res_GR -0.00512886
res_GR -0.0204789
res_GR -0.0639514
res_GR -0.192686
res_GR -0.585195
res_GR -1.85803
res_GR -6.68946
res_GR -34.4017
res_GR -617.898
pole_GR 99.9
pole_GR 0.186425
pole_GR 0.315826
pole_GR 0.679106
pole_GR 1.64209
pole_GR 4.20254
pole_GR 11.2153
pole_GR 32.0724
pole_GR 110.326
pole_GR 764.415
# CHECK: f(1.000000e-15) = 7.952707e-01 = 7.952707e-01?
# Rational function for FA
y_FA -1 0 0 0 
z_FA 8 8 8 8 
m_FA 0.2 99.9 99.9 99.9 
order_FA 9
Loading order 9 rational function approximation for FA:
f(x) = (x+4*0.200000^2)^(-1/8) (x+4*99.900000^2)^(0/8)
       (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8)
res_FA 0.365929
res_FA 0.0109316
res_FA 0.0292903
res_FA 0.0678756
res_FA 0.157093
res_FA 0.370027
res_FA 0.906451
res_FA 2.45234
res_FA 8.71503
res_FA 75.8971
pole_FA 99.9
pole_FA 0.178875
pole_FA 0.290944
pole_FA 0.612037
pole_FA 1.46486
pole_FA 3.72845
pole_FA 9.8933
pole_FA 27.9493
pole_FA 92.735
pole_FA 546.062
Loading rational function parameters for phi field 4
# CHECK: f(1.000000e-15) = 1.257433e+00 = 1.257433e+00?
naik_term_epsilon -0.15148
# New rational function
# Approximation bounds are [1.000000e-15,9.000000e+01]
# Precision of arithmetic is 75
# Degree of the approximation is (7,7)
# Approximating the function (x+4*0.500000^2)^(1/4) (x+4*99.900000^2)^(0/4) (x+4*99.900000^2)^(0/4) (x+4*99.900000^2)^(0/4)
# Converged at 280 iterations, error = 4.040987e-09
# Rational function for MD
y_MD -1 0 0 0 
z_MD 4 4 4 4 
m_MD 0.5 99.9 99.9 99.9 
order_MD 7
Loading order 7 rational function approximation for MD:
f(x) = (x+4*0.500000^2)^(-1/4) (x+4*99.900000^2)^(0/4)
       (x+4*99.900000^2)^(0/4) (x+4*99.900000^2)^(0/4)
res_MD 0.13325
res_MD 0.126692
res_MD 0.273789
res_MD 0.55645
res_MD 1.18659
res_MD 2.78588
res_MD 8.31586
res_MD 53.2089
pole_MD 99.9
pole_MD 1.09963
pole_MD 1.77462
pole_MD 3.7756
pole_MD 9.34147
pole_MD 25.7022
pole_MD 83.6423
pole_MD 462.526
# CHECK: f(1.000000e-15) = 1.000000e+00 = 1.000000e+00?
# New rational function
# Approximation bounds are [1.000000e-15,9.000000e+01]
# Precision of arithmetic is 75
# Degree of the approximation is (9,9)
# Approximating the function (x+4*0.500000^2)^(1/8) (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8)
# Converged at 361 iterations, error = 9.642632e-12
# Rational function for GR
y_GR 1 0 0 0 
z_GR 8 8 8 8 
m_GR 0.5 99.9 99.9 99.9 
order_GR 9
Loading order 9 rational function approximation for GR:
f(x) = (x+4*0.500000^2)^(1/8) (x+4*99.900000^2)^(0/8)
       (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8)
res_GR 2.89197
res_GR -0.0235578
res_GR -0.0828855
res_GR -0.217646
res_GR -0.544793
res_GR -1.38607
res_GR -3.77659
res_GR -12.1391
res_GR -58.589
res_GR -1030.78
pole_GR 99.9
pole_GR 1.10406
pole_GR 1.57368
pole_GR 2.71539
pole_GR 5.27759
pole_GR 11.0584
pole_GR 24.7339
pole_GR 60.9958
pole_GR 187.723
pole_GR 1219.23
# CHECK: f(1.000000e-15) = 1.000000e+00 = 1.000000e+00?
# Rational function for FA
y_FA -1 0 0 0 
z_FA 8 8 8 8 
m_FA 0.5 99.9 99.9 99.9 
order_FA 9
Loading order 9 rational function approximation for FA:
f(x) = (x+4*0.500000^2)^(-1/8) (x+4*99.900000^2)^(0/8)
       (x+4*99.900000^2)^(0/8) (x+4*99.900000^2)^(0/8)
res_FA 0.345785
res_FA 0.0359642
res_FA 0.0870126
res_FA 0.174625
res_FA 0.347156
res_FA 0.709395
res_FA 1.54168
res_FA 3.82992
res_FA 13.0407
res_FA 112.659
pole_FA 99.9
pole_FA 1.0747
pole_FA 1.48735
pole_FA 2.51677
pole_FA 4.83418
pole_FA 10.0472
pole_FA 22.2737
pole_FA 54.0493
pole_FA 159.625
pole_FA 875.531
Maximum rational func order is 11
Naik term correction structure of multi_x:
n_naiks 2
n_pseudo_naik[0]=4
n_orders_naik[0]=30
eps_naik[0]=0.000000
n_pseudo_naik[1]=1
n_orders_naik[1]=7
eps_naik[1]=-0.151480
n_order_naik_total 37
LAYOUT = Hypercubes, options = hyper_prime,
automatic hyper_prime layout
ON EACH NODE (RANK) 32 x 32 x 32 x 64
Mallocing 3288.3 MBytes per node for lattice
Disabling GPU-Direct RDMA access
Enabling peer-to-peer copy engine and direct load/store access
QUDA 1.0.0 (git v0.9.0-3434-g2275f54f4-sm_80)
CUDA Driver version = 11000
CUDA Runtime version = 11000
Found device 0: A100-SXM4-40GB
Using device 0: A100-SXM4-40GB
WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
Loaded 315 sets of cached parameters from /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/milc_qcd/milc-benchmark/study3/qudatune.dgx/tunecache.tsv
ERROR: Failed to register pinned memory of size 3288334336 (/global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/milc_qcd/quda/lib/milc_interface.cpp:161 in qudaAllocatePinned())
 (rank 0, host cgpu19, /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/milc_qcd/quda/lib/targets/cuda/malloc.cpp:299 in pinned_malloc_())
       last kernel called was (name=,volume=,aux=)
Saving 315 sets of cached parameters to /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/milc_qcd/milc-benchmark/study3/qudatune.dgx/tunecache_error.tsv
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 1083969.0 ON cgpu19 CANCELLED AT 2020-10-14T16:20:17 ***
srun: error: cgpu19: task 0: Killed
srun: Terminating job step 1083969.0
./run_milc-dgx.sh: END TIME: Wed Oct 14 16:20:18 PDT 2020

detar commented 3 years ago

Hi Doug,

I added a cc to Evan because I don't know if he sees the GitHub thread. The pinned memory failure happens well before the fn_links procedure, so Evan's hacks won't help. During initialization, before the lattice is read, the make_lattice routine allocates memory for the lattice. With QUDA enabled, that allocation uses pinned memory. In this case it is about 3.3 GB, all on a single node.

Best,

Carleton
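
The size in the error message can be cross-checked against the lattice volume. The following is a small sketch of that arithmetic, not code from MILC or QUDA; the function names are illustrative, and the only inputs are numbers taken from the log above.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Sites on one rank for an nx x ny x nz x nt local lattice. */
uint64_t local_volume(uint64_t nx, uint64_t ny, uint64_t nz, uint64_t nt) {
  return nx * ny * nz * nt;
}

/* Bytes per site implied by a total allocation size.
   Returns 0 if the total is not a whole multiple of the site count. */
uint64_t bytes_per_site(uint64_t total_bytes, uint64_t sites) {
  if (sites == 0 || total_bytes % sites != 0) return 0;
  return total_bytes / sites;
}

/* Print a one-line summary, with the total in units of 10^6 bytes. */
void summarize_allocation(uint64_t total_bytes, uint64_t sites) {
  printf("%" PRIu64 " bytes / %" PRIu64 " sites = %" PRIu64
         " bytes per site (%.1f MB total)\n",
         total_bytes, sites, bytes_per_site(total_bytes, sites),
         (double)total_bytes / 1.0e6);
}
```

For the failing run, `bytes_per_site(3288334336, local_volume(32, 32, 32, 64))` comes out to exactly 1568 bytes for each of 2,097,152 sites, and 3288334336 bytes is 3288.3 MB in units of 10^6 bytes. That is consistent with the failed pinned registration being the same allocation as the "Mallocing 3288.3 MBytes per node for lattice" line.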

# Rational function for FA
y_FA -1 0 0 0
z_FA 8 8 8 8
m_FA 0.5 99.9 99.9 99.9
order_FA 9
Loading order 9 rational function approximation for FA:
f(x) = (x+40.500000^2)^(-1/8) (x+499.900000^2)^(0/8) (x+499.900000^2)^(0/8) (x+499.900000^2)^(0/8)
res_FA 0.345785
res_FA 0.0359642
res_FA 0.0870126
res_FA 0.174625
res_FA 0.347156
res_FA 0.709395
res_FA 1.54168
res_FA 3.82992
res_FA 13.0407
res_FA 112.659
pole_FA 99.9
pole_FA 1.0747
pole_FA 1.48735
pole_FA 2.51677
pole_FA 4.83418
pole_FA 10.0472
pole_FA 22.2737
pole_FA 54.0493
pole_FA 159.625
pole_FA 875.531
Maximum rational func order is 11
Naik term correction structure of multi_x:
n_naiks 2
n_pseudo_naik[0]=4
n_orders_naik[0]=30
eps_naik[0]=0.000000
n_pseudo_naik[1]=1
n_orders_naik[1]=7
eps_naik[1]=-0.151480
n_order_naik_total 37
LAYOUT = Hypercubes, options = hyper_prime,
automatic hyper_prime layout
ON EACH NODE (RANK) 32 x 32 x 32 x 64
Mallocing 3288.3 MBytes per node for lattice
Disabling GPU-Direct RDMA access
Enabling peer-to-peer copy engine and direct load/store access
QUDA 1.0.0 (git v0.9.0-3434-g2275f54f4-sm_80)
CUDA Driver version = 11000
CUDA Runtime version = 11000
Found device 0: A100-SXM4-40GB
Using device 0: A100-SXM4-40GB
WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
Loaded 315 sets of cached parameters from /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/milc_qcd/milc-benchmark/study3/qudatune.dgx/tunecache.tsv
ERROR: Failed to register pinned memory of size 3288334336 (/global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/milc_qcd/quda/lib/milc_interface.cpp:161 in qudaAllocatePinned())
 (rank 0, host cgpu19, /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/milc_qcd/quda/lib/targets/cuda/malloc.cpp:299 in pinned_malloc_())
       last kernel called was (name=,volume=,aux=)
Saving 315 sets of cached parameters to /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/milc_qcd/milc-benchmark/study3/qudatune.dgx/tunecache_error.tsv

MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: STEP 1083969.0 ON cgpu19 CANCELLED AT 2020-10-14T16:20:17
srun: error: cgpu19: task 0: Killed
srun: Terminating job step 1083969.0
./run_milc-dgx.sh: END TIME: Wed Oct 14 16:20:18 PDT 2020


dwdoerf commented 3 years ago

Evan, if you'd like to try to recreate this issue you can find my MILC benchmark at:

https://gitlab.com/NERSC/milc-benchmark/-/tree/quda-debug/quda-debug

I set up a quda-debug branch with the input scripts I'm using. It also has the slurm outputs from my runs.

weinbe2 commented 3 years ago

Thanks @dwdoerf , @detar , I'll be looking into this today.

@dwdoerf, in the meantime, can you please run one more test for me? In the file lib/targets/cuda/malloc.cpp, please modify line 299:

errorQuda("Failed to register pinned memory of size %zu (%s:%d in %s())\n", size, file, line, func);

to read:

errorQuda("Failed to register pinned memory of size %zu (%s:%d in %s()), CUDA error: %s\n", size, file, line, func, cudaGetErrorString(err));

Of course this won't solve the problem, but it'll give some more useful information. (FYI, we'll patch this error fix in more formally soon, because we should have had this in the first place...)

weinbe2 commented 3 years ago

Ah, and one more ask @dwdoerf, can you please add a call to nvidia-smi at the top of your runscript (before calling MILC, anyway)---I want to make sure I'm testing with the right CUDA version and driver, just in case. Thanks again!

dwdoerf commented 3 years ago

Evan, here's the output after your requested mods: https://gitlab.com/NERSC/milc-benchmark/-/blob/quda-debug/quda-debug/slurm-1086216.out

weinbe2 commented 3 years ago

Thanks @dwdoerf. I tested your workflow on a local machine based on your scripts (with decently close driver and CUDA versions, at least) and couldn't reproduce the error, so there are a few other things I want to try. The error in your output was:

OS call failed or operation not supported on this OS

I can't think of a specific reason why this error's coming up, so I want to try a few more tests (trying to verify that it's not because of something weird in QUDA) before maybe bothering the Cori admins.

Two tests for now:

Run a few QUDA unit tests with a matching volume. This is to try to exclude it being a QUDA-specific issue.

From [quda build directory]/tests, please run (modifying mpirun as appropriate)

export QUDA_ENABLE_TUNING=0 # No need to tune since these are memory tests
mpirun -np 1 ./blas_test --sdim 32 --tdim 64 --dslash-type asqtad --niter 10
mpirun -np 1 ./staggered_dslash_test --dslash-type asqtad --sdim 32 --tdim 64 --prec double --niter 10
mpirun -np 1 ./llfat_test --prec double --sdim 32 --tdim 64 --niter 10
mpirun -np 1 ./gauge_force_test --prec double --sdim 32 --tdim 64 --niter 10

This covers the common patterns in the MILC RHMC benchmark.

Query memory consumption during the run/at an error, which will need an interactive node (at least for part of it)

Querying memory usage during the run, so something along the lines of:

$ ./run-L32x64.sh > /dev/null &
$ while true; do free -h; sleep 1; done

I intentionally didn't 2>&1, since the error should spit to stderr, letting you know when to Ctrl+C out.

Querying memory at the error, which will require cuda-gdb. From the interactive node, run:

$ cuda-gdb
[drops into cuda-gdb shell]
> set cuda api_failures stop # break on API error
> shell ./run-L32x64.sh
[... wait until break ...]
> shell free -h
[... record output ...]
> quit

That should do it for now. Let me know if you have any questions/hit any snags, and thanks in advance!

dwdoerf commented 3 years ago

The llfat_test is failing. The other 3 tests ran fine. Note I tried this with one of our V100 nodes and llfat_test works fine. So it seems to be unique to our A100 DGX test bed.

cgpu19:tests$ srun -n 1 ./llfat_test --prec double --sdim 32 --tdim 64 --niter 10
[1602837668.685606] [cgpu19:44335:0]    ucp_context.c:1028 UCX  ERROR exceeded transports/devices limit (69 requested, up to 64 are supported)
Disabling GPU-Direct RDMA access
Enabling peer-to-peer copy engine and direct load/store access
Rank order is column major (t running fastest)
running the following test:
link_precision           link_reconstruct           space_dimension        T_dimension       Ordering
double                       18                         32/32/32/                  64             milc 
Grid partition info:     X  Y  Z  T
                         0  0  0  0
QUDA 1.0.0 (git v0.9.0-3434-g2275f54f4-sm_80)
CUDA Driver version = 11000
CUDA Runtime version = 11000
Found device 0: A100-SXM4-40GB
Using device 0: A100-SXM4-40GB
WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
WARNING: Autotuning disabled
ERROR: Failed to register pinned memory of size 456855552 (llfat_test.cpp:62 in llfat_test()), CUDA error: OS call failed or operation not supported on this OS
 (rank 0, host cgpu19, /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/milc_qcd/quda/lib/targets/cuda/malloc.cpp:299 in pinned_malloc_())
       last kernel called was (name=,volume=,aux=)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 1088184.3 ON cgpu19 CANCELLED AT 2020-10-16T01:41:11 ***
srun: error: cgpu19: task 0: Killed
srun: Terminating job step 1088184.3

The OS on both our V100 and A100 platforms is OpenSUSE:

cgpu19:tests$ lsb-release -a
LSB Version:    n/a
Distributor ID: openSUSE
Description:    openSUSE Leap 15.0
Release:    15.0
Codename:   n/a

weinbe2 commented 3 years ago

Thank you, Doug. This result does seem reasonable: llfat_test allocates far more memory than the other tests, comparable to what MILC allocates, so I'm not surprised that this test in particular triggers the issue.

I put together a decently minimal reproducer below. It should compile without issue with nvcc. It takes one command line argument, an allocation size in bytes. Give it a whirl on the A100 platform, sweeping from 100 MiB to 8 GiB in multiples of 100 MiB to pinpoint the problematic size. If this does reproduce the issue above some allocation size, the next step may be to forward it along to the appropriate sysadmins (and from there we can try to loop in the right NVIDIA folk as appropriate).

// Simple reproducer of the pinned memory issue reported in
// https://github.com/lattice/quda/issues/1065

#include <unistd.h>   // for getpagesize()
#include <stdlib.h>
#include <iostream>

// Aligned malloc, based on 
// https://github.com/lattice/quda/blob/develop/lib/targets/cuda/malloc.cpp#L144
// Returns a pointer allocated to the page size, along with the (padded) buffer size

// for some quirky reason my compiler complained if I didn't have a separate declaration/definition...
void *aligned_malloc(size_t* base_size, const size_t size);

void *aligned_malloc(size_t* base_size, const size_t size) {
  void *ptr = nullptr;

  static int page_size = 2 * getpagesize();
  *base_size = ((size + page_size - 1) / page_size) * page_size; // round up to the nearest multiple of page_size
  int align = posix_memalign(&ptr, page_size, *base_size);
  if (!ptr || align != 0) {
    std::cerr << "Failed to allocate aligned host memory of size " << size << "\n";
    exit(-1);
  }
  return ptr;
}

// Routine to allocate pinned (page-locked) memory based on
// https://github.com/lattice/quda/blob/develop/lib/targets/cuda/malloc.cpp#L292
void *pinned_malloc(const size_t size)
{
  size_t base_size = 0; // overridden by padded size
  void *ptr = aligned_malloc(&base_size, size);

  cudaError_t err = cudaHostRegister(ptr, base_size, cudaHostRegisterDefault);
  if (err != cudaSuccess) {
    std::cerr << "Failed to register pinned memory of size " << base_size << " , padded from " << size << ", CUDA error: " << cudaGetErrorString(err) << "\n";
    ptr = nullptr;
  } else {
    std::cout << "Successful allocation of size " << base_size << " , padded from " << size << "\n";
  }
  return ptr;
}

int main(int argc, char** argv)
{
  if (argc != 2) {
    std::cerr << "Expects one argument, the pinned allocation size." << "\n";
    return -1;
  }

  const int size = atoi(argv[1]);
  void* allocation = pinned_malloc(size);

  // if we did make it here, clean up.
  if (allocation != nullptr) {
    cudaHostUnregister(allocation);
    free (allocation);
  }

  return 0;
}
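
The padding arithmetic in aligned_malloc() and the suggested sweep can both be checked on the host, without a GPU. The following is only an illustrative sketch: `round_up_to_multiple` and `sweep_sizes` are hypothetical helper names that are not part of QUDA or the reproducer, and the sweep assumes a 64-bit `size_t` so that 8 GiB is representable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: round `size` up to the next multiple of `alignment`,
// the same arithmetic aligned_malloc() applies before registering the buffer.
std::size_t round_up_to_multiple(std::size_t size, std::size_t alignment)
{
  assert(alignment > 0); // a zero alignment would divide by zero
  return ((size + alignment - 1) / alignment) * alignment;
}

// Hypothetical helper: list the byte counts step, 2*step, ... that do not
// exceed max_size. Counting multiples avoids overflow in the loop variable.
std::vector<std::size_t> sweep_sizes(std::size_t step, std::size_t max_size)
{
  std::vector<std::size_t> sizes;
  if (step == 0) return sizes; // nothing sensible to sweep
  const std::size_t count = max_size / step;
  for (std::size_t k = 1; k <= count; ++k) sizes.push_back(k * step);
  return sizes;
}
```

Calling `sweep_sizes` with a step of 100 MiB and a maximum of 8 GiB yields the byte counts to pass to the reproducer one at a time; each is then padded by the reproducer itself.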

dwdoerf commented 3 years ago

Evan, I've run your reproducer and it fails somewhere between 2GiB and (2GiB + 32MiB). So I assume this limit is some sort of Linux configuration parameter? What shall I ask the Cori GPU admins to do?
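
One way to narrow a window like this further is to bisect on the allocation size. The sketch below is illustrative only: `first_failing_size` is a hypothetical helper, the `succeeds` callback stands in for running the reproducer at a given size, and it assumes that every size below some threshold registers successfully while every size at or above it fails.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical helper: given a size `lo` known to succeed and a size `hi`
// known to fail, shrink the window until it is no wider than `granularity`
// and return its upper end, which is a size known to fail.
std::size_t first_failing_size(std::size_t lo, std::size_t hi,
                               const std::function<bool(std::size_t)> &succeeds,
                               std::size_t granularity)
{
  assert(lo < hi);
  if (granularity == 0) granularity = 1; // a zero width would never terminate
  while (hi - lo > granularity) {
    const std::size_t mid = lo + (hi - lo) / 2;
    if (succeeds(mid)) {
      lo = mid; // threshold lies above mid
    } else {
      hi = mid; // threshold lies at or below mid
    }
  }
  return hi;
}
```

With a 32 MiB window and a granularity of one page, this takes about a dozen runs of the reproducer.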

FYI, I had to change the type of the variable size in main to size_t, and change atoi() to atol() to keep size from overflowing for large memory sizes.
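
For reference, the request that failed in the MILC run, 3288334336 bytes, is larger than the biggest value a 32-bit `int` can hold (2147483647), so it cannot pass through `atoi()` intact. A minimal sketch of a stricter parser is below; `parse_size` is a hypothetical helper, not part of the reproducer or QUDA, and it uses `strtoull` instead of the `atol()` change described above so that malformed or out-of-range input can be rejected.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <limits>

// Hypothetical helper: parse a decimal byte count into `out`.
// Returns false for empty, negative, non-numeric, or out-of-range input.
bool parse_size(const char *text, std::size_t &out)
{
  if (text == nullptr || *text == '\0' || *text == '-') return false;

  errno = 0;
  char *end = nullptr;
  const unsigned long long value = std::strtoull(text, &end, 10);

  // ERANGE: too large for unsigned long long; end checks: trailing junk or no digits
  if (errno == ERANGE || end == text || *end != '\0') return false;
  // Guard platforms where size_t is narrower than unsigned long long
  if (value > std::numeric_limits<std::size_t>::max()) return false;

  out = static_cast<std::size_t>(value);
  return true;
}
```

In the reproducer's main(), the result of `parse_size(argv[1], size)` could be checked before calling pinned_malloc(), printing a usage message on failure.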

weinbe2 commented 3 years ago

Thanks for the update, Doug. (And sorry about the typo---it's almost poetic that I made an integer overflow mistake...)

As an update on my end, I'm circling back with some folks internally and I'll get back to you soon.

weinbe2 commented 3 years ago

@dwdoerf FYI, Max Katz (NVIDIA) hopped on the Cori A100 machine you've been using and has also reproduced the issue (not a surprise, but a good sanity check). Investigations are ongoing.

dwdoerf commented 3 years ago

@weinbe2 Thanks for the update. Perfect, putting it in Max's hands!

maxpkatz commented 3 years ago

@dwdoerf I've been looking into this for the past few days. While I was investigating this I turned up some configuration issues on cgpu that needed to be resolved, so that was just done by CSG, but unfortunately they turned out not to be the culprit here. I've filed an internal NVIDIA ticket to discuss this with the CUDA team. I'll provide an update when I've got one.

dwdoerf commented 3 years ago

Hi Max, thanks for the status update!

Doug


dwdoerf commented 3 years ago

Hi @maxpkatz just checking to see if there's been any progress on a resolution?

weinbe2 commented 3 years ago

Hey @dwdoerf --- I can't speak for work on resolving the exact issue as Max is the one shepherding it along, but I realized there's a potentially straightforward work-around (and in that regard I'm sorry I didn't think of it sooner). In short it's swapping pinned host memory allocations for managed memory allocations, which preserves the plumbing requirement of being able to dereference the pointer on both the host and device. While this will imply a performance hit, it will unblock your tests and will hopefully be an effect on the margins.

In the file lib/milc_interface.cpp:

https://github.com/lattice/quda/blob/develop/lib/milc_interface.cpp#L164 --- replace pool_pinned_malloc with managed_malloc
https://github.com/lattice/quda/blob/develop/lib/milc_interface.cpp#L166 --- replace pool_pinned_free with managed_free

Of course, it's possible that this approach will also hit an issue... but it's a quick test, at least.

dwdoerf commented 3 years ago

Hi Evan, I made the above change, but I get the following error:

<clip>
MAKING PATH TABLES
Combined fattening and long-link calculation time: 0.232823
Combined fattening and long-link calculation time: 0.203315
FLTIME: time = 1.456096e+00 (HISQ QUDA D) mflops = 9.133244e+04
Symanzik 1x1 + 1x2 + 1x1x1 action with HISQ quark loops
gauge_action: total_dyn_flavors = 4
loop coefficients: nloop rep loop_coeff  multiplicity
                    0 0      1.000000e+00     6
                    1 0      -3.316969e-02     12
                    2 0      2.809994e-03     16
WARMUPS COMPLETED
Omelyan integration, 3 gauge for one 1 fermion step, steps= 2 eps= 3.330000e-02 alpha= 1.000000e-01 beta= 1.000000e-01
ERROR: cudaHostGetDevicePointer failed with error invalid argument (cuda_gauge_field.cpp:583 in copy() (rank 0, host cgpu19, /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/milc_qcd/quda/lib/targets/cuda/malloc.cpp:515 in get_mapped_device_pointer_())
       last kernel called was (name=cudaMemset,volume=bytes=335544320,aux=cudaGaugeField,cuda_gauge_field.cpp,54)

weinbe2 commented 3 years ago

Sorry about that, investigating now. I have the cycles to make sure I can get a full run through. I'm encouraged to see the allocation went through, at least.

weinbe2 commented 3 years ago

Hey @dwdoerf , I just pushed a temporary branch hotfix/cori-hack-do-not-merge .

Using this branch I successfully ran the NERSC Small RHMC benchmark, one and two GPU runs, though admittedly it was on Ubuntu+Volta. Please give it a try. (FYI: no guarantee it'd run multi-node due to lack of testing capacity at the moment, but I don't think that's something you're testing anyway?)

weinbe2 commented 3 years ago

Reported fixed by @rgayatri23, closing.
