Closed sebhtml closed 11 years ago
A test with MPI I/O on 4096 MPI ranks works, however:
/home/c/clumeq/sebhtml/software/Ray-on-IBM-Blue-Gene-Q/MPI-IO
#!/bin/sh
# @ job_name = Test-MPI-IO-1024nodes,4ranksPerNode,4096ranks-2013-03-25-1
# @ job_type = bluegene
# @ comment = ""
# @ output = $(job_name).$(Host).$(jobid).out
# @ error = $(job_name).$(Host).$(jobid).err
# @ bg_size = 1024
# @ wall_clock_limit = 00:10:00
# @ bg_connectivity = Torus
# @ queue
# the BGLOCKLESSMPIO_F_TYPE line is to tell MPIIO that we are using GPFS
# 0x47504653 is GPFS_SUPER_MAGIC
runjob --np 4096 --ranks-per-node=4 --cwd=$PWD \
--envs BGLOCKLESSMPIO_F_TYPE=0x47504653 \
: /home/c/clumeq/sebhtml/software/Ray-on-IBM-Blue-Gene-Q/MPI-IO/Test-MPI-IO
MPI_IO=y works well elsewhere too.
Another job submitted.
bgq-fen1-$ llq -b
Id                    Owner    Submitted  LL JS BS Block Size
bgq-fen1-ib0.1854.0 sebhtml 4/9 07:46 I
1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted
Evaluation: 7 human-hours
Hello,
This morning, I ran a separate test on bgq-fen1 with the program Test-MPI-IO [1].
The program itself: https://github.com/sebhtml/Ray-on-IBM-Blue-Gene-Q/blob/master/MPI-IO/Test-MPI-IO.cpp
The command: https://github.com/sebhtml/Ray-on-IBM-Blue-Gene-Q/blob/master/MPI-IO/Test-1024-nodes-4096-ranks.sh
It ran fine (writing a ~20 GiB output file).
So clearly the problem is in Ray.
I then dug into the log of the job SRA056234-Picea-glauca-2013-04-09-2.bgq-fen1.1860.out and found that the offsets are buggy, probably because of an overflow (the white spruce is my first assembly that exceeds 2^32 - 1 bytes in size!).
Here is what I found:
bgq-fen1-$ grep "is appending its fusions" SRA056234-Picea-glauca-2013-04-09-2.bgq-fen1.1860.out | head
Rank 3840 is appending its fusions at 25656841821
Rank 1448 is appending its fusions at 25329147991
Rank 3508 is appending its fusions at 25775185315
Rank 692 is appending its fusions at 25725624865
Rank 80 is appending its fusions at 25555250262
Rank 3796 is appending its fusions at 25379490566
Rank 780 is appending its fusions at 25542351662
Rank 104 is appending its fusions at 25706873263
Rank 2212 is appending its fusions at 25719291698
Rank 552 is appending its fusions at 25580953486
It makes no sense for Rank 80 to start at 25555250262 (~25 GB): the assembly size is around 18-20 GB, so the offset of Rank 80 should be around 80.0/4096 * 20 GB ≈ 0.4 GB.
I'll fix this and let you know how it goes.
Also, an intern on our team substantially improved the checkpointing code so that write operations are grouped in 16 MiB chunks instead of just a few bytes.
[1] https://github.com/sebhtml/Ray-on-IBM-Blue-Gene-Q/tree/master/MPI-IO
Rank 0 should write at 0
bgq-fen1-$ grep "bytes for storage" SRA056234-Picea-glauca-2013-04-09-2.bgq-fen1.1860.out | grep "Rank 0 "
Rank 0 requires 6190252 bytes for storage.
bgq-fen1-$ grep "appending its fusions at" SRA056234-Picea-glauca-2013-04-09-2.bgq-fen1.1860.out | grep "Rank 0 "
Rank 0 is appending its fusions at 25787910311
bgq-fen1-$ grep "appending its fusions at" SRA056234-Picea-glauca-2013-04-09-2.bgq-fen1.1860.out | grep "Rank 1 "
Rank 1 is appending its fusions at 25056587131
Code path:
The byte count computed by Rank 80 is correct; the offset used by Rank 80 is incorrect.
Hypotheses:
This is running on bgq in Toronto (1024 nodes, 4096 MPI ranks)
bgq-fen1-$ pwd
/home/c/clumeq/sebhtml/scratch/projects/white-spruce
bgq-fen1-$ tail -f SRA056234-Picea-glauca-2013-05-11-3.bgq-fen1.2267.out
Debug messages will tell me what's going on in there.
[DEBUG] Rank 12 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25126581497
[DEBUG] Rank 20 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25177622657
[DEBUG] Rank 16 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25152034066
[DEBUG] Rank 4 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25075830654
[DEBUG] Rank 128 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25120183417
[DEBUG] Rank 140 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25198049872
[DEBUG] Rank 144 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25223356931
[DEBUG] Rank 148 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25248030487
[DEBUG] Rank 64 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25455005682
[DEBUG] Rank 80 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25555250262
The values are the same across jobs.
bgq-fen1-$ grep appending SRA056234-Picea-glauca-2013-04-09-2.bgq-fen1.1860.out | grep "Rank 80 "
Rank 80 is appending its fusions at 25555250262
bgq-fen1-$ grep appending SRA056234-Picea-glauca-2013-05-11-3.bgq-fen1.2267.out | grep "Rank 80 "
[DEBUG] Rank 80 is appending its fusions at 25555250262
This is a case of buffer reuse:
bgq-fen1-$ grep 1ec8acd0a0 SRA056234-Picea-glauca-2013-05-12-4.bgq-fen1.2269.out.debug | head
[DEBUG] Rank 0 sending offset 0 to rank 0 buffer from RingAllocator: 0x1ec8acd0a0 @0 0
[DEBUG] Rank 0 sending offset 736187552 to rank 117 buffer from RingAllocator: 0x1ec8acd0a0 @0 736187552
[DEBUG] Rank 0 sending offset 1474093978 to rank 234 buffer from RingAllocator: 0x1ec8acd0a0 @0 1474093978
[DEBUG] Rank 0 sending offset 2208207676 to rank 351 buffer from RingAllocator: 0x1ec8acd0a0 @0 2208207676
[DEBUG] Rank 0 sending offset 2946652749 to rank 468 buffer from RingAllocator: 0x1ec8acd0a0 @0 2946652749
[DEBUG] Rank 0 sending offset 3681973408 to rank 585 buffer from RingAllocator: 0x1ec8acd0a0 @0 3681973408
[DEBUG] Rank 0 sending offset 4419845224 to rank 702 buffer from RingAllocator: 0x1ec8acd0a0 @0 4419845224
[DEBUG] Rank 0 sending offset 5154628433 to rank 819 buffer from RingAllocator: 0x1ec8acd0a0 @0 5154628433
[DEBUG] Rank 0 sending offset 5895218040 to rank 936 buffer from RingAllocator: 0x1ec8acd0a0 @0 5895218040
[DEBUG] Rank 0 sending offset 6633204076 to rank 1053 buffer from RingAllocator: 0x1ec8acd0a0 @0 6633204076
// set the number of buffers to use
int minimumNumberOfBuffers=128;
So the messages must be sent in more than one pass.
Testing on
$ msub Ray-polytope-512-Roach.sh
10290413
$ pwd
/rap/ihv-653-aa/assemblies
Probably fixed by this one:
https://github.com/sebhtml/RayPlatform/commit/d78e7ec5037c9c9e8a08160cb83864bbe67f658c