sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing
http://denovoassembler.sf.net

Ray with MPI_IO=y produces a file with all bytes valued 0 on a Blue Gene/Q #175

Closed sebhtml closed 11 years ago

sebhtml commented 11 years ago

/home/c/clumeq/sebhtml/scratch/projects/white-spruce

bgq-fen1-$ cat SRA056234-Picea-glauca.sh
#!/bin/sh
# @ job_name           = SRA056234-Picea-glauca-2013-04-09-1
# @ job_type           = bluegene
# @ comment            = ""
# @ output             = $(job_name).$(Host).$(jobid).out
# @ error              = $(job_name).$(Host).$(jobid).err
# @ bg_size            = 1024
# @ wall_clock_limit   = 48:00:00
# @ bg_connectivity    = Torus
# @ queue 

# memory
#1024 * 16 = 16384 GiB

# cores
#  512 * 16 =  8192
#1024 * 16 = 16384
#  768 * 16 = 12288
#1024 * 16 = 16384

# the BGLOCKLESSMPIO_F_TYPE line is to tell MPIIO that we are using GPFS
# 0x47504653 is  GPFS_SUPER_MAGIC

runjob --np 4096 --ranks-per-node=4 --cwd=$PWD \
 --envs BGLOCKLESSMPIO_F_TYPE=0x47504653 \
: /home/c/clumeq/sebhtml/software/ray/Last-Build/Ray SRA056234-Picea-glauca.conf
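For context, this is roughly what the MPI_IO=y code path amounts to at the application level; the BGLOCKLESSMPIO_F_TYPE hint only changes how the MPI-IO layer detects the filesystem. A minimal sketch (not Ray's actual code), where each rank writes its own block at its own 64-bit offset in a shared file:

    // Minimal MPI-IO sketch (not Ray's code): every rank writes a block at
    // its own offset in one shared file, which is what MPI_IO=y enables.
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Hypothetical payload: 1 KiB of data per rank.
        std::vector<char> payload(1024, (char)('A' + rank % 26));

        char fileName[] = "Test-output.bin";
        MPI_File file;
        MPI_File_open(MPI_COMM_WORLD, fileName,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &file);

        // Offsets must stay 64-bit (MPI_Offset); a 32-bit intermediate breaks
        // once the file grows past a few GiB.
        MPI_Offset offset = (MPI_Offset)rank * (MPI_Offset)payload.size();
        MPI_File_write_at(file, offset, payload.data(), (int)payload.size(),
                          MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&file);
        MPI_Finalize();
        return 0;
    }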
sebhtml commented 11 years ago

A test with MPI I/O on 4096 MPI ranks works, however:

/home/c/clumeq/sebhtml/software/Ray-on-IBM-Blue-Gene-Q/MPI-IO


#!/bin/sh
# @ job_name           = Test-MPI-IO-1024nodes,4ranksPerNode,4096ranks-2013-03-25-1
# @ job_type           = bluegene
# @ comment            = ""
# @ output             = $(job_name).$(Host).$(jobid).out
# @ error              = $(job_name).$(Host).$(jobid).err
# @ bg_size            = 1024
# @ wall_clock_limit   = 00:10:00
# @ bg_connectivity    = Torus
# @ queue 

# the BGLOCKLESSMPIO_F_TYPE line is to tell MPIIO that we are using GPFS
# 0x47504653 is  GPFS_SUPER_MAGIC

runjob --np 4096 --ranks-per-node=4 --cwd=$PWD \
 --envs BGLOCKLESSMPIO_F_TYPE=0x47504653 \
: /home/c/clumeq/sebhtml/software/Ray-on-IBM-Blue-Gene-Q/MPI-IO/Test-MPI-IO
sebhtml commented 11 years ago

MPI_IO=y works well elsewhere too.

Another job submitted.

bgq-fen1-$ llq -b
Id                  Owner   Submitted  LL JS BS Block Size
bgq-fen1-ib0.1854.0 sebhtml 4/9 07:46  I

1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted

sebhtml commented 11 years ago

Evaluation: 7 human-hours

sebhtml commented 11 years ago

Hello,

This morning, I ran a separate test on bgq-fen1 with the program Test-MPI-IO [1].

The program itself: https://github.com/sebhtml/Ray-on-IBM-Blue-Gene-Q/blob/master/MPI-IO/Test-MPI-IO.cpp

The command: https://github.com/sebhtml/Ray-on-IBM-Blue-Gene-Q/blob/master/MPI-IO/Test-1024-nodes-4096-ranks.sh

It ran fine (with a ~20 GiB output file).

So clearly the problem is in Ray.

I then dug into the log of the job SRA056234-Picea-glauca-2013-04-09-2.bgq-fen1.1860.out and found that the offsets are buggy, probably because of an overflow (the white spruce is my first assembly that exceeds 2^32 - 1 bytes in size!).

Here is what I found:

bgq-fen1-$ grep "is appending its fusions" SRA056234-Picea-glauca-2013-04-09-2.bgq-fen1.1860.out | head Rank 3840 is appending its fusions at 25656841821 Rank 1448 is appending its fusions at 25329147991 Rank 3508 is appending its fusions at 25775185315 Rank 692 is appending its fusions at 25725624865 Rank 80 is appending its fusions at 25555250262 Rank 3796 is appending its fusions at 25379490566 Rank 780 is appending its fusions at 25542351662 Rank 104 is appending its fusions at 25706873263 Rank 2212 is appending its fusions at 25719291698 Rank 552 is appending its fusions at 25580953486

It just makes no sense that Rank 80 starts at 25555250262 (about 25 GB): the assembly size is around 18-20 GB, and the offset of Rank 80 is expected to be around 80/4096 * 20 GB, roughly 0.4 GB.

I'll fix this and let you know how it goes.
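To make the overflow hypothesis concrete (a hypothetical illustration, not Ray's actual code): any 32-bit variable in the offset arithmetic wraps around once the running total passes 2^32 - 1 (about 4 GiB), which an 18-20 GB assembly easily exceeds.

    // Hypothetical illustration of the overflow hypothesis (not Ray's code):
    // accumulating per-rank byte counts in a 32-bit integer wraps past 2^32 - 1.
    #include <cstdint>
    #include <cstdio>

    int main() {
        const int ranks = 4096;
        const uint32_t bytesPerRank = 6190252;  // roughly what rank 0 reported

        uint32_t badTotal = 0;   // 32 bits: wraps around silently
        uint64_t goodTotal = 0;  // 64 bits: correct

        for (int i = 0; i < ranks; ++i) {
            badTotal += bytesPerRank;
            goodTotal += bytesPerRank;
        }

        // ~25 GB of cumulative offsets cannot be represented in 32 bits.
        printf("32-bit total: %u\n", badTotal);
        printf("64-bit total: %llu\n", (unsigned long long)goodTotal);
        return 0;
    }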

Also, an intern on our team has greatly improved the checkpointing code: write operations are now grouped in chunks of 16 MiB instead of just a few bytes.
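A minimal sketch of that kind of aggregation (class and method names are assumptions, not the intern's actual code): small writes are staged in memory and flushed with a single MPI-IO call once 16 MiB have accumulated.

    // Sketch of 16 MiB write aggregation (assumed names, not Ray's actual code).
    #include <mpi.h>
    #include <vector>

    class BufferedWriter {
        static const size_t CHUNK_SIZE = 16 * 1024 * 1024;  // 16 MiB
        MPI_File m_file;
        MPI_Offset m_offset;          // where the next flush lands in the file
        std::vector<char> m_buffer;
    public:
        BufferedWriter(MPI_File file, MPI_Offset startOffset)
            : m_file(file), m_offset(startOffset) {
            m_buffer.reserve(CHUNK_SIZE);
        }

        // Stage a small write; only flush once 16 MiB have accumulated.
        void write(const void* data, size_t bytes) {
            const char* p = static_cast<const char*>(data);
            m_buffer.insert(m_buffer.end(), p, p + bytes);
            if (m_buffer.size() >= CHUNK_SIZE)
                flush();
        }

        // Write the staged bytes in one MPI-IO call and advance the offset.
        void flush() {
            if (m_buffer.empty())
                return;
            MPI_File_write_at(m_file, m_offset, m_buffer.data(),
                              (int)m_buffer.size(), MPI_BYTE, MPI_STATUS_IGNORE);
            m_offset += (MPI_Offset)m_buffer.size();
            m_buffer.clear();
        }
    };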


[1] https://github.com/sebhtml/Ray-on-IBM-Blue-Gene-Q/tree/master/MPI-IO

sebhtml commented 11 years ago

Rank 0 should write at 0

bgq-fen1-$ grep "bytes for storage" SRA056234-Picea-glauca-2013-04-09-2.bgq-fen1.1860.out | grep "Rank 0 " Rank 0 requires 6190252 bytes for storage.

bgq-fen1-$ grep "appending its fusions at" SRA056234-Picea-glauca-2013-04-09-2.bgq-fen1.1860.out |grep "Rank 0 " Rank 0 is appending its fusions at 25787910311

bgq-fen1-$ grep "appending its fusions at" SRA056234-Picea-glauca-2013-04-09-2.bgq-fen1.1860.out |grep "Rank 1 " Rank 1 is appending its fusions at 25056587131

sebhtml commented 11 years ago

Code path:

  1. rank 0 sends RAY_MPI_TAG_ASK_EXTENSION_DATA to rank 80 (the buffer includes the offset)
  2. rank 80 receives RAY_MPI_TAG_ASK_EXTENSION_DATA
  3. rank 80 sets its offset using the message
  4. rank 80 switches to RAY_SLAVE_MODE_SEND_EXTENSION_DATA (programmed behavior)

The byte count computed by rank 80 is correct. The offset used by rank 80 is incorrect.
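The same code path expressed as a self-contained plain-MPI sketch (Ray uses its own message layer; the tag value, buffer layout, and offsets below are assumptions, not the exact RayPlatform code):

    // Sketch of steps 1-3 above in plain MPI: the offset travels as the first
    // 64-bit word of the message buffer tagged RAY_MPI_TAG_ASK_EXTENSION_DATA.
    #include <mpi.h>
    #include <cstdint>
    #include <cstdio>

    const int RAY_MPI_TAG_ASK_EXTENSION_DATA = 42;  // placeholder tag value

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            // Step 1: rank 0 sends each destination its file offset in the buffer.
            for (int destination = 1; destination < size; ++destination) {
                uint64_t buffer[1] = { (uint64_t)destination * 6190252 };  // made-up offsets
                MPI_Send(buffer, 1, MPI_UINT64_T, destination,
                         RAY_MPI_TAG_ASK_EXTENSION_DATA, MPI_COMM_WORLD);
            }
        } else {
            // Steps 2-3: the slave receives the message and sets its offset
            // from the buffer; it would then switch to its "send extension
            // data" slave mode (step 4).
            uint64_t buffer[1];
            MPI_Recv(buffer, 1, MPI_UINT64_T, 0, RAY_MPI_TAG_ASK_EXTENSION_DATA,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            uint64_t offsetInFile = buffer[0];
            printf("Rank %d will append its fusions at %llu\n",
                   rank, (unsigned long long)offsetInFile);
        }

        MPI_Finalize();
        return 0;
    }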

Hypotheses:

  1. rank 80 does its slave mode before processing the message
  2. somehow the buffer does not contain the correct value.

This is running on bgq in Toronto (1024 nodes, 4096 MPI ranks)

bgq-fen1-$ pwd
/home/c/clumeq/sebhtml/scratch/projects/white-spruce
bgq-fen1-$ tail -f SRA056234-Picea-glauca-2013-05-11-3.bgq-fen1.2267.out

Debug messages will tell me what's going on in there.

sebhtml commented 11 years ago

[DEBUG] Rank 12 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25126581497
[DEBUG] Rank 20 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25177622657
[DEBUG] Rank 16 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25152034066
[DEBUG] Rank 4 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25075830654
[DEBUG] Rank 128 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25120183417
[DEBUG] Rank 140 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25198049872
[DEBUG] Rank 144 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25223356931
[DEBUG] Rank 148 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25248030487
[DEBUG] Rank 64 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25455005682
[DEBUG] Rank 80 call_RAY_MPI_TAG_ASK_EXTENSION_DATA received offset: 25555250262

sebhtml commented 11 years ago

The values are the same across jobs.

bgq-fen1-$ grep appending SRA056234-Picea-glauca-2013-04-09-2.bgq-fen1.1860.out | grep "Rank 80 "
Rank 80 is appending its fusions at 25555250262

bgq-fen1-$ grep appending SRA056234-Picea-glauca-2013-05-11-3.bgq-fen1.2267.out | grep "Rank 80 "
[DEBUG] Rank 80 is appending its fusions at 25555250262

sebhtml commented 11 years ago

This is a case of buffer reuse:

bgq-fen1-$ grep 1ec8acd0a0 SRA056234-Picea-glauca-2013-05-12-4.bgq-fen1.2269.out.debug | head
[DEBUG] Rank 0 sending offset 0 to rank 0 buffer from RingAllocator: 0x1ec8acd0a0 @0 0
[DEBUG] Rank 0 sending offset 736187552 to rank 117 buffer from RingAllocator: 0x1ec8acd0a0 @0 736187552
[DEBUG] Rank 0 sending offset 1474093978 to rank 234 buffer from RingAllocator: 0x1ec8acd0a0 @0 1474093978
[DEBUG] Rank 0 sending offset 2208207676 to rank 351 buffer from RingAllocator: 0x1ec8acd0a0 @0 2208207676
[DEBUG] Rank 0 sending offset 2946652749 to rank 468 buffer from RingAllocator: 0x1ec8acd0a0 @0 2946652749
[DEBUG] Rank 0 sending offset 3681973408 to rank 585 buffer from RingAllocator: 0x1ec8acd0a0 @0 3681973408
[DEBUG] Rank 0 sending offset 4419845224 to rank 702 buffer from RingAllocator: 0x1ec8acd0a0 @0 4419845224
[DEBUG] Rank 0 sending offset 5154628433 to rank 819 buffer from RingAllocator: 0x1ec8acd0a0 @0 5154628433
[DEBUG] Rank 0 sending offset 5895218040 to rank 936 buffer from RingAllocator: 0x1ec8acd0a0 @0 5895218040
[DEBUG] Rank 0 sending offset 6633204076 to rank 1053 buffer from RingAllocator: 0x1ec8acd0a0 @0 6633204076

sebhtml commented 11 years ago
    // set the number of buffers to use
    int minimumNumberOfBuffers=128;

With only 128 buffers for 4096 destinations, the messages must therefore be sent in more than one pass.
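To illustrate what reuse of a ring buffer can do (a minimal sketch, not RayPlatform's actual code): if a buffer is refilled with the next offset before the previous non-blocking send has completed, the receiver can observe a later rank's offset instead of its own.

    // Sketch of the buffer-reuse hazard (not RayPlatform's code): with far
    // fewer ring buffers than destinations, a buffer can be overwritten with
    // the next offset while an earlier MPI_Isend on it is still pending.
    #include <mpi.h>
    #include <cstdint>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int minimumNumberOfBuffers = 128;  // as in the snippet above
        std::vector<uint64_t> ring(minimumNumberOfBuffers);
        std::vector<MPI_Request> requests;

        if (rank == 0) {
            for (int destination = 1; destination < size; ++destination) {
                uint64_t* buffer = &ring[destination % minimumNumberOfBuffers];
                *buffer = (uint64_t)destination * 6190252;  // made-up offset

                MPI_Request request;
                // Hazard being illustrated: once there are more destinations
                // than buffers, this refills a buffer whose previous send may
                // still be in flight, unless the code waits between passes.
                MPI_Isend(buffer, 1, MPI_UINT64_T, destination, 0,
                          MPI_COMM_WORLD, &request);
                requests.push_back(request);
            }
            MPI_Waitall((int)requests.size(), requests.data(),
                        MPI_STATUSES_IGNORE);
        } else {
            uint64_t offset;
            MPI_Recv(&offset, 1, MPI_UINT64_T, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }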

Testing on

$ msub Ray-polytope-512-Roach.sh

10290413

$ pwd
/rap/ihv-653-aa/assemblies

sebhtml commented 11 years ago

Probably fixed by this one:

https://github.com/sebhtml/RayPlatform/commit/d78e7ec5037c9c9e8a08160cb83864bbe67f658c