huttered40 / capital

Distributed-memory implementations of novel Cholesky and QR matrix factorizations
BSD 2-Clause "Simplified" License

Runtime error on Stampede2 #16

Closed huttered40 closed 4 years ago

huttered40 commented 5 years ago

For variant m=524288, n=2048, c=16, ppn=64, tpr=1, the run essentially hangs on both 64 nodes and 128 nodes. On 64 nodes it first prints a strange error.

On 64 nodes, I get the following error statements written to the .e file:

402-002.stampede2.tacc.utexas.edu.2596Received eager message(s) ptype=0x1 opcode=0xc9 from an unknown process (err=49)
c401-091.stampede2.tacc.utexas.edu.152146Received eager message(s) ptype=0x1 opcode=0xc9 from an unknown process (err=49)
c416-003.stampede2.tacc.utexas.edu.170810Received eager message(s) ptype=0x1 opcode=0xc9 from an unknown process (err=49)
c402-002.stampede2.tacc.utexas.edu.2606Received eager message(s) ptype=0x1 opcode=0xc9 from an unknown process (err=49)
c416-041.stampede2.tacc.utexas.edu.152331Received eager message(s) ptype=0x1 opcode=0xc9 from an unknown process (err=49)
c409-084.stampede2.tacc.utexas.edu.259280Received eager message(s) ptype=0x1 opcode=0xc9 from an unknown process (err=49)
c403-124.stampede2.tacc.utexas.edu.79032Received eager message(s) ptype=0x1 opcode=0xc9 from an unknown process (err=49)
c412-094.stampede2.tacc.utexas.edu.196940Received eager message(s) ptype=0x1 opcode=0xc9 from an unknown process (err=49)

On 128 nodes, it just hangs with no error message.

The corresponding script files are the following:

For 64 nodes:

#!/bin/bash
#SBATCH -J commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1_64nodes_64ppn_1tpr
#SBATCH -o commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1_64nodes_64ppn_1tpr.o
#SBATCH -e commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1_64nodes_64ppn_1tpr.e
#SBATCH -p normal
#SBATCH -N 64
#SBATCH -n 4096
#SBATCH -t 00:15:00
export MKL_NUM_THREADS=1
ibrun /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/bin/camfs_cacqr2 524288 2048 2 0 3 /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/data/camfs_cacqr2+0+524288+2048+2+0+3+1+64+1+64+critter
ibrun /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/bin/camfs_cacqr2 524288 2048 4 0 3 /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/data/camfs_cacqr2+0+524288+2048+4+0+3+1+64+1+64+critter
ibrun /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/bin/camfs_cacqr2 524288 2048 8 0 3 /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/data/camfs_cacqr2+0+524288+2048+8+0+3+1+64+1+64+critter
ibrun /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/bin/camfs_cacqr2 524288 2048 16 0 3 /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/data/camfs_cacqr2+0+524288+2048+16+0+3+1+64+1+64+critter

and for 128 nodes:

#!/bin/bash
#SBATCH -J commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1_128nodes_64ppn_1tpr
#SBATCH -o commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1_128nodes_64ppn_1tpr.o
#SBATCH -e commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1_128nodes_64ppn_1tpr.e
#SBATCH -p normal
#SBATCH -N 128
#SBATCH -n 8192
#SBATCH -t 00:15:00
export MKL_NUM_THREADS=1
ibrun /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/bin/camfs_cacqr2 524288 2048 2 0 3 /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/data/camfs_cacqr2+0+524288+2048+2+0+3+1+64+1+128+critter
ibrun /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/bin/camfs_cacqr2 524288 2048 4 0 3 /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/data/camfs_cacqr2+0+524288+2048+4+0+3+1+64+1+128+critter
ibrun /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/bin/camfs_cacqr2 524288 2048 8 0 3 /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/data/camfs_cacqr2+0+524288+2048+8+0+3+1+64+1+128+critter
ibrun /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/bin/camfs_cacqr2 524288 2048 16 0 3 /scratch/05608/tg849075/commcost_vs_ppn_Oct-20-0959PM-2019_STAMPEDE2_round1/data/camfs_cacqr2+0+524288+2048+16+0+3+1+64+1+128+critter

huttered40 commented 5 years ago

The memory footprint for 64 nodes was at least (mn/dc) * 64 = 268435456 elements (for a single matrix). This is the same memory footprint as for c=8 on 32 nodes, which makes me think it's not running out of memory (although it is close).
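
For reference, the arithmetic above can be checked with a small sketch. It rests on assumptions not spelled out in this thread: that the process grid is d x c x c (so d = nodes * ppn / c^2), that the factor of 64 in the formula is ppn (giving a per-node count), and that entries are 8-byte doubles. The function names are hypothetical helpers, not part of the repository.

```bash
#!/bin/bash
# Per-node element count for one m x n matrix, following (mn/dc) * ppn.
# Assumption: the process grid is d x c x c, so d = (nodes * ppn) / c^2.
per_node_elems() {
  local m=$1 n=$2 c=$3 nodes=$4 ppn=$5
  local procs=$(( nodes * ppn ))
  local d=$(( procs / (c * c) ))
  echo $(( m * n / (d * c) * ppn ))
}

# Assumption: 8-byte double-precision entries; result in whole GiB.
elems_to_gib() {
  echo $(( $1 * 8 / 1024 / 1024 / 1024 ))
}

per_node_elems 524288 2048 16 64 64   # c=16 on 64 nodes: prints 268435456
per_node_elems 524288 2048 8 32 64    # c=8 on 32 nodes:  prints 268435456
elems_to_gib 268435456                # prints 2
```

Under those assumptions both configurations hold the same 268435456 elements per node, roughly 2 GiB per matrix, which matches the comparison made above.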

Haven't I seen this before, where using more memory per node causes problems only at higher node counts? I'm not really sure why that would be the case, though.

huttered40 commented 4 years ago

I'm almost certain this is exactly the same bug as #23.