huttered40 / capital

Distributed-memory implementations of novel Cholesky and QR matrix factorizations
BSD 2-Clause "Simplified" License

Runtime error on Stampede2 #12

Closed huttered40 closed 5 years ago

huttered40 commented 5 years ago

Runtime error on Stampede2 with 256 nodes, 64 processes per node (ppn), and 1 thread per rank. The batch script that failed, including its parameters:

#!/bin/bash
#SBATCH -J commcost_vs_ppn_Sep-23-0339PM-2019_STAMPEDE2_round1_256nodes_64ppn_1tpr
#SBATCH -o commcost_vs_ppn_Sep-23-0339PM-2019_STAMPEDE2_round1_256nodes_64ppn_1tpr.o
#SBATCH -e commcost_vs_ppn_Sep-23-0339PM-2019_STAMPEDE2_round1_256nodes_64ppn_1tpr.e
#SBATCH -p normal
#SBATCH -N 256
#SBATCH -n 16384
#SBATCH -t 04:00:00
export MKL_NUM_THREADS=1
ibrun /scratch/05608/tg849075/commcost_vs_ppn_Sep-23-0339PM-2019_STAMPEDE2_round1/bin/camfs_cacqr2 524288 2048 1 0 0 0 3 /scratch/05608/tg849075/commcost_vs_ppn_Sep-23-0339PM-2019_STAMPEDE2_round1/DataFiles/camfs_cacqr2+0+524288+2048+1+0+0+0+3+1+64+1+256+critter
ibrun /scratch/05608/tg849075/commcost_vs_ppn_Sep-23-0339PM-2019_STAMPEDE2_round1/bin/camfs_cacqr2 524288 2048 2 0 0 0 3 /scratch/05608/tg849075/commcost_vs_ppn_Sep-23-0339PM-2019_STAMPEDE2_round1/DataFiles/camfs_cacqr2+0+524288+2048+2+0+0+0+3+1+64+1+256+critter
ibrun /scratch/05608/tg849075/commcost_vs_ppn_Sep-23-0339PM-2019_STAMPEDE2_round1/bin/camfs_cacqr2 524288 2048 4 0 0 0 3 /scratch/05608/tg849075/commcost_vs_ppn_Sep-23-0339PM-2019_STAMPEDE2_round1/DataFiles/camfs_cacqr2+0+524288+2048+4+0+0+0+3+1+64+1+256+critter
ibrun /scratch/05608/tg849075/commcost_vs_ppn_Sep-23-0339PM-2019_STAMPEDE2_round1/bin/camfs_cacqr2 524288 2048 8 0 0 0 3 /scratch/05608/tg849075/commcost_vs_ppn_Sep-23-0339PM-2019_STAMPEDE2_round1/DataFiles/camfs_cacqr2+0+524288+2048+8+0+0+0+3+1+64+1+256+critter

According to the output file, all four variants appear to have failed. Note that all other hardware configurations ran correctly, including 256 nodes with 8 ppn and 256 nodes with 1 ppn.
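Since the output file only suggests that all four variants failed, recording an exit status per variant would make this easier to confirm. The following is a minimal sketch, not part of the original batch script: `run_variant`, `LAUNCHER`, and `STATUS_LOG` are hypothetical names, and it assumes the launcher passes the application's exit status through, which has not been verified for `ibrun`.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): run one variant and
# append its exit status to a log, so a failing pDimC can be identified.
# LAUNCHER defaults to ibrun, as used in the batch script; it can be
# overridden to try the wrapper off-cluster.
LAUNCHER=${LAUNCHER:-ibrun}
STATUS_LOG=${STATUS_LOG:-variant_status.log}

run_variant() {
  # $1 = pDimC label used only for logging; remaining args = command line
  local pdimc=$1
  shift
  "$LAUNCHER" "$@"
  local rc=$?
  echo "pDimC=${pdimc} exit_status=${rc}" >> "$STATUS_LOG"
  return "$rc"
}

# Usage inside the batch script (binary and output path as in the original):
#   run_variant 1 <path to camfs_cacqr2> 524288 2048 1 0 0 0 3 <output file prefix>
```

Each `ibrun` line in the script would then be replaced by a `run_variant` call, leaving one line per variant in the status log.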

huttered40 commented 5 years ago

These variants did not fail when the environment variable CRITTER_STATUS=ON was set. Very strange.

huttered40 commented 5 years ago

No. These variants did fail with the environment variable set. This behavior has been replicated twice.

huttered40 commented 5 years ago

The local matrix sizes for the four variants (pDimC=1,2,4,8) are 32x2048, 128x1024, 1024x512, and 8192x256, respectively.

The memory footprint of pDimC=8 is 8192x256x64=134217728 matrix elements, which is high and may cause an out-of-memory error.

The memory footprint of pDimC=1 is 32x2048x64=4194304 matrix elements, which is very small. There is no reason why that variant should fail.
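The arithmetic above can be reproduced for all four variants with a short script. This is a sketch under two assumptions that the thread does not state: the factor of 64 is the number of ranks per node, and each element is an 8-byte double. It counts a single copy of the local matrix only, so the real footprint of a run would be larger.

```shell
#!/bin/bash
# Element counts for the local matrix sizes quoted above.
# Assumptions: 64 = ranks per node, 8 bytes per element (double precision).
ppn=64
bytes_per_elem=8

footprint() {
  # $1 = local rows, $2 = local columns; prints rows * cols * ppn
  echo $(( $1 * $2 * ppn ))
}

for dims in "32 2048" "128 1024" "1024 512" "8192 256"; do
  read -r rows cols <<< "$dims"
  elems=$(footprint "$rows" "$cols")
  mib=$(( elems * bytes_per_elem / 1024 / 1024 ))
  echo "${rows}x${cols}: ${elems} elements, ~${mib} MiB"
done
```

Under these assumptions the pDimC=8 case comes to about 1024 MiB for one copy of the local matrices, and the pDimC=1 case to about 32 MiB.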

huttered40 commented 5 years ago

I just launched a new critter job on Stampede2 with pDimC=1,2,4 only. I left out pDimC=8, which I fear may have been causing an out-of-memory error (I will need to investigate this), but the pDimC=1,2,4 variants at 256 nodes, 64 ppn should work.

huttered40 commented 5 years ago

Nothing failed with this job. Strange.

huttered40 commented 5 years ago

I launched this job again with only the pDimC=2,4,8 variants, and nothing failed. Weird.

I'll close this, but will stay alert for any other failures.