Closed huttered40 closed 5 years ago
These variants did not fail when the environment variable CRITTER_STATUS=ON was set. Very strange.
No. These variants did fail with the environment variable set. This behavior has been replicated twice.
For each of the 4 variants, the local matrix sizes are 32x2048, 128x1024, 1024x512, 8192x256.
The memory footprint of pDimC=8 is 8192x256x64=134217728, which is high and may cause an out-of-memory error.
The memory footprint of pDimC=1 is 32x2048x64=4194304, which is very small. There is no reason why that variant should fail.
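The footprint arithmetic above can be checked with a short sketch. It rests on two assumptions that the thread does not state: that the factor of 64 is the number of processes per node, and that each matrix element is an 8-byte double. The names `LOCAL_SIZES`, `PROCS_PER_NODE`, `BYTES_PER_ELEMENT`, and `per_node_elements` are hypothetical and do not come from critter or the benchmark code.

```python
# Local matrix sizes (rows, cols) for the four variants, as listed in the thread.
LOCAL_SIZES = [(32, 2048), (128, 1024), (1024, 512), (8192, 256)]

PROCS_PER_NODE = 64    # assumption: the factor of 64 is processes per node
BYTES_PER_ELEMENT = 8  # assumption: double-precision elements


def per_node_elements(rows, cols, ppn=PROCS_PER_NODE):
    """Element count for one local matrix replicated across ppn ranks."""
    return rows * cols * ppn


def footprints():
    """Return (rows, cols, elements, bytes) for each variant."""
    result = []
    for rows, cols in LOCAL_SIZES:
        elems = per_node_elements(rows, cols)
        result.append((rows, cols, elems, elems * BYTES_PER_ELEMENT))
    return result


if __name__ == "__main__":
    for rows, cols, elems, nbytes in footprints():
        mib = nbytes / 2**20
        print(f"{rows}x{cols}: {elems} elements, {mib:.0f} MiB per matrix per node")
```

Under these assumptions the largest variant is about 1 GiB per matrix per node and the smallest about 32 MiB, so a single matrix alone would not obviously exhaust node memory. The actual footprint depends on how many such matrices and buffers the algorithm allocates, which this sketch does not model.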
I just launched a new critter job on Stampede2 with pDimC=1,2,4. I left out pDimC=8, which I fear may have been causing an out-of-memory error (will need to investigate this), but the other variants, pDimC=1,2,4 at 256 nodes, 64 ppn, should work.
Nothing failed with this job. Strange.
I launched this job again with variants c=2,4,8 only, and nothing failed. Weird.
I'll close this, but will be alert for any other failures.
Runtime error on Stampede2 with 256 nodes, 64 processes per node, and 1 thread per rank. The parameters and the batch script that failed:
From the output file, all four variants seem to have failed. Note that all other hardware configurations, including 256 nodes with 8 ppn and with 1 ppn, ran correctly.