ufs-community / ufs-weather-model

UFS Weather Model

inefficient use of resources on derecho #2177

Open DeniseWorthen opened 6 months ago

DeniseWorthen commented 6 months ago

Description

Derecho has TPN=128 but in most jobs we are vastly under-using these nodes. For example, the cpld_control_p8 test needs 200 tasks but the job card requests

#PBS -l select=3:ncpus=96:mpiprocs=96:ompthreads=1

On Gaea, the same test uses

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128

srun --label -n 256 ./fv3.exe

See related discussions here and here
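The under-use is easy to quantify with back-of-the-envelope arithmetic (this is an illustrative sketch, not code from the test harness): ceil-divide the task count by the requested tasks-per-node to get the node count, then compare against Derecho's 128 physical cores per node.

```shell
#!/bin/sh
# Illustrative only: node count and idle cores for a 200-task job on
# Derecho (128 physical cores per node), comparing tpn=96 vs tpn=128.
tasks=200
cores_per_node=128
for tpn in 96 128; do
  nodes=$(( (tasks + tpn - 1) / tpn ))        # ceil(tasks / tpn)
  idle=$(( nodes * cores_per_node - tasks ))  # physical cores left idle
  echo "tpn=$tpn nodes=$nodes idle_cores=$idle"
done
```

With tpn=96 the 200-task job spreads over 3 nodes and leaves 184 of 384 physical cores idle; with tpn=128 it fits on 2 nodes with only 56 idle cores.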

zach1221 commented 5 months ago

Hello, @DeniseWorthen. I tested changing the TPN to 128 for the cases below, and they all passed.

cpld_control_ciceC_p8, cpld_control_p8_faster_intel, cpld_control_p8_intel, cpld_control_p8_mixedmode_intel, cpld_control_qr_p8_intel, cpld_debug_p8_intel, cpld_decomp_p8_intel, cpld_mpi_p8_intel, cpld_restart_p8_intel, cpld_restart_qr_p8_intel, and merra2_thompson_intel.

The only one that fails with the new TPN is cpld_control_c192_p8_intel. I can update the job_cards for these cases to 128 TPN in one of the upcoming PRs.

DeniseWorthen commented 5 months ago

@zach1221 I really don't see why the C192 test would be any different than the other cases which do work. Can you point me to your run directory?

zach1221 commented 5 months ago

> @zach1221 I really don't see why the C192 test would be any different than the other cases which do work. Can you point me to your run directory?

Sure.

/glade/derecho/scratch/zshrader/FV3_RT/rt_39242/cpld_control_c192_p8_intel
/glade/work/zshrader/ufs-weather-model/tests

The run fails with:

FATAL from PE 285: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic

DeniseWorthen commented 5 months ago

@zach1221 I don't think the resources are set right in your job card? I changed to ppn 128 and it ran fine. How did you change the other job cards?

38c38
< mpiexec -n 788 -ppn 128 --hostfile $PBS_NODEFILE ./fv3.exe
---
> mpiexec -n 788 -ppn 113 --hostfile $PBS_NODEFILE ./fv3.exe
zach1221 commented 5 months ago

@DeniseWorthen ok, let me try again. I set it to 128 in the tests/tests/cpld_control_c192_p8 file, and the generated job_card file appeared to have TPN set to 128.

DusanJovic-NOAA commented 5 months ago

See my question here:

https://github.com/ufs-community/ufs-weather-model/pull/1836#discussion_r1408387822

I never understood why PPN was introduced or why it is needed. PPN is recomputed here:

https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/run_test.sh#L315-L319

and it is only used in derecho job_card template. And I still do not understand why it was needed only on derecho.
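One plausible reading of that recomputation (an assumption on my part, not copied from run_test.sh) is: round the task count up to a node count, then redistribute the tasks evenly across those nodes. For the 788-task C192 case with TPN=128, that arithmetic reproduces exactly the -ppn 113 seen in the failing job card:

```shell
#!/bin/sh
# Hypothetical sketch of how PPN=113 can arise for 788 tasks with TPN=128.
# Variable names are illustrative, not taken from run_test.sh.
TASKS=788
TPN=128
NODES=$(( (TASKS + TPN - 1) / TPN ))     # ceil(788/128) = 7 nodes
PPN=$(( (TASKS + NODES - 1) / NODES ))   # ceil(788/7)   = 113 tasks/node
echo "NODES=$NODES PPN=$PPN"
```

If this is what the script does, the recomputed PPN merely balances the load across nodes; it should not change which PEs exist, which makes the mpp_domains pelist failure at PPN=113 all the more puzzling.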

zach1221 commented 5 months ago

> @zach1221 I don't think the resources are set right in your job card? I changed to ppn 128 and it ran fine. How did you change the other job cards?
>
> 38c38
> < mpiexec -n 788 -ppn 128 --hostfile $PBS_NODEFILE ./fv3.exe
> ---
> > mpiexec -n 788 -ppn 113 --hostfile $PBS_NODEFILE ./fv3.exe

You were right. It passes when included directly on the mpiexec command line. I'll rerun rt.conf again to ensure it doesn't cause any other issues.

DeniseWorthen commented 4 months ago

What is the status of a fix for this issue? It seems pretty straightforward.

zach1221 commented 4 months ago

Ok, so I re-confirmed that setting the ppn to 128 explicitly does allow the cpld_control_c192_p8_intel case to pass, but a host of other cpld, hafs_regional, and atmaero cases then fail by consistently exceeding their wall-clock limits. I increased the limits to as much as 2 hours, but they continue to fail, so we'll likely need an alternative. I set the ppn in the job_card, so I'll see how changing it in default_vars performs instead.