ufs-community / ufs-weather-model

UFS Weather Model

Build jobs on Gaea are running under windfall queue #2217

Closed DusanJovic-NOAA closed 6 months ago

DusanJovic-NOAA commented 7 months ago

On Gaea, all regression test build (compile) jobs are running on login nodes under the windfall queue:

     JOBID    QOS PARTITION             NAME       USER ST       TIME  NODES   CPUS           START_TIME NODELIST(REASON)
 134863078 normal     batch rt_123922_contro Dusan.Jovi  R      20:18      2    256  2024-04-01T11:16:32 c5n[1526,1528]
 134863133 normal     batch rt_123922_cpld_c Dusan.Jovi  R      10:13      2    256  2024-04-01T11:26:37 c5n[0042,0101]
 134863130 normal     batch rt_123922_cpld_d Dusan.Jovi  R      11:32      3    384  2024-04-01T11:25:18 c5n[0033,0036-0037]
 134863167 normal     batch rt_123922_rap_ci Dusan.Jovi  R       1:43      2    256  2024-04-01T11:35:07 c5n[1496-1497]
 134863168 normal     batch rt_123922_rap_un Dusan.Jovi  R       1:43      2    256  2024-04-01T11:35:07 c5n[1524-1525]
 134863166 normal     batch rt_123922_rap_di Dusan.Jovi  R       1:53      2    256  2024-04-01T11:34:57 c5n[1494-1495]
 134863165 normal     batch rt_123922_rap_un Dusan.Jovi  R       2:17      2    256  2024-04-01T11:34:33 c5n[1438,1463]
 134863164 normal     batch rt_123922_hrrr_c Dusan.Jovi  R       2:20      2    256  2024-04-01T11:34:30 c5n[1436-1437]
 134863163 normal     batch rt_123922_hrrr_g Dusan.Jovi  R       2:40      2    256  2024-04-01T11:34:10 c5n[0116,0123]
 134863162 normal     batch rt_123922_rap_sf Dusan.Jovi  R       3:10      2    256  2024-04-01T11:33:40 c5n[0114-0115]
 134863161 normal     batch rt_123922_hrrr_c Dusan.Jovi  R       3:13      2    256  2024-04-01T11:33:37 c5n[0109-0110]
 134863160 normal     batch rt_123922_rap_co Dusan.Jovi  R       3:20      2    256  2024-04-01T11:33:30 c5n[0414-0415]
 134863159 normal     batch rt_123922_region Dusan.Jovi  R       3:37      1    128  2024-04-01T11:33:13 c5n0420
 134863177 normal     batch rt_123922_rap_fl Dusan.Jovi  R       0:30      2    256  2024-04-01T11:36:20 c5n[0030,0032]
 134863176 normal     batch rt_123922_rap_cl Dusan.Jovi  R       0:40      2    256  2024-04-01T11:36:10 c5n[0694-0695]
 134863175 normal     batch rt_123922_rrfs_v Dusan.Jovi  R       0:47      2    256  2024-04-01T11:36:03 c5n[0102,0122]
 134863174 normal     batch rt_123922_rap_no Dusan.Jovi  R       0:50      2    256  2024-04-01T11:36:00 c5n[0005,0021]
 134863173 normal     batch rt_123922_rap_sf Dusan.Jovi  R       1:01      2    256  2024-04-01T11:35:49 c5n[1707,1711]
 134863171 normal     batch rt_123922_rap_no Dusan.Jovi  R       1:09      2    256  2024-04-01T11:35:41 c5n[1535-1536]
 134863170 normal     batch rt_123922_rap_pr Dusan.Jovi  R       1:13      2    256  2024-04-01T11:35:37 c5n[0022,0026]
 134863169 normal     batch rt_123922_rap_ln Dusan.Jovi  R       1:32      2    256  2024-04-01T11:35:18 c5n[1531,1534]
  87774257 windfa eslogin_c compile_datm_cde Dusan.Jovi  R       2:09      1      8  2024-04-01T11:34:41 gaea56
  87774151 windfa eslogin_c compile_datm_cde Dusan.Jovi  R       4:03      1      8  2024-04-01T11:32:47 gaea58
  87774127 windfa eslogin_c compile_datm_cde Dusan.Jovi  R       5:58      1      8  2024-04-01T11:30:52 gaea53
  87774009 windfa eslogin_c compile_hafs_all Dusan.Jovi  R       8:06      1      8  2024-04-01T11:28:44 gaea54
  87774006 windfa eslogin_c compile_hafs_mom Dusan.Jovi  R       8:14      1      8  2024-04-01T11:28:36 gaea53
  87773992 windfa eslogin_c compile_hafsw_fa Dusan.Jovi  R       9:58      1      8  2024-04-01T11:26:52 gaea56
  87773652 windfa eslogin_c compile_hafsw_de Dusan.Jovi  R      12:07      1      8  2024-04-01T11:24:43 gaea55
  87773418 windfa eslogin_c compile_hafsw_in Dusan.Jovi  R      14:00      1      8  2024-04-01T11:22:50 gaea57
  87773008 windfa eslogin_c compile_rrfs_dyn Dusan.Jovi  R      17:00      1      8  2024-04-01T11:19:50 gaea52
  87772403 windfa eslogin_c compile_s2swa_de Dusan.Jovi  R      58:39      1      8  2024-04-01T10:38:11 gaea52

Is this intended?

zach1221 commented 7 months ago

Hi @DusanJovic-NOAA. My RT builds are running under the normal partition. I'm using the epic account, so it may have something to do with the account you're using (nggps_emc, I assume?).

DusanJovic-NOAA commented 6 months ago

Maybe it's because of the account I'm using; I'm not sure. But I think the queue should be set explicitly in the compile job card template. Right now it is commented out:

$ grep QUEUE compile_slurm.IN_gaea
##SBATCH --qos=@[QUEUE]

I do not know why it is commented out in the compile job card template but not in the run job card.
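
For context, sbatch only acts on lines that begin with exactly `#SBATCH`; a `##SBATCH` line is an ordinary comment, so no QOS is requested and presumably the default for the account applies. A minimal sketch for checking which directives in a job card are live, assuming a rendered job card file is at hand (the helper names are hypothetical and not part of rt.sh):

```shell
#!/bin/bash
# Hypothetical helpers (not part of the UFS test scripts).

# Print directives sbatch will honor: lines starting with exactly "#SBATCH".
active_directives() {
  grep -E '^#SBATCH([[:space:]]|$)' "$1"
}

# Print directives that are commented out with one or more extra "#".
disabled_directives() {
  grep -E '^##+SBATCH([[:space:]]|$)' "$1"
}
```

Calling `disabled_directives` on the compile job card would list the `--qos` line, while `active_directives` would not.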

zach1221 commented 6 months ago

Ok, I will try setting it explicitly to "normal" in the compile job card and see how it goes when using nggps_emc.

zach1221 commented 6 months ago

@DusanJovic-NOAA If I set qos to normal explicitly in compile_slurm.IN_gaea, the build fails with an invalid qos error when I'm using nggps_emc. With epic it works fine. Something about nggps_emc requires builds to use the windfall qos. I've reached out to the Gaea admins for insight.
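
An invalid qos error generally means the account's association on that cluster and partition does not include the requested QOS. One way to check is sketched below, assuming `sacctmgr` can be queried by regular users on Gaea (the function name is hypothetical):

```shell
# Hypothetical helper: list the QOS values that each of a user's account
# associations may use, per cluster and partition.
show_my_qos() {
  local user="${1:-$USER}"
  sacctmgr --noheader --parsable2 show association \
    user="$user" format=Cluster,Account,Partition,QOS,DefaultQOS
}
```

Comparing the rows for nggps_emc and epic should show whether `normal` is allowed on the es cluster for each account.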

zach1221 commented 6 months ago

@DusanJovic-NOAA I have not received a response from the Gaea admins yet, but I managed to get nggps_emc to compile in the 'normal' queue by setting cluster=c5 and partition=batch in compile_slurm.IN_gaea. My guess is that some account setting makes nggps_emc use windfall when clusters and partition are set to es and eslogin_c5, respectively. I'm not certain whether setting these to c5/batch in compile_slurm.IN_gaea would have broader implications.
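
For reference, the directive combination that worked would look roughly like this in a rendered job card. This is only a sketch: the actual template uses `@[...]` placeholders, and the line order and remaining directives in compile_slurm.IN_gaea are not reproduced here.

```shell
#!/bin/bash
# Sketch of a rendered compile job card header (assumed layout).
#SBATCH --account=nggps_emc
#SBATCH --clusters=c5
#SBATCH --partition=batch
#SBATCH --qos=normal
```

With these settings the compile job is scheduled on c5 compute nodes rather than on the eslogin_c5 login nodes.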

zach1221 commented 6 months ago

Unless there are any protests, I will close this issue as I don't think there are any code changes to be made here.

DusanJovic-NOAA commented 6 months ago

So will this line:

https://github.com/ufs-community/ufs-weather-model/blob/b6c576d71b1bcfa8801e06faaa43dd970d62a471/tests/fv3_conf/compile_slurm.IN_gaea#L5

still be commented out?

zach1221 commented 6 months ago

If the queue line is not commented out, compilation will fail when using nggps_emc, because the queue will default to 'normal' as set by rt.sh for gaea, and it doesn't look like nggps_emc has access to the normal queue on the es cluster/eslogin_c5 partition combination. So the cluster and partition would also need to be changed to c5/batch.
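
To illustrate the failure mode: the job card templates carry `@[QUEUE]` placeholders that are filled in when the card is generated. The sketch below uses plain `sed` as a stand-in for the test system's own template expansion (an assumption; `render_queue` is a hypothetical name) to show what uncommenting the line would produce.

```shell
# Stand-in for template expansion: replace @[QUEUE] with the given value.
render_queue() {
  local queue="$1"
  sed "s/@\[QUEUE\]/${queue}/g"
}

# Commented-out line: still a comment after expansion, so sbatch ignores it.
echo '##SBATCH --qos=@[QUEUE]' | render_queue normal
# Uncommented line: becomes a live request for the "normal" QOS.
echo '#SBATCH --qos=@[QUEUE]' | render_queue normal
```

In the second case sbatch would request `normal` explicitly, which is the request that gets rejected for nggps_emc on es/eslogin_c5.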

DusanJovic-NOAA commented 6 months ago

If sysadmins are okay with us running compile jobs on login nodes then fine.

zach1221 commented 6 months ago

@DusanJovic-NOAA Response from the Gaea admins: "The login nodes and/or the eslogin_c5 partition are intended for compilation, and the maximum number of nodes allowed per job is one. You should be fine continuing to compile on the login nodes."