ufs-community / ufs-s2s-model

UFS sub-seasonal to seasonal forecast model. This repository was frozen in Oct 2020 and all development was moved to the ufs-weather-model repository.
GNU General Public License v3.0

cpld_fv3_ccpp_mom6_cice_cmeps_restart fails #164

Closed DeniseWorthen closed 3 years ago

DeniseWorthen commented 3 years ago

The current develop branch fails the cpld_fv3_ccpp_mom6_cice_cmeps_restart regression test on dell-p3.

This was first noted in the testing for the cmeps update (PR #163). In that branch, all the tests on Hera and Orion passed.

junwang-noaa commented 3 years ago

What is the error message for that test?

DeniseWorthen commented 3 years ago

It fails at the model startup, in ice_gather_scatter.

SMoorthi-emc commented 3 years ago

FYI, yesterday I was able to run from a restart on dell and it did reproduce the continuous run. This was without the fractional grid and wave.

DeniseWorthen commented 3 years ago

The issue is whether the baseline reproduces itself.

I know you have seen this error whereas I almost never see it. But you run the coupled model on Dell and none of the rest of us do (other than baselines). So I wonder whether the issue is the cice compiler macro on Dell for CICE5?

SMoorthi-emc commented 3 years ago

I did not understand your last comment. However, I do have a compile-related issue that I still cannot explain. When the script "comp_ice.backend.libcice" runs on dell, the variable "SITE" is not set correctly: it comes out as "MARS" or "VENUS" instead of "wcoss".
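
A minimal sketch of the kind of normalization that would avoid this symptom, assuming (hypothetically) that SITE arrives as a machine label and that the build expects "wcoss" on dell. The function name normalize_site is made up for illustration and is not part of comp_ice.backend.libcice:

```shell
# Hypothetical helper: map a raw machine label to the platform key a build expects.
# "MARS" and "VENUS" are the values reported above; "wcoss" is the expected one.
normalize_site() {
    # Compare case-insensitively so "MARS" and "mars" are treated alike.
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        mars|venus) echo "wcoss" ;;
        *) echo "$1" ;;   # leave any other label untouched
    esac
}
```

Something like `SITE=$(normalize_site "$SITE")` ahead of the compile step would be one way to try this; whether that is the right place to fix it depends on how the script sets SITE in the first place.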

SMoorthi-emc commented 3 years ago

I guess I mistook CICE for CICE6. My comment was about CICE6 on dell. Please ignore my "FYI".

DeniseWorthen commented 3 years ago

I've tested this again today (only the cold, 2d, 3d and restart tests) without using ecflow and I got the same ice_gather_scatter failure in the restart test. The model fails at startup:

[Screenshot: Screen Shot 2020-08-18 at 2 20 46 PM]

junwang-noaa commented 3 years ago

I checked out the s2s develop branch and ran the restart test without using ecflow on dell. It ran successfully. I forgot to save the run directory. I am running it again, maybe we can compare the data from my run and yours.

DeniseWorthen commented 3 years ago

Or our environments? Is there something I might not have set in my environment?

junwang-noaa commented 3 years ago

Not sure. We actually do a module purge in job_card, so the login environment should not have any impact. Anyway, you can check my /u/Jun.Wang/.bashrc.

DeniseWorthen commented 3 years ago

I tried again adding the following from your bashrc to mine; the restart test passed. I don't know what in this list below would have caused the difference.

if [ ! -z $MODULESHOME ]; then
    . $MODULESHOME/init/bash
else
    . /opt/modules/default/init/bash
fi

module load ips/18.0.1.163
module load impi/18.0.1
module load NetCDF/4.5.0
module load bacio/2.0.2
module load sfcio/1.0.0
module load lsf/10.1
module load nemsio/2.2.3
module load w3emc/2.3.0
module load sp/2.0.2
module load w3nco/2.0.6
module load impi/18.0.1
module load bufr/11.2.0
module load sigio/2.0.1
module load crtm/2.2.5

module load EnvVars/1.0.2
module load pm5/1.0
module load subversion/1.7.16
module load HPSS/5.0.2.5
module load mktgs/1.0
module load rsync/3.1.2
module load ip/3.0.1
module load prod_envir/1.0.2
module load grib_util/1.0.6

module use /gpfs/dell3/usrx/local/dev/emc_rocoto/modulefiles/
module load ruby/2.5.1 rocoto/1.2.4

module use -a /usrx/local/dev/modulefiles
module load git/2.14.3
module load cmake/3.10.0
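
Since it is unclear which of these lines made the difference, one option is to compare the two environments directly. This is a rough sketch, assuming each user's loaded modules have been saved to a file with one module name per line (for example from the terse form of `module list`, if the site's module command supports it). The helper name only_in_second and the file names are hypothetical:

```shell
# Hypothetical helper: print the modules listed in the second file but not the first.
# Each input file is assumed to hold one module name per line.
only_in_second() {
    a=$(mktemp)
    b=$(mktemp)
    # comm needs sorted input; sort -u also drops repeated entries.
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    # -1 hides lines unique to the first file, -3 hides lines common to both.
    comm -13 "$a" "$b"
    rm -f "$a" "$b"
}
```

Running `only_in_second mine.txt theirs.txt` would then narrow the list above to the modules that only the passing environment loads.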

junwang-noaa commented 3 years ago

Denise, is the restart test running for you? I have the restart test run directory on mars at:

/gpfs/dell2/ptmp/Jun.Wang/S2S_RT/rt_196272/cpld_fv3_ccpp_mom6_cice_cmeps_restart

if you'd like to take a look

DeniseWorthen commented 3 years ago

It ran and passed if I included the items from your bashrc (above).

I then tried my cmeps update branch (which is where I first saw the failed test) and ran it again using ecflow.

This time, all the jobs passed but the RegressionTests_wcoss_dell_p3.log file is empty. All the .log files in log_wcoss_dell_p3 report 'pass' but the RegressionTest file was not created. Minsuk is taking a look at it now.
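
One way to catch this case early is a guard that treats a missing or zero-length summary log as a failure, even when the per-test logs report 'pass'. This is only a sketch; rt_log_written is a hypothetical helper and not part of the rt script:

```shell
# Hypothetical check: succeed only if the summary log exists and is non-empty.
rt_log_written() {
    # [ -s FILE ] is true when FILE exists and has a size greater than zero.
    if [ -s "$1" ]; then
        echo "ok: $1"
    else
        echo "missing or empty: $1" >&2
        return 1
    fi
}
```

For example, `rt_log_written RegressionTests_wcoss_dell_p3.log || echo "summary log not written"` would flag the situation described above.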

junwang-noaa commented 3 years ago

I'd like to confirm: the other runs (cold, 2-day, 3-day) all passed, and only the restart test failed when you used your previous .bashrc file, right?

DeniseWorthen commented 3 years ago

Yes, that is right. When I initially tested the updcmeps branch, only the restart test failed. That prompted me to test the develop branch, where again only the restart test failed. I then added your bashrc items. I can't remember whether I ran all the tests or just the restart test at that point, but the restart test passed for the develop branch. So I went back to my updcmeps branch and all jobs passed (using ecflow), but the log file was not created.

DeniseWorthen commented 3 years ago

I've now re-run the updcmeps branch, on dell, using ecflow and all tests passed again. This time the log files were written. I don't know what caused the failure to write the RegressionTests_wcoss_dell_p3.log last time.

I will try the develop branch again (all tests) and make sure that it works and then close the issue if I don't see a problem.

junwang-noaa commented 3 years ago

Denise, would you please send me your run directory of the failed restart case on dell? It would be interesting to know why the env setting affects only this test.

DeniseWorthen commented 3 years ago

I found the run that had the one failed test: /gpfs/dell2/ptmp/Denise.Worthen/S2S_RT/rt_39969

In this case the cold, 2d and 3d tests all passed.

DeniseWorthen commented 3 years ago

After @MinsukJi-NOAA suggested that I might need some sort of ecflow access added, I had an exchange w/ the wcoss help desk. This is his response:

This could be part of the problem? IBM will kill processes running on login nodes for >24 hours. ecflow_server processes should be started on dedicated ecflow nodes. If the process was killed then you would definitely have issues until it was restarted. Development ecflow nodes are ldecflow1, ldecflow2, sdecflow1, sdecflow2, mdecflow1, mdecflow2, vdecflow1 and vdecflow2.

Are we sure that the RT is running the right way on dell? I think @binli2337 you said you couldn't run using ecflow at all?
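
The node list in the help desk reply can be turned into a quick pre-flight check. This is a sketch under the assumption that the short hostname matches one of those names exactly; is_ecflow_node is a hypothetical helper, not something the rt script provides:

```shell
# Hypothetical helper: succeed if the given hostname is one of the development
# ecflow nodes named in the help desk reply.
is_ecflow_node() {
    case "$1" in
        ldecflow1|ldecflow2|sdecflow1|sdecflow2|mdecflow1|mdecflow2|vdecflow1|vdecflow2)
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```

A call such as `is_ecflow_node "$(hostname -s)" || echo "not on a dedicated ecflow node"` before starting ecflow_server would show whether the server is about to be launched on a login node.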

minsukji commented 3 years ago

@DeniseWorthen, my understanding is that each user is assigned a port number that can be used with ecflow. In the rt script, this port number is picked up automatically via the USER Linux environment variable. Before I got my port number, ecflow jobs were unstable (sometimes they ran, sometimes they failed to start, sometimes they failed in the middle of a job). As for the 'ecflow nodes', I did not know about them, and I don't believe they are used in rt.

DeniseWorthen commented 3 years ago

OK, thanks. He also said that all developers should have access to ecflow and wanted to know if I needed access to the ecflow software. I did not think so, but maybe you did, which is why you needed to make a specific request?

junwang-noaa commented 3 years ago

I never asked the help desk about the ecflow port number, but I am running fine with ufs-weather ecflow, so I am not sure whether it is assigned automatically. Perhaps Dusan Jovic (dusan.jovic@noaa.gov) knows more about this.

DeniseWorthen commented 3 years ago

I think I'm experiencing what Minsuk did: the behaviour is just flaky. Sometimes it works and the log file is created, but other times it isn't. I was able to run the develop branch; all the tests passed and the log files were created.

DeniseWorthen commented 3 years ago

I am going to mark this as a bug since we should be able to run ecflow on dell reliably.

DeniseWorthen commented 3 years ago

I am going to close this issue. We can create a new issue on ufs-weather if we see the same problem re-emerge.