ufs-community / ufs-weather-model

UFS Weather Model

Having trouble running with UCX on WCOSS2 #2231

Open MatthewPyle-NOAA opened 1 month ago

MatthewPyle-NOAA commented 1 month ago

Description

Attempts to run an RRFS configuration using UCX rather than Slingshot have failed as soon as the model begins integrating. The failures look similar to model instability failures, so it seems that NaNs are getting into the system somehow.

To Reproduce:

Use the /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307/job_card.sh job card (on Dogwood) to run the case. This requires copying the run_dir and config_parms directories to your own space. job_card.sh_nonucx is a job card that avoids UCX and works for me.

Additional context

Very open to the idea that it is user error on my part, but could use help figuring out why it is failing the way it is.

Output

ucx failure log file on Dogwood: /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307/OUTPUT_60h_41nodes_retry_newucxtest_v0.8.9

MatthewPyle-NOAA commented 3 weeks ago

@GeorgeVandenberghe-NOAA Jun Wang recommended that I reach out to you about this issue. My attempts to use UCX for the RRFS application fail when the model starts integrating. My hope is that there is something wrong with my setup, and since you have experience running it for the global application, maybe you could take a look? Thanks!

GeorgeVandenberghe-NOAA commented 3 weeks ago

Do you have a WCOSS2 CWD with testcase, a job to run it and (possibly) source code and the build?

MatthewPyle-NOAA commented 3 weeks ago

Most of the details you need are described in the "To Reproduce" part of the issue. I do have a test setup on Dogwood. I've been pointing at RRFS model executables, but could point you at a source if needed.

GeorgeVandenberghe-NOAA commented 3 weeks ago

You can repair this in the UCX job by loading a later level of cray-mpich-ucx. When I do this, the test job runs to timeout. Change

module load cray-mpich-ucx/8.1.12

to

module load cray-mpich-ucx/8.1.19

MatthewPyle-NOAA commented 3 weeks ago

Thanks @GeorgeVandenberghe-NOAA, will give that a try!

MatthewPyle-NOAA commented 3 weeks ago

I have confirmed that going to cray-mpich-ucx/8.1.19 solves my issue. Closing the issue.
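
For anyone wanting to guard against loading the older module level by accident, here is a minimal sketch of a check that could be sourced in a job card after the module loads. It is not taken from the job cards in this thread. It assumes the module system exports LOADEDMODULES as a colon-separated list of name/version entries (Lmod and Environment Modules both do), and it relies on GNU `sort -V` for version ordering. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report whether the loaded cray-mpich-ucx level is new enough.
# Assumption: LOADEDMODULES is a colon-separated list of name/version entries.

# Print the version of the named module found in LOADEDMODULES, or nothing.
loaded_version() {
  local name="$1"
  printf '%s\n' "${LOADEDMODULES:-}" | tr ':' '\n' \
    | awk -F/ -v n="$name" '$1 == n { print $2; exit }'
}

# Return 0 if version $1 >= version $2 (GNU sort -V does the ordering).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Return 0 if cray-mpich-ucx is loaded at the required level or later,
# 1 if it is older, 2 if it is not loaded at all.
check_ucx_mpich() {
  local required="${1:-8.1.19}" found
  found="$(loaded_version cray-mpich-ucx)"
  if [ -z "$found" ]; then
    echo "cray-mpich-ucx is not loaded"
    return 2
  fi
  if version_at_least "$found" "$required"; then
    echo "cray-mpich-ucx/$found is OK (>= $required)"
  else
    echo "cray-mpich-ucx/$found is older than $required"
    return 1
  fi
}
```

Calling `check_ucx_mpich` with no argument uses 8.1.19, the level reported to work in this thread, as the threshold. A different minimum can be passed as the first argument.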

junwang-noaa commented 2 weeks ago

@MatthewPyle-NOAA is there any issue with using UCX?

MatthewPyle-NOAA commented 2 weeks ago

@junwang-noaa I'm still looking into something - it definitely initializes much more quickly, but seems a bit slower beyond that point.

GeorgeVandenberghe-NOAA commented 2 weeks ago

I lost my testcase on Dogwood after the problem was closed. Do you have a CWD and source on Cactus?

MatthewPyle-NOAA commented 2 weeks ago

@GeorgeVandenberghe-NOAA I have things under /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307 on Cactus. job_card.sh uses UCX, and job_card.sh_nonucx doesn't. I accidentally scrubbed some job log files from earlier today, but for a 60 h forecast on 153 nodes I have seen that UCX saves about 7 minutes in the time until the f00 output is written, and is then about 9 minutes slower than non-UCX going from f00 to f60. So far I've just been pointing at an RRFS executable. Would you recommend recompiling the code while pointing at the UCX modules?
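
Taken together, the two figures above imply a net slowdown of roughly 2 minutes for the UCX run. To make the comparison repeatable, a small helper can split a run's wall time into the two phases. The sketch below is only an illustration: the function name is hypothetical, and it assumes the three timestamps (job start, f00 written, f60 written) are already available as epoch seconds, for example from `stat -c %Y` (GNU stat) on the relevant output files.

```shell
#!/usr/bin/env bash
# Sketch: split wall time into "start to f00" and "f00 to f60" phases.
# Arguments are epoch seconds: job start, f00 written, f60 written.
# Results are whole minutes (integer division truncates).

phase_minutes() {
  local start="$1" f00="$2" f60="$3"
  echo "init_min=$(( (f00 - start) / 60 )) integ_min=$(( (f60 - f00) / 60 ))"
}
```

Running it once for the UCX job and once for the non-UCX job gives the two pairs of numbers to compare.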

GeorgeVandenberghe-NOAA commented 2 weeks ago

The UCX stuff should be shared libraries and recompiling won't affect it. Do you have a source and build in that directory?

I'll go ahead and snag it. I had gotten rid of my testcases after the problem was closed.
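
One way to see which shared libraries an executable actually resolves under each job card's module environment is to filter `ldd` output for MPI and transport libraries. This is a sketch, not something run in this thread. The library name patterns (libmpi*, libucp/libucs/libuct, libfabric) are assumptions about typical names, and the executable name in the usage comment is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: keep only MPI and transport libraries from `ldd` output on stdin.
# Assumption: the name patterns below are typical; adjust them as needed.

comm_libs() {
  # "|| true" keeps the exit status at 0 when nothing matches.
  grep -E 'libmpi|libuc[pst]|libfabric' || true
}

# Usage (hypothetical executable name), once per module environment:
#   ldd ./model.exe | comm_libs
```

If the resolved library paths differ between the UCX and non-UCX environments for the same executable, that would be consistent with the point above that the choice is made at load time rather than at compile time.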

MatthewPyle-NOAA commented 2 weeks ago

Okay. I'm using cray-mpich/8.1.12 for the non-UCX test. Hopefully the level of cray-mpich doesn't explain the difference.