Question: Is AMD processor + Intel compiler supported by SCHISM? (schism-dev/schism, Issue #77)
Open. SorooshMani-NOAA opened 2 years ago
AMD is picky. We used to get the same problem on an AMD cluster using the Intel compiler. Dan recently found it's related to the MPI implementation and requires a few changes in the batch scripts: unlimit the stack size and set a parameter related to Intel MPI:
export UCX_UNIFIED_MODE=y
-Joseph
Y. Joseph Zhang Web: schism.wiki Office: 804 684 7466
Interesting! Any idea how the model performs on desktop-grade AMD processors with GCC? To put it a different way, is the performance comparable between Intel i7 and Ryzen 5 processors? Thanks.
@josephzhang8, should setting UCX_UNIFIED_MODE=y at runtime fix the crash, or are there other things I need to change as well?
Also:
ulimit -s unlimited
-Joseph
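For reference, the two settings suggested so far could sit together at the top of a batch script, before the MPI launch. This is only a minimal sketch: the function name `setup_runtime_env` and the commented launch line are illustrative assumptions, scheduler directives are omitted, and nothing here is confirmed as a complete fix.

```shell
#!/bin/bash
# Hypothetical batch-script preamble; scheduler directives are omitted.

setup_runtime_env() {
    # Lift the stack size limit; warn instead of aborting if the hard limit forbids it.
    if ! ulimit -s unlimited 2>/dev/null; then
        echo "warning: could not set unlimited stack (hard limit: $(ulimit -H -s))" >&2
    fi
    # UCX setting suggested earlier in this thread.
    export UCX_UNIFIED_MODE=y
}

setup_runtime_env

# Print the effective values so they show up in the job log.
echo "stack limit: $(ulimit -s)"
echo "UCX_UNIFIED_MODE=${UCX_UNIFIED_MODE}"

# The launch line is an assumption; adjust it to your scheduler and executable:
# mpirun -np "$NPROCS" ./your_schism_executable
```

Printing the values into the job log makes it easy to confirm that the settings actually reached the compute node environment.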
I see, thank you
I still see the same issue on the hpc6a platform with the following environment:
ulimit -s unlimited
export UCX_UNIFIED_MODE=y
I get the following errors in my run logs. First, one of the following lines for each core:
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
which I think is due to how the ParallelWorks environment is set up. And then one of these for each core:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
pschism_PAHM_TVD- 00000000006F71DA for__signal_handl Unknown Unknown
libpthread-2.17.s 00002AFBEEFD8630 Unknown Unknown Unknown
libshm-fi.so 00002AFCFA21A98A Unknown Unknown Unknown
libshm-fi.so 00002AFCFA2078BE Unknown Unknown Unknown
libshm-fi.so 00002AFCFA2026B9 Unknown Unknown Unknown
libshm-fi.so 00002AFCFA202F23 Unknown Unknown Unknown
libefa-fi.so 00002AFCFAA08E31 Unknown Unknown Unknown
libefa-fi.so 00002AFCFAA11945 Unknown Unknown Unknown
libefa-fi.so 00002AFCFAA077A9 Unknown Unknown Unknown
libefa-fi.so 00002AFCFAA07865 Unknown Unknown Unknown
libmpi.so.12.0.0 00002AFBEDB26E84 Unknown Unknown Unknown
libmpi.so.12.0.0 00002AFBEDE1117B Unknown Unknown Unknown
libmpi.so.12.0.0 00002AFBEDE18094 Unknown Unknown Unknown
libmpi.so.12.0.0 00002AFBEDA0746A Unknown Unknown Unknown
libmpi.so.12.0.0 00002AFBEDA7BAF0 Unknown Unknown Unknown
libmpi.so.12.0.0 00002AFBEDA6616B Unknown Unknown Unknown
libmpi.so.12.0.0 00002AFBEDA54748 MPI_Comm_dup Unknown Unknown
libmpifort.so.12. 00002AFBED4F260B pmpi_comm_dup_ Unknown Unknown
pschism_PAHM_TVD- 0000000000448D6E Unknown Unknown Unknown
pschism_PAHM_TVD- 0000000000410794 Unknown Unknown Unknown
pschism_PAHM_TVD- 00000000004106A2 Unknown Unknown Unknown
libc-2.17.so 00002AFBEF207555 __libc_start_main Unknown Unknown
pschism_PAHM_TVD- 00000000004105A9 Unknown Unknown Unknown
Looks like an MPI implementation issue. Not sure.
-Joseph
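The backtrace above passes through libshm-fi.so and libefa-fi.so underneath MPI_Comm_dup, which suggests the crash happens inside the libfabric providers rather than in SCHISM itself. One way to gather more information, sketched below, is to raise the Intel MPI debug level so that startup messages report the selected libfabric provider, and to list the providers visible on the node. The debug level of 5 is an arbitrary choice, `fi_info` is assumed to be installed with libfabric, and this is a diagnostic aid, not a fix.

```shell
#!/bin/bash
# Diagnostic sketch only; nothing here is a confirmed fix.

# Ask Intel MPI to print startup details, which should include
# the libfabric provider it selects.
export I_MPI_DEBUG=5

# List the libfabric providers visible on this node, if fi_info is available.
if command -v fi_info >/dev/null 2>&1; then
    fi_info -l || echo "fi_info returned an error" >&2
else
    echo "fi_info not found in PATH; skipping provider listing"
fi

echo "I_MPI_DEBUG=${I_MPI_DEBUG}"
```

Running the job again with this in the batch script should add provider information to the run log ahead of the segfault messages, which may help narrow down whether the shm or efa provider is involved.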
AMD+gcc should work; see example from Levante files.
Our experience so far suggests Intel (when properly implemented) still outperforms gcc.
-Joseph
I'm trying this combination on the ParallelWorks platform, where they have AWS HPC6a instances (AMD). I'm using the same Intel compilers (2021.3.0) that I used on Intel hardware, but the run doesn't go through: I get a segfault. Are there any known issues with this combination?