lautenberger / elmfire

Eulerian Level set Model of FIRE spread
https://elmfire.io
Eclipse Public License 2.0

Running elmfire on multiple nodes #57

Closed dwimjpurnomo closed 2 months ago

dwimjpurnomo commented 8 months ago

Hi Chris

When I run ELMFIRE on Savio using multiple nodes (each node has 20 cores), something strange happens. It suddenly creates many .bin files but not the corresponding .bil and .hdr files. The .bin files that get created start at some ensemble number and run through the last ensemble. ELMFIRE then only runs up to that starting number minus 1. The simulation stops and is considered complete, with a message from "orted" in the job ID.

To illustrate: let's say we use 100 ensembles. Within one second, the .bin files for 26 to 100 are already created, but not the corresponding .bil and .hdr files. ELMFIRE then continues running and creates the remaining ensembles (1 - 25). After those 25 ensembles are done, the simulation stops and is considered complete, without producing the .bil and .hdr outputs for ensembles 26 - 100. At the end of the simulation, the outputs created are: .bin (1 - 100), .hdr (1 - 25), .bil (1 - 25).
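
To confirm which outputs are affected, the output directory can be scanned for .bin files that lack a matching .bil or .hdr. The following is a minimal Python sketch and is not part of ELMFIRE. The function name `find_incomplete_outputs` is hypothetical, and it assumes that the .bin, .bil and .hdr files of one output share the same base name, which this thread does not confirm.

```python
from pathlib import Path


def find_incomplete_outputs(output_dir):
    """Return (base name, missing extensions) for each .bin lacking a .bil or .hdr.

    Assumption (not confirmed in this thread): the .bin, .bil and .hdr
    files that belong to one output share the same base name.
    """
    out = Path(output_dir)
    incomplete = []
    for bin_file in sorted(out.glob("*.bin")):
        # Collect the companion extensions that do not exist on disk.
        missing = [ext for ext in (".bil", ".hdr")
                   if not bin_file.with_suffix(ext).exists()]
        if missing:
            incomplete.append((bin_file.stem, missing))
    return incomplete
```

Calling `find_incomplete_outputs("./outputs")` (the directory name is a placeholder) after a multi-node run would list each base name together with the extensions that were never written.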

This doesn't happen when using 1 node on Savio.

lautenberger commented 8 months ago

Thanks Dwi, if you attach your input deck I'll run locally to see if I can reproduce the error.

dwimjpurnomo commented 8 months ago

Hi Chris,

The input decks are here https://drive.google.com/drive/folders/1rytDfsPmxqwUDsfNeiydxyta_V4R4ysX?usp=sharing

dwimjpurnomo commented 7 months ago

Hi Chris,

These are the input decks for the new version of ELMFIRE: inputSavio.zip

dwimjpurnomo commented 7 months ago

I tested this on Savio: it works with 1 node but does not work with multiple nodes.

lautenberger commented 7 months ago

Thanks, Dwi. I'm not sure why you're encountering the behavior that you're seeing. Could it be caused by running from a directory that's not shared across nodes? Since 99.9% of the time (at least for me!) ELMFIRE is run on a single node, I'm going to leave this issue open so that I can eventually get to it, but put it on hold for now.

lautenberger commented 2 months ago

Hi @dwimjpurnomo, I finally found some time to look into this - thanks for your patience.

With the most recent code, inter-node runs are working on my end including generation of all outputs, so try re-running after building from the latest repository. If you haven't done this already, also be sure to set MULTIPLE_HOSTS = .TRUE. in the &SIMULATOR namelist group.
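
Before launching across nodes, it may help to verify that the input deck actually sets this flag. The sketch below is a rough check, not an ELMFIRE utility: the function name `multiple_hosts_enabled` is hypothetical, and the parsing is deliberately simplified (it assumes the &SIMULATOR group ends with a `/` on its own line and only recognizes the literal `.TRUE.`).

```python
import re


def multiple_hosts_enabled(namelist_text):
    """Return True if MULTIPLE_HOSTS = .TRUE. appears inside &SIMULATOR.

    Simplified parsing (an assumption, not a full namelist parser): the
    group runs from '&SIMULATOR' to the next line containing only '/'.
    Comments and alternative logical spellings such as T are ignored.
    """
    group = re.search(r"&SIMULATOR\b(.*?)^\s*/\s*$", namelist_text,
                      flags=re.IGNORECASE | re.DOTALL | re.MULTILINE)
    if group is None:
        return False
    return re.search(r"\bMULTIPLE_HOSTS\s*=\s*\.TRUE\.", group.group(1),
                     flags=re.IGNORECASE) is not None


# Minimal example of the namelist group with the flag set.
sample = """&SIMULATOR
MULTIPLE_HOSTS = .TRUE.
/
"""
print(multiple_hosts_enabled(sample))  # True
```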

One more thing: when calling mpirun (Open MPI 4.1.2) I had to pass it the flag --mca btl_tcp_if_exclude 172.17.0.0/24,127.0.0.0/24 to get it to work, but that may be unique to my local environment.
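
For reference, a multi-node launch with that flag could be assembled roughly as follows. This is only a sketch: the executable name `elmfire`, the input file `elmfire.data`, the host file `hosts.txt` and the process count of 40 (two 20-core nodes) are placeholders, not values taken from this thread. Only the `--mca btl_tcp_if_exclude` setting comes from the comment above, and it may not be needed on other systems.

```python
import shlex


def build_mpirun_command(executable, input_file, num_procs, hostfile,
                         exclude_cidrs=("172.17.0.0/24", "127.0.0.0/24")):
    """Assemble an Open MPI mpirun argument list for a multi-node run."""
    cmd = ["mpirun", "-np", str(num_procs), "--hostfile", hostfile]
    if exclude_cidrs:
        # Keep Open MPI's TCP transport off these networks.
        cmd += ["--mca", "btl_tcp_if_exclude", ",".join(exclude_cidrs)]
    cmd += [executable, input_file]
    return cmd


# All four arguments below are placeholders for illustration.
cmd = build_mpirun_command("elmfire", "elmfire.data", 40, "hosts.txt")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Passing `exclude_cidrs=()` drops the flag entirely for environments that do not need it.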

dwimjpurnomo commented 2 months ago

Thank you, Chris. I will try this.

Best,

Dwi


dwimjpurnomo commented 2 months ago

It's working now. Thank you!!!