Closed: dwimjpurnomo closed this issue 2 months ago
Thanks Dwi, if you attach your input deck I'll run locally to see if I can reproduce the error.
Hi Chris,
The input decks are here https://drive.google.com/drive/folders/1rytDfsPmxqwUDsfNeiydxyta_V4R4ysX?usp=sharing
Hi Chris,
These are the input decks for the new version of ELMFIRE: inputSavio.zip
I tested them on Savio. The run works on 1 node but does not work across multiple nodes.
Thanks, Dwi. I'm not sure why you're seeing this behavior. Could it be caused by running from a directory that's not shared across nodes? Since 99.9% of the time (at least for me!) ELMFIRE is run on a single node, I'm going to leave this issue open so that I can eventually get to it, but put it on hold for now.
Hi @dwimjpurnomo, I finally found some time to look into this - thanks for your patience.
With the most recent code, inter-node runs are working on my end, including generation of all outputs, so try re-running after building from the latest repository. If you haven't done this already, also be sure to set `MULTIPLE_HOSTS = .TRUE.` in the `&SIMULATOR` namelist group.
One more thing: when calling mpirun (Open MPI 4.1.2) I had to pass it the flag `--mca btl_tcp_if_exclude 172.17.0.0/24,127.0.0.0/24` to get it to work, but that may be unique to my local environment.
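For anyone scripting the launch, here is a minimal sketch of how the mpirun call above could be assembled. It is an illustration, not part of ELMFIRE: the helper name `build_mpirun_command`, the executable name `elmfire`, the input file `elmfire.data`, and the hostfile path are all assumptions, and the excluded subnets are the ones from this comment, which may not apply to other clusters.

```python
import shlex


def build_mpirun_command(executable, input_file, num_procs, hostfile=None,
                         excluded_networks=("172.17.0.0/24", "127.0.0.0/24")):
    """Return an argv list for an Open MPI launch (hypothetical helper).

    executable, input_file and hostfile are placeholders supplied by the
    caller; none of these names come from the ELMFIRE documentation.
    """
    cmd = ["mpirun", "-np", str(num_procs)]
    if hostfile is not None:
        # Tell Open MPI which nodes to use for the multi-node run.
        cmd += ["--hostfile", hostfile]
    if excluded_networks:
        # Keep the TCP transport off these subnets (here: the Docker bridge
        # and loopback ranges mentioned in the comment above).
        cmd += ["--mca", "btl_tcp_if_exclude", ",".join(excluded_networks)]
    cmd += [executable, input_file]
    return cmd


def format_command(cmd):
    """Render the argv list as a shell-safe string for logging."""
    return shlex.join(cmd)
```

Calling `format_command(build_mpirun_command("elmfire", "elmfire.data", 40, hostfile="hosts.txt"))` gives a string that can be pasted into a job script. Pass `excluded_networks=()` to drop the `--mca` flag if your environment does not need it.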
Thank you, Chris. I will try this.
Best,
Dwi
Regards Dwi Marhaendro J Purnomo
It's working now. Thank you!!!
Hi Chris
When I run ELMFIRE on Savio using multiple nodes (each node has 20 cores), something strange happens. It suddenly creates many .bin files but not the corresponding .bil and .hdr files. The .bin files start at some ensemble number and run through the last ensemble. ELMFIRE then runs only up to that starting number minus 1. The simulation then stops and is considered complete, with a message from "orted" in the job ID.
To illustrate: let's say we use 100 ensembles. Within one second, the .bin files for ensembles 26 to 100 are already created, but not the corresponding .bil and .hdr files. ELMFIRE then continues running and creates the remaining ensembles (1 - 25). After those 25 ensembles are done, the simulation stops and is considered complete, without producing the .bil and .hdr outputs for ensembles 26 - 100. At the end of the simulation, the outputs created are: .bin (1 - 100), .hdr (1 - 25), .bil (1 - 25).
This doesn't happen when using 1 node on Savio.
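To make the symptom easy to report, a small script can list which outputs have a .bin file but lack the matching .bil or .hdr. This is only a sketch under an assumption: it treats files that share the same base name (everything before the extension) as belonging to the same ensemble member. The actual ELMFIRE output naming may differ, and the function name `find_incomplete_outputs` is made up for this example.

```python
from collections import defaultdict
from pathlib import Path


def find_incomplete_outputs(output_dir, marker=".bin", required=(".bil", ".hdr")):
    """Map each base name that has a `marker` file to its missing extensions.

    Assumes (hypothetically) that the outputs of one ensemble member share a
    base name and differ only by extension.
    """
    suffixes_by_stem = defaultdict(set)
    for path in Path(output_dir).iterdir():
        if path.is_file():
            suffixes_by_stem[path.stem].add(path.suffix.lower())

    incomplete = {}
    for stem in sorted(suffixes_by_stem):
        suffixes = suffixes_by_stem[stem]
        if marker in suffixes:
            absent = [ext for ext in required if ext not in suffixes]
            if absent:
                incomplete[stem] = absent
    return incomplete
```

Running `find_incomplete_outputs("./outputs")` after a multi-node job (the directory name is a placeholder) returns an empty dict when every .bin has both companions, and otherwise shows exactly which ensemble members were left unfinished.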