pabloprf / MITIM-fusion

MITIM (MIT Integrated Modeling) Suite for Fusion Applications
https://mitim-fusion.readthedocs.io
MIT License

Slurm submission script requires memory specification #7

Closed: jfparisi closed this issue 1 month ago

jfparisi commented 9 months ago

On PPPL's spark cluster, specifying memory in the bash.src script is required; otherwise the job fails. I recommend we add something like

    # ******* memory setup
    if memory_req is not None:
        commandSBATCH.append(f"#SBATCH --mem={memory_req}GB")

to FARMINGtools.py.

pabloprf commented 9 months ago

What's your suggestion for implementing this? Should users specify the required memory (e.g. a default) in their config_user.json file, or should each model class (e.g. TGLF, TGYRO, CGYRO when available) send a memory specification as well?

jfparisi commented 9 months ago

My starting point would be to avoid introducing unnecessary complexity in config_user.json, so if the cluster does not require --mem in bash.src, I recommend leaving it blank (assuming no memory issues are encountered on those clusters).

But if --mem is required, it's tricky, since differing simulation resolution parameters for even a single model class can lead to a wide range of RAM requirements.

A simple approach is to request RAM proportional to the fraction of CPUs occupied per node. This might slow computation slightly, but it is easy to code (see the sketch below). E.g. if a node has 128 GB of RAM and 32 CPUs, it has 4 GB per CPU; if we request 6 CPUs, we also request 4 GB/CPU x 6 CPUs = 24 GB of RAM. Then all that would be required in config_user.json is a field indicating that slurm requires a memory specification.
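
A minimal sketch of that proportional rule (the helper name proportional_mem_gb is hypothetical, not part of MITIM):

    # Proportional-memory sketch: request the same fraction of node RAM
    # as of node CPUs. Node sizes here match the 128 GB / 32-CPU example.
    def proportional_mem_gb(cpus_requested, node_cpus=32, node_mem_gb=128):
        gb_per_cpu = node_mem_gb / node_cpus      # 128 GB / 32 CPUs = 4 GB/CPU
        return int(gb_per_cpu * cpus_requested)   # 4 GB/CPU * 6 CPUs = 24 GB

    commandSBATCH = []
    commandSBATCH.append(f"#SBATCH --mem={proportional_mem_gb(6)}GB")  # 24 GB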

Another approach could be to start with a 'test' slurm submission with a small amount of RAM, and keep doubling it until the required amount is found. This introduces coding complexity but is efficient in the memory it requests (see the sketch below).
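
A hedged sketch of that trial-doubling loop, assuming a hypothetical submit_and_wait helper (not part of MITIM) that reports whether the test job survived:

    def submit_and_wait(mem_gb):
        """Hypothetical helper: submit a short test job with --mem={mem_gb}GB
        and return True if it completed, False if it was killed for memory."""
        ...

    mem_gb = 2
    while not submit_and_wait(mem_gb):   # False means the test ran out of memory
        mem_gb *= 2                      # double the request and try again
    # mem_gb now holds a sufficient --mem value for the full run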

pabloprf commented 9 months ago

MITIM at the moment only handles calls to simulations without big memory requirements (TGLF, TGYRO). TRANSP is handled through Globus on the PPPL cluster. CGYRO runs are handled externally to MITIM (soon @nthoward and I will provide instructions/tutorials on how to properly do the PORTALS-CGYRO coupling).

In the future we'd like MITIM to also handle CGYRO directly (@tema1992 is interested in this as well), so the issue you raise is important to start thinking about early on.

We could have some logic hard-coded for the calculation of memory requirements based on simulation resolution parameters, or we could have the user point to a file that contains the desired slurm setup. In fact, I'm not sure yet whether we should also have the user point directly to a CGYRO input file (with control parameters) instead of having it handled by MITIM. If we do this, it may be convenient to also point to the slurm specification. I'd like to hear opinions on these logistics from @nthoward @jfparisi @tema1992 @cholland
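
One purely illustrative shape the hard-coded logic could take (the parameter names and the linear scaling below are invented for this sketch, not actual CGYRO requirements):

    # Illustrative only: resolution-based memory estimate. The parameters
    # and scaling are invented; real per-code requirements would differ.
    def estimate_mem_gb(n_radial, n_theta, n_energy, base_gb=2.0):
        grid_points = n_radial * n_theta * n_energy
        return base_gb + grid_points / 1e5   # assumed scaling; needs tuning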

cholland commented 9 months ago

Sorry for the late reply. I don't have any particular insight on the specific slurm setup question, but it may be worth looking at how something like gacode_qsub handles the issue. I'd be wary of trying to automate nonlinear CGYRO runs too heavily, given the "expert user" input still needed for setting up the grid domain and, more importantly, monitoring convergence. But it's an interesting thing to start planning for, something we should discuss at the February SMARTS meeting.

pabloprf commented 9 months ago

@jfparisi The new main branch as of this morning (d7a1f787f80131c83fe97502139ad67b4cc49db1) should fix the issue you were having. The new config file now has the following structure:

    {
       "preferences": {
          "tglf":             "engaging",
          "verbose_level":    "5",
          "dpi_notebook":     "80"
       },
       "engaging": {
          "machine":          "eofe7.mit.edu",
          "username":         "YOUR_USERNAME",
          "scratch":          "/pool001/YOUR_USERNAME/scratch/",
          "slurm": {
             "account":      "YOUR_ACCOUNT",
             "partition":    "YOUR_PARTITION",
             "constraint":   "gpu",
             "mem":          "4GB"
          }
       }
    }

The slurm field can now take those four subfields (partition being the only required one). The mem subfield refers to the memory requirements (written directly as --mem in the SBATCH header) for that specific code. There can be logic inside MITIM to change that value if needed, but that's in the works. Let me know if this helps with your issue.
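
As a hedged illustration (not the actual FARMINGtools.py implementation), the subfields could map onto SBATCH directives like this:

    # Illustrative mapping of config subfields to SBATCH directives;
    # the actual FARMINGtools.py logic may differ.
    import json

    with open("config_user.json") as f:
        slurm = json.load(f)["engaging"]["slurm"]

    commandSBATCH = [f"#SBATCH --partition {slurm['partition']}"]   # required
    for key, flag in [("account", "--account"),
                      ("constraint", "--constraint"),
                      ("mem", "--mem")]:
        if key in slurm:                 # the other three subfields are optional
            commandSBATCH.append(f"#SBATCH {flag} {slurm[key]}")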

And I agree, @cholland, we should avoid too much automation of CGYRO. We can discuss further how I'm thinking of doing this; I'm working with @nthoward on a CGYRO class.

pabloprf commented 1 month ago

Since memory requirements are now provided, optionally, as part of the config file, I'm closing this issue. As for automatic CGYRO submission, work on this is ongoing as part of the SMARTS SciDAC: https://smarts.ucsd.edu/