payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

missing -wdir arguments #341

Closed harshula closed 1 year ago

harshula commented 1 year ago

When testing a Spack build of access-om2 using Payu, I was receiving the following errors:

ice: error reading coupling_nml
...
assertion failed: Input atm.nml does not exist.

I noticed that -wdir is missing from the arguments given to mpirun:

mpirun --mca io ompio --mca io_ompio_num_aggregators 1 -np 1 $SCRATCH/access-om2/work/1deg_jra55_ryf_spackv1.git/atmosphere/yatm.exe : -np 216 $SCRATCH//access-om2/work/1deg_jra55_ryf_spackv1.git/ocean/fms_ACCESS-OM.x : -np 24 $SCRATCH/access-om2/work/1deg_jra55_ryf_spackv1.git/ice/cice_auscom_360x300_24x1_24p.exe

dougiesquire commented 1 year ago

Are you calling payu run from the config directory ($SCRATCH/access-om2/work/1deg_jra55_ryf_spackv1.git)?

harshula commented 1 year ago

Sorry, no. It's executed in: $HOME/payu/1deg_jra55_ryf_spackv1.git

$HOME/payu/1deg_jra55_ryf_spackv1.git/work is a symlink to $SCRATCH/access-om2/work/1deg_jra55_ryf_spackv1.git

dougiesquire commented 1 year ago

Do things work if you run from $SCRATCH/access-om2/work/1deg_jra55_ryf_spackv1.git?

harshula commented 1 year ago

Running from the "work" directory results in the job disappearing and a file (1deg_jra55_ryf.e86040237) with:

FileNotFoundError: [Errno 2] No such file or directory: '$SCRATCH/access-om2/work/1deg_jra55_ryf_spackv1.git/config.yaml'

My understanding is that the "work" directory is temporary.

harshula commented 1 year ago

I'm following these instructions: https://github.com/COSIMA/access-om2/wiki/Getting-started#building-the-models

I think @aidanheerdegen tracked it down to https://github.com/payu-org/payu/blob/a771fe7447ee19fd123b07414ddca64f95dabf5a/payu/experiment.py#L517-L520

dougiesquire commented 1 year ago

> My understanding is that the "work" directory is temporary.

Yes, my apologies, I wasn't reading your paths carefully enough.

harshula commented 1 year ago

Nor was I, when I started answering your original question! :-)

aidanheerdegen commented 1 year ago

> I think @aidanheerdegen tracked it down

I believe the issue is with the introspection payu uses to determine the correct mpirun options. As linked above, it looks for libmpi.so in the linked libraries to determine the MPI module name and version (mpi_module), and then uses this value to choose the command-line options for mpirun:

https://github.com/payu-org/payu/blob/a771fe7447ee19fd123b07414ddca64f95dabf5a/payu/experiment.py#L524C1-L529

Does this fail for spack builds @harshula? If so, what logic would we have to add to support spack built executables directly?
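To make the dependency concrete, here is a minimal sketch of how a missing or unrecognised module string could lead to a missing -wdir. It assumes the check is a prefix test on the module name; that is an assumption for illustration, not a copy of the linked source. `model_mpirun_args` and the example paths are hypothetical.

```python
def model_mpirun_args(mpi_module, work_path, exec_path, ncpus):
    """Build the mpirun arguments for one model of an MPMD launch.

    Hypothetical helper: the name, signature and argument order are
    illustrative and not part of payu.
    """
    args = ['-np', str(ncpus)]
    # Assumption: the working directory flag is only added when the
    # introspected module name looks like OpenMPI.
    if mpi_module.startswith('openmpi'):
        args += ['-wdir', work_path]
    args.append(exec_path)
    return args


# Module detected as expected: -wdir is included.
print(' '.join(model_mpirun_args('openmpi/4.1.0', 'work/ice',
                                 'work/ice/cice.exe', 24)))

# Introspection returned nothing usable: -wdir is silently dropped.
print(' '.join(model_mpirun_args('', 'work/ice', 'work/ice/cice.exe', 24)))
```

If the real logic has this shape, any executable whose libmpi.so path does not yield a name starting with "openmpi" would be launched without a working directory, which would match the reported command line.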

harshula commented 1 year ago

I'll come back to this once openmpi is sorted. A more general question: is there a way to override Payu's heuristics via the config file? In this instance, can we force Payu to insert -wdir via an option in the config file?

aidanheerdegen commented 1 year ago

Yes, there is, indirectly, by adding something like this to config.yaml:

mpi:
   module: openmpi/4.1.0

(Sorry, drafted this days ago and didn't "send")

harshula commented 1 year ago

Doesn't that result in the system/gadi openmpi being used at runtime instead of the Spack built openmpi?

aidanheerdegen commented 1 year ago

Yep. It is a work-around, so not appropriate in some circumstances. Definitely should just fix the introspection stuff to either detect this correctly, or just default to openmpi so there is at least something appropriate.

harshula commented 1 year ago

Notes

def lib_update(bin_path, lib_name):
    # Local import to avoid reversion interference
    # TODO: Bad design, fixme!
    # NOTE: We may be able to move this now that reversion is going away
    from payu import fsops

    # TODO: Use objdump instead of ldd
    cmd = 'ldd {0}'.format(bin_path)
    ldd_output = subprocess.check_output(shlex.split(cmd)).decode('ascii')
    slibs = ldd_output.split('\n')

    for lib_entry in slibs:
        if lib_name in lib_entry:
            lib_path = lib_entry.split()[2]

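            # BUG: the next line assumes lib_path looks like
            # /apps/<name>/<version>/..., which is not true for Spack builds
            # pylint: disable=unbalanced-tuple-unpacking
            mod_name, mod_version = fsops.splitpath(lib_path)[2:4]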

            module('unload', mod_name)
            module('load', os.path.join(mod_name, mod_version))
            return '{0}/{1}'.format(mod_name, mod_version)

    # If there are no libraries, return an empty string
    return ''

harshula commented 1 year ago

The code is expecting the line:

libmpi.so.40 => /apps/openmpi/4.0.2/lib/libmpi.so.40

but receives the line:

libmpi.so.40 => $HOME/spack-microarchitectures.git/opt/spack/linux-rocky8-cascadelake/intel-2019.5.281/openmpi-4.1.5-ooyg5wc7sa3tvmcpazqqb44pzip3wbyo/lib/libmpi.so.40 (0x000014a7cabad000)
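The difference can be reproduced with a small sketch. `split_components` below is a stand-in for `fsops.splitpath`, assumed here to return every path component with the root first for absolute paths; `guess_module` is a hypothetical helper that repeats the two indexing steps from `lib_update`. The inputs are the two lines quoted above.

```python
import os


def split_components(path):
    """Stand-in for fsops.splitpath (assumed behaviour): split a path
    into all of its components, root first for absolute paths."""
    head, tail = os.path.split(path)
    if tail == '':
        return (head,)
    if head == '':
        return (tail,)
    return split_components(head) + (tail,)


def guess_module(ldd_line):
    """Repeat the indexing done in lib_update on a single ldd line."""
    lib_path = ldd_line.split()[2]
    mod_name, mod_version = split_components(lib_path)[2:4]
    return mod_name, mod_version


expected = 'libmpi.so.40 => /apps/openmpi/4.0.2/lib/libmpi.so.40'
spack = ('libmpi.so.40 => $HOME/spack-microarchitectures.git/opt/spack/'
         'linux-rocky8-cascadelake/intel-2019.5.281/'
         'openmpi-4.1.5-ooyg5wc7sa3tvmcpazqqb44pzip3wbyo/lib/libmpi.so.40 '
         '(0x000014a7cabad000)')

print(guess_module(expected))  # ('openmpi', '4.0.2')
print(guess_module(spack))     # two unrelated directory names
```

The second call uses the literal `$HOME` placeholder, so the exact components it prints are illustrative only. With the real expanded path they would differ, but they still would not be a module name and version.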

harshula commented 1 year ago

This is the override mechanism that @aidanheerdegen mentioned earlier:

            mpi_config = self.config.get('mpi', {})
            mpi_module = mpi_config.get('module', None)

We could extend this to be more flexible.
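One possible direction, sketched under assumptions: keep the `mpi.module` key as the first choice, fall back to whatever the introspection found, and finally fall back to a default so that some set of options is always chosen. `resolve_mpi_module`, the `detected` argument and the `'openmpi'` default are hypothetical names for illustration, not payu's API.

```python
def resolve_mpi_module(config, detected, default='openmpi'):
    """Pick the MPI module string used to choose mpirun options.

    Hypothetical helper. Precedence:
      1. explicit 'mpi: module:' entry in config.yaml
      2. value detected from the executable's linked libraries
      3. a default, so the result is never empty
    """
    mpi_config = config.get('mpi', {})
    mpi_module = mpi_config.get('module', None)
    if mpi_module:
        return mpi_module
    if detected:
        return detected
    return default


# Explicit override wins over detection.
print(resolve_mpi_module({'mpi': {'module': 'openmpi/4.1.0'}}, 'openmpi/4.0.2'))

# No override and failed detection: fall back to the default.
print(resolve_mpi_module({}, ''))
```

Note that this only selects which mpirun options to use. Whether an explicit override should also trigger a module load is a separate question, given the concern above about replacing the Spack-built openmpi at runtime.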

harshula commented 1 year ago

[Updated: 28/07/2023]

Requirements

harshula commented 1 year ago

Notes

A general solution to this type of problem is to create a function (e.g. https://github.com/harshula/payu/compare/master...harshula:payu:spack) that creates a data structure of all the required libraries per binary. Ideally this data structure should be initialised when the model object is instantiated to allow any function to access the data without requiring additional system calls and subsequent filesystem reads.
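As a rough illustration of the idea (not the code in the linked branch), the sketch below parses `ldd` output once into a dictionary that maps each library name to its resolved path. `parse_ldd_output` and `required_libs` are hypothetical names, and the sample output in the usage lines is made up for the example.

```python
import subprocess


def parse_ldd_output(ldd_output):
    """Map each library name to its resolved path, or None if unresolved."""
    libs = {}
    for line in ldd_output.splitlines():
        if '=>' not in line:
            continue  # entries without a mapping, e.g. the dynamic loader
        name, _, target = line.partition('=>')
        fields = target.split()
        # Keep only real paths; "not found" and bare addresses become None.
        path = fields[0] if fields and fields[0].startswith('/') else None
        libs[name.strip()] = path
    return libs


def required_libs(bin_path):
    """Run ldd once for a binary and return its library map."""
    output = subprocess.check_output(['ldd', bin_path]).decode()
    return parse_ldd_output(output)


sample = (
    '\tlibmpi.so.40 => /apps/openmpi/4.0.2/lib/libmpi.so.40 (0x00007f0000000000)\n'
    '\tlibmissing.so.1 => not found\n'
    '\t/lib64/ld-linux-x86-64.so.2 (0x00007f0000001000)\n'
)
print(parse_ldd_output(sample))
```

Calling `required_libs` once when the model object is created and storing the result would let later code look up libmpi.so, or any other library, without running `ldd` again.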

jo-basevi commented 1 year ago

> If openmpi is required and spack's version is required, then module load openmpi version from /apps. We haven't tuned Spack's openmpi, yet.

@harshula, is this still the case, or is it now OK to use Spack's openmpi and not load the local NCI /apps version?

harshula commented 1 year ago

Sorry, I should have updated the requirements. I'll update them now. The reason why this requirement is not relevant is here: https://github.com/ACCESS-NRI/ACCESS-OM/issues/6#issuecomment-1620953535