libAtoms / workflow

python workflow toolkit
GNU General Public License v2.0

setting WFL_AUTOPARA_NPOOL #32

Closed - WillBaldwin0 closed this issue 1 year ago

WillBaldwin0 commented 2 years ago

When I have been doing single FHI-aims calculations across multiple nodes (AIMS requires a lot of memory), I've been specifying resources in the remoteinfo.json like this:

"resources": { "n" : [4, "nodes"], "max_time": "24h", "ncores_per_task" : 1, "partitions": "highmem" }

Is this the right way to run this calculation? When I do this and manually set OMP_NUM_THREADS and WFL_AUTOPARA_NPOOL to 1 at the start of the job, I get 512 MPI tasks as expected (4 nodes with 128 cores each) and FHI-aims runs correctly. However, by default the code sets WFL_AUTOPARA_NPOOL equal to the number of tasks per node, which is 128. This then ends up parallelising over the configs within each chunk, which I do not want. Before I manually set the variables, the FHI-aims output was also telling me that OMP_NUM_THREADS was not 1.

Is there a right way to do this, using more specification in the resources dictionary? I can't easily follow how this part of the code passes the information around.

bernstei commented 2 years ago

The right way to get one configuration per job submission is to set (not in resources, but in another part of the remoteinfo dict, see the top level workflow README.md) "job_chunksize": 1. I don't think anything in workflow sets OMP_NUM_THREADS by default (maybe for GAP fit, but not in general), unless someone specifically coded it that way for the fhi calculator wrapper, or maybe in your machine's .expyre/config.json. Whether FHI-aims runs correctly with that env var not set at all, I don't know, so you might still need to set it in the "env_vars" remoteinfo dict entry.

You could add a "pre_command" dict entry to do "env" and see everything that's set when the job runs.
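
For illustration, a remoteinfo entry along those lines might look something like the following. This is just a sketch mirroring the remoteinfo.json format shown later in this thread, not a verified config; the function key is only an example, and the exact name and format of the pre-command entry should be checked against the workflow README.

{
 "dft.py::evaluate_dft" : {
     "sys_name": "local",
     "job_name": "aims",
     "resources": { "n" : [4, "nodes"], "max_time": "24h", "partitions": "highmem" },
     "job_chunksize" : 1,
     "env_vars": ["OMP_NUM_THREADS=1"],
     "pre_command": ["env"]
   }
}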

WillBaldwin0 commented 2 years ago

I do also have job_chunksize set to what I want in remoteinfo (I want to run several of these, so job_chunksize is sometimes not 1), but it was still setting WFL_AUTOPARA_NPOOL to something other than 1.

WillBaldwin0 commented 2 years ago

For completeness, this is the structure of my config.json and remoteinfo.json:

{
   "systems":{
      "local":{
         "host":null,
         "scheduler":"slurm",
         "rundir":"/work/e89/e89/wjb48/_expyre/remote_rundir",
         "commands":[
            "source /work/e89/e89/wjb48/.bashrcwork",
            "export OMP_NUM_THREADS=1",
            "which wfl",
            "which python"
         ],
         "header":[
            "#SBATCH --nodes={nnodes}",
            "#SBATCH --tasks-per-node={ncores_per_node}",
            "#SBATCH --cpus-per-task=1",
            "#SBATCH --account=e89-came",
            "#SBATCH --qos=highmem",
            "#SBATCH --exclusive"
         ],
         "partitions":{
            "standard":{"ncores":128, "max_time":"24h", "max_mem":"256GB"},
            "highmem":{"ncores":128, "max_time":"24h", "max_mem":"512GB"}
         }
      }
   }
}

and this is the remoteinfo.json:

{
 "dft.py::evaluate_dft" : {
     "sys_name": "local",
     "job_name": "aims",
     "resources": { "n" : [4, "nodes"], "max_time": "24h", "ncores_per_task" : 1, "partitions": "highmem" },
     "job_chunksize" : 10,
     "env_vars":["OMP_NUM_THREADS=1", "WFL_AUTOPARA_NPOOL=1"]
   }
}

And the resulting job script that this gives me is this:

#!/bin/bash -l
#SBATCH --nodes=4
#SBATCH --tasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --account=e89-came
#SBATCH --qos=highmem
#SBATCH --exclusive
#SBATCH --job-name=aims_chunk_1_61XBdT4tHD1P6rFUSSHOJMynKYB4KDWLhVMdA-lAo2E=_chcwoyi4
#SBATCH --partition=highmem
#SBATCH --time=24:00:00
#SBATCH --output=job.aims_chunk_1_61XBdT4tHD1P6rFUSSHOJMynKYB4KDWLhVMdA-lAo2E=_chcwoyi4.stdout
#SBATCH --error=job.aims_chunk_1_61XBdT4tHD1P6rFUSSHOJMynKYB4KDWLhVMdA-lAo2E=_chcwoyi4.stderr

if [ ! -z $SLURM_TASKS_PER_NODE ]; then
    if echo "${SLURM_TASKS_PER_NODE}"| grep -q ","; then
        echo "Using only first part of hetereogeneous tasks per node spec ${SLURM_TASKS_PER_NODE}"
    fi
    export EXPYRE_NCORES_PER_NODE=$(echo $SLURM_TASKS_PER_NODE | sed "s/(.*//")
else
    export EXPYRE_NCORES_PER_NODE=128
fi
export EXPYRE_NCORES_PER_TASK=1
export EXPYRE_NNODES=4
export EXPYRE_TOT_NCORES=512
export EXPYRE_NTASKS_PER_NODE=128
export EXPYRE_TOT_NTASKS=512
cd /work/e89/e89/wjb48/_expyre/run_aims_chunk_1_61XBdT4tHD1P6rFUSSHOJMynKYB4KDWLhVMdA-lAo2E=_chcwoyi4
source /work/e89/e89/wjb48/.bashrcwork
export OMP_NUM_THREADS=1
which wfl
which python
touch _expyre_job_started
(
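# expyre's defaults are exported first (NPOOL = tasks per node, OMP threads = cores per task);
# the env_vars entries from remoteinfo follow below and override them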
export WFL_AUTOPARA_NPOOL=${EXPYRE_NTASKS_PER_NODE}
export OMP_NUM_THREADS=${EXPYRE_NCORES_PER_TASK}
export OMP_NUM_THREADS=1
export WFL_AUTOPARA_NPOOL=1
python3 _expyre_script_core.py
error_stat=$?
exit $error_stat
)
error_stat=$?
if [ $error_stat == 0 ]; then
    if [ -e _tmp_expyre_job_succeeded ]; then
        mv _tmp_expyre_job_succeeded _expyre_job_succeeded
    else
        echo "No error code but _tmp_expyre_job_succeeded does not exist" > _tmp_expyre_job_failed
        if [ -f _expyre_error ]; then
            cat _expyre_error >> _tmp_expyre_job_failed
        fi
        mv _tmp_expyre_job_failed _expyre_job_failed
    fi
else
    if [ -e _expyre_error ]; then
        mv _expyre_error _expyre_job_failed
    else
        echo "UNKNOWN ERROR $error_stat" > _tmp_expyre_job_failed
        mv _tmp_expyre_job_failed _expyre_job_failed
    fi
fi

So in this case OMP_NUM_THREADS is always being set correctly, and then I also manually overwrite it using the env_vars too. However, WFL_AUTOPARA_NPOOL gets set to EXPYRE_NTASKS_PER_NODE, which is 128, so the manual overwrite that I have put in the env_vars is needed.

bernstei commented 2 years ago

OK, I think I understand what you're trying to achieve a bit better. The fundamental problem is that there are so many levels of possible parallelization - configs per remote job, configs per chunk for pool parallelization within each job, mpi tasks per mpirun invocation, and OpenMP threads per task (whether that's an mpi task or a serial run).
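
For reference, those levels map roughly onto the knobs that come up in this thread as follows (a summary sketch of the discussion, written as a Python dict just for compactness, not an official list):

parallelization_controls = {
    "configs per submitted job":          '"job_chunksize" in the remoteinfo dict',
    "configs in parallel within a job":   "WFL_AUTOPARA_NPOOL (pool parallelization)",
    "MPI tasks per mpirun invocation":    "the mpirun command line, e.g. built from the EXPYRE_* env vars",
    "OpenMP threads per task":            "OMP_NUM_THREADS",
}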

Let's back up a bit, figure out your precise use case, and see if there's a better way of achieving it with the current controls, and if not, whether finer-grained control is worth it. How exactly do you want things chopped up? How many configs total, how many configs per job, how big are the nodes, how many cores do you want for each job (i.e. qsub), do you want pool-based parallelization, and for the bottom-level job, is it just MPI, just serial, just OpenMP, or MPI + OpenMP?

bernstei commented 2 years ago

@WillBaldwin0 we can also arrange for an interactive discussion (maybe at group meeting) if you think that'll be more effective.

WillBaldwin0 commented 2 years ago

So I have large calculations to do, which means single calculations across multiple nodes. What I want to do is dispatch normal workflow jobs where each job has N .xyz's to evaluate, and it does them in serial, using 500 cores.

This works fine on our cluster in engineering, when workflow assigns one task per node. FHI-aims then reports processes on all the cores on the node and all is well. With multiple nodes, I don't think that I can assign one workflow task across several nodes.

Sorry I had to dash off after my part of the group meeting!

bernstei commented 2 years ago

I'm still not entirely following.

High level question - why bother with more than one config per submitted job (your N > 1), if they'll be running in series anyway?

If you really want each submitted job to do its multiple configs in series, someone has to set WFL_AUTOPARA_NPOOL=1. There has to be some default, and I thought it'd be easier for the code to remember the details for the only general non-trivial option, i.e. NPOOL = NTASKS (same as NCORES by default, but you can also set NTASKS < NCORES by setting cores_per_task, or whatever it is called exactly). I think I always envisioned that if the user wants the submitted job to do configs in series, they'd have to explicitly set WFL_AUTOPARA_NPOOL=1, via env_var, which is what you're doing, so that's as designed.

Maybe this could be documented better?

If you want to autoparallelize multiple configs in a ConfigSet_in across multiple nodes, the built-in python pool mechanism won't work. There is an mpipool python module which has similar semantics but can run across multiple nodes. However, I think it'll conflict with the MPI that FHI-aims is going to try to use, so that's not really available to you in this case (it'd be possible if the low level function was not using MPI, e.g. a GAP evaluation with the generic calculator).

WillBaldwin0 commented 2 years ago

Well this would work, but I have a lot of configurations to evaluate, and the bottleneck is memory more than processing power - my current calculations run out of memory on less than 4 high memory nodes of this cluster, but then when they run they only take 40 minutes. Given that I may need to do many tens of configs at a time, and that this cluster does not allow me to queue more than 16 high memory jobs at a time, it does not seem right to have a job only do one quick calculation at a time.

bernstei commented 2 years ago

and that this cluster does not allow me to queue more than 16 high memory jobs at a time, it does not seem right to have a job only do one quick calculation at a time.

Aah - if you have an unhelpful queuing system policy, that would explain it.

Anyway, my design ideas are as they are - the remote job system has to assume something about WFL_AUTOPARA_NPOOL. It could leave it unset, which would effectively default to 1; it could set it to 1; it could set it to what I chose, NTASKS_PER_NODE; or it could use some other heuristic value. I'm willing to hear arguments against the current setup, but I don't see a better default, and given the current default you want non-default behavior, so you'll need to set WFL_AUTOPARA_NPOOL=1 via env_vars.

WillBaldwin0 commented 2 years ago

I personally prefer explicit over implicit and would rather have something like an explicit field in the resources which specifies the number of parallel config streams per job. Specifying tasks in the resources has been a little unclear to me, since on our cluster setting one task per node is fine, but on archer setting 128 tasks per node also seems to work fine. I would also be happy with providing redundant information which workflow then checks for contradictions.

The environment variables of course work fine, but my main issue is the barrier to entry. Without knowing the internals of workflow, tasks, mpi etc, as a new user who hadn't written the code, I just struggled to understand and fix what was going on here.

bernstei commented 2 years ago

I personally prefer explicit over implicit and would rather have something like an explicit field in the resources which specifies the number of parallel config streams per job.

What exactly do you mean by "explicit" here?

In general in wfl I like to set the number of parallel configs with the WFL_AUTOPARA_NPOOL env var. Note that all the autoparallelized functions are supposed to take an npool optional arg (which I think is followed in practice, but I haven't checked), and the env var is only used if npool is not specified. The reason I generally use the env var is because in different circumstances I want different values, so I don't want it hard coded into my scripts.
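
As a sketch of that precedence (not wfl's actual code; the function name here is purely illustrative):

import os

def resolve_npool(npool=None):
    # an explicit npool argument wins; otherwise fall back to the
    # WFL_AUTOPARA_NPOOL env var; otherwise process configs in series
    if npool is not None:
        return int(npool)
    return int(os.environ.get("WFL_AUTOPARA_NPOOL", 1))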

Given this general approach, why shouldn't the way you set that information for a remote job be via the general mechanism for setting env vars for the remote job? It is explicit (if you want non-default behavior), although perhaps not through the interface you expected or prefer.

I do agree that it may be surprising that the defaults for "normal" runs and remote job runs are different. We could make the default be configs-in-series for both, but I didn't do that because then a user who wants pool parallelization would have to remember which expyre env var to set WFL_AUTOPARA_NPOOL equal to. I think it's rare to set it to anything other than 1 or ntasks, so why not have the default be the less trivial one? Your use case is WFL_AUTOPARA_NPOOL=1, right?

Specifying tasks in the resources has been a little unclear to me, since on our cluster setting one task per node is fine, but on archer setting 128 tasks per node also seems to work fine. I would also be happy with providing redundant information which workflow then checks for contradictions.

A lot of this stuff is really about hints, which may not be clear. You can do whatever you want - if your function runs "mpirun" and your mpirun implementation gets the number of mpi tasks from the queuing system, nothing in the expyre/workflow system will stop you. It's all designed so that you can create scripts in ways that are fairly queuing system/node size agnostic, and enough env vars are set so that you can create mpirun strings that do the right thing.
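
For example, a calculator wrapper could build its mpirun string from the EXPYRE_* variables exported in the generated job script above; a rough sketch, where the executable name and redirection are placeholders rather than anything wfl provides:

import os

ntasks = int(os.environ.get("EXPYRE_TOT_NTASKS", 1))
aims_command = f"mpirun -np {ntasks} aims.x > aims.out"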

Another important goal is to be able to specify numbers of cores OR number of nodes (regardless of how many cores), and work with both types of resource specifications. Some types of work require one (e.g. GAP fit on a single large memory node, regardless of how many cores), some the other (e.g. 64 cores of VASP, but could be 4x16 or 2x32 or whatever).
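
Concretely, the resources entry could then be written either way, along the lines of the sketch below (based on the "n": [4, "nodes"] form used earlier in this thread; whether "cores" is the exact keyword for the second form should be checked against the expyre docs):

resources_by_nodes = {"n": [4, "nodes"], "max_time": "24h", "partitions": "highmem"}
resources_by_cores = {"n": [64, "cores"], "max_time": "24h"}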

I agree that this part probably requires some improvement, but maybe that's just better documentation. And maybe the tasks vs. cores distinction makes things too complicated. It's really supposed to be MPI tasks vs. OpenMP threads.

The environment variables of course work fine, but my main issue is the barrier to entry. Without knowing the internals of workflow, tasks, mpi etc, as a new user who hadn't written the code, I just struggled to understand and fix what was going on here.

The problem is that the scope for parallelism is so complex that if we set no defaults (that's what I'm interpreting "explicit" to mean), that'd be its own barrier to entry.

How about this, for a practical way to proceed - what documentation, if any, did you consult when trying to figure out how to set this up? Let me see if that's really all we have, and if so, whether it can be improved.

WillBaldwin0 commented 2 years ago

I think that the setup is actually correct here, in that you need one environment variable in the environment where the jobs are submitted from, and one in the job submission script itself. I looked through the READMEs for workflow and expyre when I was struggling with this, and the workflow README did mention it, which helped.

Now that I understand it, and have thought more about how workflow works and how to debug this kind of thing, I'm happy with the current setup.

There is documentation on this, but I think the reason it's been a bit of a pain is that there are many places that settings and information can come from when using workflow. There are the remoteinfo, the config, and environment variables. I've found it quite hard to use workflow even when there has been documentation (but I haven't wanted to understand the internals), because of having to track the many different places that settings enter the process, and the order in which they do.

What do you think about this kind of thing?

bernstei commented 2 years ago

I see how it would be confusing, but I haven't come up with a good alternative. I've thought of 3 things that could help:

  1. Reduce the actual possible number of things or places you can control
  2. Improve the defaults so the user has to change things even less
  3. Improve the documentation (maybe a diagram?)

I have thought about 1 and 2 and could not come up with any significant improvements, but I'm happy to hear suggestions. I fully admit that 3 is needed, and I would be happy to try to write something, but there are so many ways to do this that suggestions from users as to what would be most helpful would focus my efforts. Let me start with a diagram, and let's see if you think that's helpful.

WillBaldwin0 commented 2 years ago

Actually, since I want to understand this more so I can help document things, shall we have a call some time next week so I can ask about some stuff and help with these kinds of diagrams? I would love to be able to help document things that I've found tricky, but I just don't know the underlying structure that well yet.

bernstei commented 2 years ago

Let me put together an overall diagram, since I should know all the pieces, and then we can talk about it.