libAtoms / ExPyRe

Execute Python Remotely
GNU General Public License v2.0

specifying additional header lines in remoteinfo files #6

Closed WillBaldwin0 closed 2 years ago

WillBaldwin0 commented 2 years ago

I was having trouble with a low-memory node in the womble cluster in Engineering, where the standard partition is listed as having more memory than this node (which is in that partition). One solution would be the ability to specify additional scheduler header lines for certain jobs, such as memory requirements.

gelzinyte commented 2 years ago

I think this is an issue of resource management, although if none of the below work for what you want to do, it shouldn't be too complicated to add this option.

I think it's a mis-specification of the config.json. If a partition has nodes with different memory specifications, then for things to work as expected that partition should be assigned the smallest of those memories in config.json (e.g. partition any should be assigned 23 GB, which is too little for you). In your case that means the standard partition isn't selected, but it also means the job won't run out of memory 🙃

In addition to any, if you have all the nodes listed individually with the correct memory specifications (e.g. any@node24 with max_mem of 100GB), the next one that matches your memory will get picked. Unfortunately, the job will be targeted to node24 even if it is busy and there is an identical node25 that's free.
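For illustration, a sketch of that layout (written as a Python dict so it can be annotated; the real config.json is plain JSON, and the exact values are made up):

```python
# Sketch only: the generic queue is listed with the smallest memory of any of
# its nodes, and a large-memory node is listed individually with its true capacity.
partitions = {
    "any":        {"ncores": 32, "max_time": "168h", "max_mem": "23GB"},   # smallest node in the queue
    "any@node24": {"ncores": 32, "max_time": "168h", "max_mem": "100GB"},  # big-memory node, targeted explicitly
}
```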

A solution could be to specify all nodes with the same memory like this:

"partitions": {"any@node24,any@node25" : {"ncores": 32, "max_time" : "168h", "max_mem": "50GB"}

which will result in -q any@node24,any@node25 in Sun Grid Engine.

As for headers, I think the idea was that they often don't change from job to job, so they are specified in config.json, and the parts that do change are passed via the resources. It's possible to use a per-project config (~/.expyre/config.json -> wdir/_expyre/config.json) and specify the headers there if you need something specific, although you might need to set no_default_header=True and give the full header if you'd like to overwrite something that expyre manages (e.g. max_mem; expyre takes the user-specified headers from config.json and appends some pre-set lines to them, unless no_default_header is True).
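A minimal sketch of such a per-project override, written as an annotated Python dict rather than literal JSON; the nesting and key names other than "partitions" and no_default_header ("systems", "header", the system name) are assumptions for illustration, not taken from the ExPyRe docs:

```python
# Hypothetical wdir/_expyre/config.json content, shown as a Python dict so it
# can carry comments. Key names other than "partitions"/"no_default_header" are guesses.
per_project_config = {
    "systems": {
        "womble": {                      # illustrative system name
            "no_default_header": True,   # take over the header lines expyre would otherwise manage
            "header": [
                "#$ -l h_vmem=4G",       # e.g. a memory limit expyre normally controls via max_mem
            ],
            "partitions": {
                "any@node24,any@node25": {"ncores": 32, "max_time": "168h", "max_mem": "50GB"},
            },
        }
    }
}
```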

Alternatively, maybe it's worth adding min_mem to system.submit via resources, in a similar way to max_time? That way (at least in config.json for SGE) one could specify a #$ -l s_vmem={min_mem}, which would target the job only to nodes with more than min_mem available, if the nodes in the selected partition have varying amounts of memory. I guess that's only a problem in small inhomogeneous clusters like Womble.
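Roughly, the proposal could look like this (nothing below exists at this point; min_mem and the header placeholder are hypothetical):

```python
# Hypothetical: a {min_mem} placeholder in a config.json header template, filled
# from a proposed per-job "min_mem" resource. Neither exists in ExPyRe today.
header_template = "#$ -l s_vmem={min_mem}"
resources = {"max_time": "24h", "min_mem": "40G"}            # illustrative resource request
print(header_template.format(min_mem=resources["min_mem"]))  # -> #$ -l s_vmem=40G
```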

bernstei commented 2 years ago

I think it's worth noting that there are at least 3 communication issues here. One is what the user tells ExPyRe about the memory usage. Another is what ExPyRe tells the queuing system about memory requirements. The third is what the queuing system knows about the actual memory capacity of each node.

The whole point of the "partition" concept in ExPyRe right now is that it is a sensible label that ExPyRe can use to select resources, and if the contents of each partition are not uniform, that assumption is violated.

There are many options to deal with this:

  1. Avoid the entire non-homogeneous partition. List the minimum memory in config.json, and then the user can specify the memory so ExPyRe will skip the entire partition (even though only 1 node will be a problem in practice). This is already possible, but inefficient for your system.

  2. Construct a homogeneous partition. As @gelzinyte says, at least with some queuing systems you could define your own "partition" consisting of an explicit list of a homogeneous set of nodes, but frankly, if you have to list 24 of the 25 nodes, just the partition name would become cumbersome. That's easy to fix by separating the partition label that is passed to ExPyRe from the string that ExPyRe puts in the relevant queuing system header. Still, this is rather cumbersome, and will be very inefficient on some queuing systems (the job may remain queued indefinitely waiting for the first node in the list).

  3. Note that the partition name is not actually sufficient. ExPyRe already knows about node/task counts and time limits as separate resources. It would be straightforward, if not trivial, to add memory (per task or job total, I guess) to that list, which would work as long as (a) the queuing system knew the actual memory capacity of each node, and (b) the user specified the memory usage of the job.

  4. Specify per-job arbitrary headers. The most flexible solution, which would work for arbitrary resources (GPUs, big memory nodes that are selected via a special queuing system property, etc), is to specify per-job queuing system headers in the remoteinfo structure. This is the easiest to implement, so I'm going to do it. It's maybe 10 lines of code in ExPyRe and 3 in workflow.
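A rough sketch of what option 4 might look like from the user side (field names such as "header_extra" are hypothetical here, since the feature is only being proposed):

```python
# Hypothetical remoteinfo entry with per-job extra queuing-system header lines.
# "header_extra" and the other field names are illustrative, not a real schema.
remote_info = {
    "sys_name": "womble",                              # which system from config.json to use
    "resources": {"max_time": "24h", "num_cores": 32},
    "header_extra": [                                  # proposed: appended verbatim to the job script header
        "#$ -l h_vmem=4G",                             # e.g. memory per slot
        "#$ -l gpu=1",                                 # or any other site-specific resource
    ],
}
```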

My preference is to do 4 now, and perhaps open a new issue, intended for a longer time scale, suggesting 3.

gelzinyte commented 2 years ago

I think one of the inefficiencies in how ExPyRe selects resources is that it picks a single partition. That is inefficient if multiple suitable partitions exist but the one selected is busy. If it were possible to submit to multiple suitable partitions/queues/nodes at once (e.g. in SGE #$ -q any@node1\n#$ -q any@node2 works), one could just list all of them once in config.json and forget about it. That would solve points 1 and 2.

I thought any queueing system would have that information? I was thinking of something slightly different, though I think both that and your point 3 are solved by 4.

bernstei commented 2 years ago

> I think one of the inefficiencies in how ExPyRe selects resources is that it picks a single partition. That is inefficient if multiple suitable partitions exist but the one selected is busy. If it were possible to submit to multiple suitable partitions/queues/nodes at once (e.g. in SGE #$ -q any@node1\n#$ -q any@node2 works), one could just list all of them once in config.json and forget about it. That would solve points 1 and 2.

I have been assuming that the way to specify multiple partitions is to put them on the same queuing system header line, so ExPyRe can just treat the partition config string as an opaque thing, and the user can, e.g., set it to "partition1,partition2" if they want. If that syntax isn't supported on your queuing system, then we should think about whether it's worth making it more general.

The other bit of complexity with what you suggest is that ExPyRe does a bunch of massaging of task/node number requests: you can ask for 32 cores without caring whether your config says your nodes have 16 or 32 cores, and it'll request 2 nodes or 1, respectively. You can also request 1 node, and it'll figure out how many tasks that corresponds to. If it can try to match more than 1 partition, then the number of cores per node may not be the same for all of them. Given that every queuing system I know of (except the magic wrapper I wrote for our local cluster) requires that you specify the number of nodes and the number of cores per node, I think it's just going to be impossible if the system has partitions whose nodes have different core counts.
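A toy sketch of that core/node arithmetic for a single homogeneous partition (the function is illustrative, not ExPyRe's actual code):

```python
import math

def cores_to_nodes(requested_cores: int, cores_per_node: int) -> int:
    """Toy version of the massaging described above: round a core request up
    to whole nodes for one homogeneous partition. Illustrative only."""
    return math.ceil(requested_cores / cores_per_node)

# asking for 32 cores: 16-core nodes -> 2 nodes, 32-core nodes -> 1 node
print(cores_to_nodes(32, 16))  # 2
print(cores_to_nodes(32, 32))  # 1
# If a request could match several partitions with different cores-per-node,
# the same request would translate to different node counts on each, which is
# what makes matching multiple partitions at once awkward.
```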

This is exactly the sort of thing that made me not want to use anyone else's remote running framework :) If we want to do something, I think we need to very carefully define our use cases, keeping in mind the way different people use ExPyRe and the different queuing systems we want to support, and think about what can be done.

gelzinyte commented 2 years ago

"partition1,partition2" is supported, and I think it all works quite well, I was thinking that might make it even easier. Although did not at all consider nodes/cores_per_node issue at all. In our case multi-node jobs aren't supported, so it's a bit more straightforward.

bernstei commented 2 years ago

I'm going to close this issue. I opened #8 for specifically dealing with memory-related queuing system header entries, and we can open another one for partition selection if we decide it's worth it.