radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

application level scheduling #3117

Closed andre-merzky closed 1 month ago

andre-merzky commented 8 months ago

This implements #3108, #2974 and #2973

mturilli commented 7 months ago

@andre-merzky can we fix the tests pls?

mturilli commented 5 months ago

@ejjordan are you happy with this PR? When you approve it we can merge it into devel

andre-merzky commented 5 months ago

TODO: cover use case where two ranks use 1.5 GPUs each

andre-merzky commented 5 months ago

TODO: for jsrun, shared GPUs should reside in the same resource set.

ejjordan commented 4 months ago

@ejjordan are you happy with this PR? When you approve it we can merge it into devel

This meets my needs for now as is, and is thus fine to merge from my perspective. There are some improvements that I might like to see as follow-ups or perhaps here before merge though.

1) I still think it is useful to have the ability to query to see how many tasks, if any, can be run. This is useful in the case where I only want the heartbeat to determine how many tasks to try to schedule at a time, while some other function(ality) actually schedules and/or launches the tasks.

2) It would be nice to have a more descriptive way to specify the resource requests. In particular it would be useful to have a priority for requests so that I can for instance focus on getting many small tasks done vs one big one (or vice versa). This would probably also mean that resource requests should have some sort of ID. In the case that I query a bunch of tasks but only get back that I can schedule some of them, I need to know which tasks to schedule and cannot rely for instance on any ordering.

3) It might be nice if the names were a bit more verbose. Instead of def find_slots(self, rr: RankRequirements, n_slots:int = 1) -> List[Slot]: it might be nicer to use more fully qualified names like def find_slots(self, rank_reqs: RankRequirements, num_slots:int = 1) -> List[Slot]: In particular it would be nice to have the slot members not be [_RO] but instead just use the class name so that one does not need to look at the source to try to find out what an _RO is.

But these are just suggestions for improvements, not requirements from my end, so take them for what they are.

andre-merzky commented 4 months ago

I still think it is useful to have the ability to query to see how many tasks, if any, can be run. This is useful in the case where I only want the heartbeat to determine how many tasks to try to schedule at a time, while some other function(ality) actually schedules and/or launches the tasks.

This really depends on the specific task types - and short of scheduling the tasks, there is really no way of answering that question reliably.

Assuming you have a single task type, you can count the number of tasks which can be scheduled with this piece of code:

    allocs = list()
    i = 0
    while True:
        # try to allocate slots for one more task
        slots = nodelist.find_slots(rr, ranks=ranks_per_task)
        if slots:
            allocs.append(slots)
            i += 1
        else:
            print('no slots for task %d' % i)
            break

    # hand all slots back once done counting
    for slots in allocs:
        nodelist.release_slots(slots)

andre-merzky commented 4 months ago

It would be nice to have a more descriptive way to specify the resource requests. In particular it would be useful to have a priority for requests so that I can for instance focus on getting many small tasks done vs one big one (or vice versa). This would probably also mean that resource requests should have some sort of ID. In the case that I query a bunch of tasks but only get back that I can schedule some of them, I need to know which tasks to schedule and cannot rely for instance on any ordering.

If you do application level scheduling, then the order is up to you, really - i.e., if you want to prefer small tasks, then you can schedule small tasks first.
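As a rough sketch of what that could look like on the application side (the priority values and the requests list are illustrative, not part of this PR's API):

    # assumed inputs: nodelist (e.g. obtained via pilot.node_list) and a list
    # of (task_id, priority, rank_requirements) tuples built by the application
    pending = sorted(requests, key=lambda r: r[1], reverse=True)

    scheduled = dict()   # task_id -> slots, so no reliance on any ordering
    for task_id, _, rr in pending:
        slots = nodelist.find_slots(rr)
        if slots:
            scheduled[task_id] = slots

    # launch the scheduled tasks, then hand the slots back via
    # nodelist.release_slots(slots) once they are done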

If you want to rely on the existing agent scheduler, then this is out of scope for now I am afraid: that scheduler prefers large tasks to increase overall resource utilization. If your use case is strong enough to warrant the implementation of a different scheduling algorithm, we would need to discuss in the group how we can find resources to implement that.

ping @mturilli

andre-merzky commented 4 months ago

It might be nice if the names were a bit more verbose. Instead of def find_slots(self, rr: RankRequirements, n_slots:int = 1) -> List[Slot]: it might be nicer to use more fully qualified names like def find_slots(self, rank_reqs: RankRequirements, num_slots:int = 1) -> List[Slot]: In particular it would be nice to have the slot members not be [_RO] but instead just use the class name so that one does not need to look at the source to try to find out what an _RO is.

Thanks for that feedback. I am going to change rr. We use the n_ prefix for counters elsewhere, so I would like to keep that consistent. _RO is a private shortcut for RankOccupation whose only purpose is to keep the code a bit more readable - the full name is still what should be exposed in the API documentation.
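For illustration, the alias is roughly nothing more than:

    # module-private shorthand, used only to keep the scheduler code readable;
    # the documented type remains the full class name
    _RO = RankOccupation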

ejjordan commented 4 months ago

If you do application level scheduling, then the order is up to you, really - i.e., if you want to prefer small tasks, then you can schedule small tasks first.

If you want to rely on the existing agent scheduler, then this is out of scope for now I am afraid: that scheduler prefers large tasks to increase overall resource utilization. If your use case is strong enough to warrant the implementation of a different scheduling algorithm, we would need to discuss in the group how we can find resources to implement that.

Since I am doing the scheduling myself, I was really only asking for a way to influence the set of tasks I get back in the case that there are more tasks than available resources. A simple algorithm of just picking the tasks with the highest priorities to return resources for would be sufficient. This leaves the actual decisions on whether, e.g., small or big tasks should run first at least somewhat in the hands of a DAG author. In particular I am interested in investigating how to tune scheduling to make a pile of tasks run as fast as possible, so some way to influence the task order is a useful starting point.

Like I say though, this is not a blocker for me. I will play around with this on my own and report back on anything interesting that I find.

andre-merzky commented 4 months ago

IMHO the best way forward would be to overload nodelist.find_slots() with a respective task weighting algorithm.
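Purely as a sketch (this overload does not exist in the PR, and the names are made up), it could look something like:

    from typing import Callable, Dict, List, Tuple

    # hypothetical variant: take several (task_id, requirements) pairs plus a
    # weighting callable and return the allocations that could be satisfied
    def find_slots_weighted(self,
                            requests: List[Tuple[str, 'RankRequirements']],
                            weight: Callable[['RankRequirements'], float],
                           ) -> Dict[str, List['Slot']]:
        allocations = dict()
        for task_id, rr in sorted(requests, key=lambda r: weight(r[1]),
                                  reverse=True):
            slots = self.find_slots(rr)
            if slots:
                allocations[task_id] = slots
        return allocations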

ejjordan commented 4 months ago

This really depends on the specific task types - and short of scheduling the tasks, there is really no way of answering that question reliably.

I don't need a foolproof solution, since I will anyway check that there are actually resources available before actually scheduling the tasks. It would just be useful to get a list of potentially schedulable tasks to send to the part of the code that actually schedules the tasks.

Again, this is not a blocker but a nice-to-have. I will also play around with this and report back any findings.

andre-merzky commented 4 months ago

I understand that these issues are non-blocking - they are interesting to discuss anyway :-)

I will anyway check that there are actually resources available before actually scheduling the tasks.

This is something I do not understand: checking for resources for a task has exactly the same cost as scheduling that task, AFAICT. I think I misunderstand why you want to do the check first, maybe?

ejjordan commented 4 months ago

This is something I do not understand: checking for resources for a task has exactly the same cost as scheduling that task, AFAICT. I think I misunderstand why you want to do the check first, maybe?

This is not about performance but about separation of concerns. I want to check how many tasks I might run in the heartbeat and run whichever tasks I can lower down in the call stack. In particular I would like to be able to check how many tasks I might run without then having to unschedule the tasks that I have allocated resources for.

N.B. this is as much about my own mental model of how to do things in a way that seems intuitive as about ease of coding, so perhaps what needs a rewrite is my mental model and not the code.
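To make that concrete, here is a minimal sketch of the split I have in mind, using the find_slots/release_slots calls from this PR (heartbeat_capacity and max_tasks are just illustrative names):

    def heartbeat_capacity(nodelist, rr, max_tasks):
        # dry run in the heartbeat: find slots only to count how many tasks
        # would currently fit, then hand everything back right away
        found = list()
        while len(found) < max_tasks:
            slots = nodelist.find_slots(rr)
            if not slots:
                break
            found.append(slots)
        for slots in found:
            nodelist.release_slots(slots)
        return len(found)

    # lower down in the call stack: schedule and launch that many tasks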

andre-merzky commented 4 months ago

Thanks for clarifying, that helps!

andre-merzky commented 4 months ago

@mtitov: the last commits recover the jsrun GPU sharing by re-using the previous scheduler semantics when jsrun is being used.

codecov[bot] commented 4 months ago

Codecov Report

Attention: Patch coverage is 39.66143% with 499 lines in your changes missing coverage. Please review.

Project coverage is 43.51%. Comparing base (61a3bbb) to head (dd04a6e).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../radical/pilot/agent/scheduler/continuous_jsrun.py | 0.00% | 208 Missing :warning: |
| src/radical/pilot/resource_config.py | 31.37% | 175 Missing :warning: |
| src/radical/pilot/utils/prof_utils.py | 0.00% | 27 Missing :warning: |
| src/radical/pilot/agent/scheduler/continuous.py | 78.75% | 17 Missing :warning: |
| src/radical/pilot/utils/misc.py | 73.01% | 17 Missing :warning: |
| src/radical/pilot/session.py | 0.00% | 12 Missing :warning: |
| src/radical/pilot/agent/scheduler/base.py | 69.69% | 10 Missing :warning: |
| ...c/radical/pilot/agent/scheduler/continuous_colo.py | 0.00% | 6 Missing :warning: |
| src/radical/pilot/agent/scheduler/hombre.py | 0.00% | 6 Missing :warning: |
| src/radical/pilot/task.py | 57.14% | 6 Missing :warning: |

... and 6 more

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##            devel    #3117      +/-   ##
==========================================
- Coverage   44.40%   43.51%   -0.90%
==========================================
  Files          95       96       +1
  Lines       10392    10951     +559
==========================================
+ Hits         4615     4765     +150
- Misses       5777     6186     +409
```


andre-merzky commented 3 months ago

Thanks for the input, @ejjordan. Some comments:

ejjordan commented 3 months ago

As far as the changes, I did try to keep them minimal:

1) Using dataclasses removed the need for some of the initialization code, which I then removed.
2) The comparison semantics are simplified.
3) Since dataclasses are not really meant to be used as classes with methods, I implemented the methods on NodeResources in NodeManager. The original call signatures could be regained by renaming NodeManager -> NodeResources and also coming up with a new name for that dataclass.

As far as where I found it non-intuitive, I think I recall (it was a few weeks ago, before I went on vacation, so I am not 100% sure) that it had to do with initialization and conversion between string and int. The ru.TypedDict class seems to lack the natural conversion of strings to ints? Also, having classes instead of dicts makes the initialization a little simpler and feels more pythonic (to me at least).

    # build one NodeManager per node, with all cores and GPUs initially free
    nodes = [
        NodeManager(
            NodeResources(
                index=i,
                name=self.node_names[i],
                cores=[
                    ResourceOccupation(index=core_idx, occupation=FREE)
                    for core_idx in range(self.tasks_per_node)
                ],
                gpus=[
                    ResourceOccupation(index=gpu_idx, occupation=FREE)
                    for gpu_idx in range(self.gpus_per_node)
                ],
                mem=self.mem_per_node,
            )
        )
        for i in range(self.num_nodes)
    ]

    self.nodes_list = NodeList(nodes=nodes)

As far as the tests, if I run pytest in radical.pilot/tests/unit_tests/test_scheduler/ I get the following.

    FAILED test_base.py::TestBaseScheduling::test_change_slot_states - TypeError: list indices must be integers or slices, not str
    FAILED test_base.py::TestBaseScheduling::test_try_allocation - AssertionError: Lists differ: [[{'node_name': 'node-0000', 'node_index':[69 chars]None] != [{'node_name': 'node-0000', 'node_index': [61 chars]128}]
    FAILED test_continuous.py::TestContinuous::test_find_resources - TypeError: Continuous._find_resources() got an unexpected keyword argument 'n_slots'
    FAILED test_continuous.py::TestContinuous::test_schedule_task - KeyError: 'node_id'
    FAILED test_continuous.py::TestContinuous::test_scheduling - KeyError: 'resources'
    FAILED test_continuous.py::TestContinuous::test_unschedule_task - AttributeError: module 'radical.pilot.utils' has no attribute 'convert_slots'
    ============================================================================================== 6 failed, 7 passed, 1 warning in 0.24s

andre-merzky commented 3 months ago

Tests: that is really strange, as the test suite passes for this PR. Is there any difference in the RCT stack used? If not, do you see a way I can reproduce this?

andre-merzky commented 3 months ago

About intuitive creation: the nodelist and node resources are not really expected to be created by the user, but instead you would obtain that from a pilot via pilot.node_list. How would you otherwise use a nodelist created manually?

ejjordan commented 3 months ago

the nodelist and node resources are not really expected to be created by the user, but instead you would obtain that from a pilot via pilot.node_list. How would you otherwise use a nodelist created manually?

I decided to manually create the node_list because I found that easier than trying to understand how to create a ResourceManager or its Slurm variant. I think the tests are not catching the fact that Slurm(cfg=None, log=None, prof=None) does not work, because it seems that the Slurm class is only mocked. Both a log (trivial) and a cfg are required - the cfg is where I gave up, after spending maybe half an hour trying to figure out how to create one. It was simply easier to implement my own version of ResourceManager and continue trying to make progress with that. Indeed, I think I spent about as much time writing the code I posted above as I have trying to figure out how to use ResourceManager.

ejjordan commented 3 months ago

The failing tests turned out to be my mistake: I had not rebuilt radical.pilot, and an editable install does not work for me for some reason.

andre-merzky commented 3 months ago

Indeed, I think I spent about as much time writing the code I posted above as I have trying to figure out how to use ResourceManager.

I am not surprised by that, really - the ResourceManager is a support class which lives deep inside the pilot agent and is thus very far removed from what we expose to the end user and to the application level - it is indeed not expected to be used directly at all. Its unit tests are tailored to ensure the RM is working in the environment it is being used in (the pilot agent), and isolating the RM on that level would be a non-trivial exercise indeed.

I continue to be confused about why you attach code, examples or tests to internal classes of RP, though. Not that I mind - it's open source and I appreciate the feedback - but I would fully expect that to be a time-consuming exercise (as you realized) with little to show for it in terms of integration with airflow or facts: those classes are just not part of the API surface I would expect to be useful for that integration. For airflow specifically I would expect that the API being used is on the level of the PilotManager and TaskManager, and AFAICT that is still what is being used for the radical_local_executor in airflow. How is the rp.ResourceManager entering the picture here?

ejjordan commented 3 months ago

For airflow specifically I would expect that the API being used is on the level of the PilotManager and TaskManager, and AFAICT that is still what is being used for the radical_local_executor in airflow. How is the rp.ResourceManager entering the picture here?

Here is the MR I am working on that uses the code I posted above: https://github.com/ejjordan/airflowHPC/pull/17 I have not attempted to use the PilotManager or TaskManager because all I really need is a way to keep track of which resources are occupied. For this purpose, it seems like something at the scope of the ResourceManager is the right tool for the job, but I am happy to hear suggestions to the contrary.

andre-merzky commented 3 months ago

Here is the MR I am working on that uses the code I posted above: ejjordan/airflowHPC#17

So you are not using the rp.PilotManager. That implies that you will not launch a pilot. Looking at the code, it seems like you intend to create one worker per node and then use popen to run the tasks (which presumably are all single-core then?) - is that correct?

Well, in this case nothing I said is of value: if you are not really using RP but just want to pluck some ideas from its codebase then, sure, by all means, you are free to do so, and if the classes you create serve that purpose, that's great. But in that case the resulting code should not have an impact on this RP pull request.

ejjordan commented 3 months ago

Having a usable ResourceManager class would be helpful for me, especially since I am only testing on one system that uses slurm. It would be very nice to automatically manage resources on different kinds of systems.

It was not my intention to ruffle any feathers with my suggestion, but to try to upstream some code that I considered slightly improved, or at the very least to spark a discussion about mental/programming models. I am sorry if I caused any upset.

andre-merzky commented 3 months ago

Oh no worries, I am not upset at all! :-) It is sometimes difficult for me to convey that kind of information via chat, so please accept my apologies, I did not mean to be hostile. The discussion is interesting! It is just that I have some trouble scoping it: is it feedback for this PR, how do the changes relate to what radical.pilot needs, is this a discussion of code which lives outside of RCT, what is the use case, etc.? That's all a bit convoluted to me.

And I am fairly protective of the RCT code base, that too ;-) For example, the change from ru.TypedDict to dataclasses has so many implications for the whole stack that I would really need to see benefits for the whole stack (safety, performance, simplicity) before considering whether that is something we would want to invest time in. If that code lives in ScaleMS or airflow, that discussion is very different...

andre-merzky commented 2 months ago

@andre-merzky I have spotted some misses (an important one in the scheduler module)

p.s., btw there are a lot of logs - when the log level is set to DEBUG, messages from log.debug_N are still collected

Hmm, that is strange:

    >>> import radical.utils as ru
    >>> log = ru.Logger('radical.utils', level='DEBUG', targets='-')
    >>> log.info('info')
    1719820132.595 : radical.utils        : 3407964 : 140205897220224 : INFO     : info

    >>> log.debug('debug')
    1719820140.305 : radical.utils        : 3407964 : 140205897220224 : DEBUG    : debug

    >>> log.debug_2('debug_2')
    >>>

Any idea what's happening with the logs on your end? Do you have an example of a log message which slips through?

mtitov commented 2 months ago

Any idea what's happening with the logs on your end? Do you have an example of a log message which slips through?

(*) while I was describing the "issue", I found the source of it :)

https://github.com/radical-cybertools/radical.pilot/blob/27f52a3e21c363025d02ff4cf2d5b81a7abe802e/src/radical/pilot/pmgr/launching/base.py#L866

seems like you've set it temporarily - just don't forget to revert it

andre-merzky commented 2 months ago

while I was describing the "issue", I found the source of it :)

Oops, sorry for that - fixed!

andre-merzky commented 2 months ago

colocation: I would like to leave this alone in this PR and open an issue - is that ok for you?

andre-merzky commented 2 months ago

@mtitov: I addressed the comments and the tests are passing again, so this should be ready for the next iteration :-) Thanks for your patience!

andre-merzky commented 1 month ago

LGTM!!

Wohoo!