Closed andre-merzky closed 1 month ago
@andre-merzky can we fix the tests pls?
@ejjordan are you happy with this PR? When you approve it we can merge it into devel
TODO: cover use case where two ranks use 1.5 GPUs each
TODO: for jsrun, shared GPUs should reside in the same resource set.
This meets my needs for now as is, and is thus fine to merge from my perspective. There are some improvements that I might like to see as follow-ups or perhaps here before merge though.
1) I still think it is useful to have the ability to query to see how many tasks, if any, can be run. This is useful in the case where I only want the heartbeat to determine how many tasks to try to schedule at a time, while some other function(ality) actually schedules and/or launches the tasks.
2) It would be nice to have a more descriptive way to specify the resource requests. In particular it would be useful to have a priority for requests so that I can for instance focus on getting many small tasks done vs one big one (or vice versa). This would probably also mean that resource requests should have some sort of ID. In the case that I query a bunch of tasks but only get back that I can schedule some of them, I need to know which tasks to schedule and cannot rely for instance on any ordering.
3) It might be nice if the names were a bit more verbose. Instead of
def find_slots(self, rr: RankRequirements, n_slots: int = 1) -> List[Slot]:
it might be nicer to use more fully qualified names like
def find_slots(self, rank_reqs: RankRequirements, num_slots: int = 1) -> List[Slot]:
In particular it would be nice to have the slot members not be [_RO] but instead just use the class name, so that one does not need to look at the source to try to find out what an _RO is.
But these are just suggestions for improvements, not requirements from my end, so take them for what they are.
I still think it is useful to have the ability to query to see how many tasks, if any, can be run. This is useful in the case where I only want the heartbeat to determine how many tasks to try to schedule at a time, while some other function(ality) actually schedules and/or launches the tasks.
This really depends on the specific task types - and short of scheduling the tasks, there is really no way of answering that question reliably.
Assuming you have a single task type, you can count the number of tasks which can be scheduled with this piece of code:
allocs = list()
i      = 0
while True:
    slots = nodelist.find_slots(rr, ranks=ranks_per_task)
    if slots:
        allocs.append(slots)
        i += 1
    else:
        print('no slots for task %d' % i)
        break

# release the slots again so that the tasks can actually be scheduled later
for slots in allocs:
    nodelist.release_slots(slots)
It would be nice to have a more descriptive way to specify the resource requests. In particular it would be useful to have a priority for requests so that I can for instance focus on getting many small tasks done vs one big one (or vice versa). This would probably also mean that resource requests should have some sort of ID. In the case that I query a bunch of tasks but only get back that I can schedule some of them, I need to know which tasks to schedule and cannot rely for instance on any ordering.
If you do application level scheduling, then the order is up to you, really - i.e., if you want to prefer small tasks, then you can schedule small tasks first.
If you want to rely on the existing agent scheduler, then this is out of scope for now I am afraid: that scheduler prefers large tasks to increase overall resource utilization. If your use case is strong enough to warrant the implementation of a different scheduling algorithm, we would need to discuss in the group how we can find resources to implement that.
ping @mturilli
It might be nice if the names were a bit more verbose. Instead of
def find_slots(self, rr: RankRequirements, n_slots: int = 1) -> List[Slot]:
it might be nicer to use more fully qualified names like
def find_slots(self, rank_reqs: RankRequirements, num_slots: int = 1) -> List[Slot]:
In particular it would be nice to have the slot members not be [_RO] but instead just use the class name, so that one does not need to look at the source to try to find out what an _RO is.
Thanks for that feedback. I am going to change rr. We use the n_ prefix for counters elsewhere, so I would like to keep that consistent. _RO is a private shortcut for RankOccupation whose only purpose is to keep the code a bit more readable - the full name is still what should be exposed in the API documentation.
If you do application level scheduling, then the order is up to you, really - i.e., if you want to prefer small tasks, then you can schedule small tasks first.
If you want to rely on the existing agent scheduler, then this is out of scope for now I am afraid: that scheduler prefers large tasks to increase overall resource utilization. If your use case is strong enough to warrant the implementation of a different scheduling algorithm, we would need to discuss in the group how we can find resources to implement that.
Since I am doing the scheduling myself, I was really only asking for a way to influence the set of tasks I get back in the case that there are more tasks than available resources. A simple algorithm of just picking the tasks with the highest priorities to return resources for would be sufficient. This leaves the actual decisions on whether, e.g., small or big tasks should run first at least somewhat in the hands of a DAG author. In particular I am interested in investigating how to tune scheduling to make a pile of tasks run as fast as possible, so some way to influence the task order is a useful starting point.
Like I say though, this is not a blocker for me. I will play around with this on my own and report back on anything interesting that I find.
IMHO the best way forward would be to overload nodelist.find_slots() with a respective task weighting algorithm.
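To illustrate what such application-level priority handling could look like, here is a minimal sketch. All names in it (ResourceRequest, schedule_by_priority, the core-count model) are hypothetical and not part of the RP API; the point is that requests carry an id and a priority, so the caller learns which requests were granted without relying on ordering:

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count()

@dataclass(order=True)
class ResourceRequest:
    # sort key is the negated priority, so sorted() yields high priority first
    sort_key: int = field(init=False, repr=False)
    priority: int
    cores: int
    req_id: int = field(default_factory=lambda: next(_ids))

    def __post_init__(self):
        self.sort_key = -self.priority


def schedule_by_priority(requests, free_cores):
    '''Greedily grant core counts to requests, highest priority first.

    Returns the ids of the granted requests, so the caller knows *which*
    tasks to launch and does not depend on the order of the input list.
    '''
    granted = []
    for req in sorted(requests):
        if req.cores <= free_cores:
            free_cores -= req.cores
            granted.append(req.req_id)
    return granted
```

With 8 free cores and three 4-core requests of priorities 1, 5 and 3, the two higher-priority requests are granted and their ids returned, while the priority-1 request is skipped.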
This really depends on the specific task types - and short of scheduling the tasks, there is really no way of answering that question reliably.
I don't need a foolproof solution, since I will anyway check that there are actually resources available before actually scheduling the tasks. It would just be useful to get a list of potentially schedulable tasks to send to the part of the code that actually schedules the tasks.
Again, this is not a blocker but a nice-to-have. I will also play around with this and report back any findings.
I understand that these issues are non-blocking - they are interesting to discuss anyway :-)
I will anyway check that there are actually resources available before actually scheduling the tasks.
This is something I do not understand: checking for resources for a task has exactly the same cost as scheduling that task, AFAICT. I think I misunderstand why you want to do the check first, maybe?
This is something I do not understand: checking for resources for a task has exactly the same cost as scheduling that task, AFAICT. I think I misunderstand why you want to do the check first, maybe?
This is not about performance but about separation of concerns. I want to check how many tasks I might run in the heartbeat and run whichever tasks I can lower down in the call stack. In particular I would like to be able to check how many tasks I might run without then having to unschedule the tasks that I have allocated resources for.
N.B. this is as much about my own mental model of how to do things in a way that seems intuitive as about ease of coding, so perhaps what needs a rewrite is my mental model and not the code.
Thanks for clarifying, that helps!
@mtitov: the last commits recover the jsrun GPU sharing by re-using the previous scheduler semantics when jsrun is being used.
Attention: Patch coverage is 39.66143% with 499 lines in your changes missing coverage. Please review.
Project coverage is 43.51%. Comparing base (61a3bbb) to head (dd04a6e).
Thanks for the input, @ejjordan . Some comments:
we don't use dataclasses - the ru.TypedDict has some features we use throughout the RCT stack, such as fast conversion to/from JSON and msgpack, optional runtime verification, and inheritance, and we are certain how they perform (quite some time was invested in optimizing their performance, as they are used very frequently throughout the code).
the RankRequirements comparison ops change the semantics needed for a partially ordered list - that in turn is needed for fast nodelist traversal by the scheduler and to optimize tight task placement.
the other changes I see are renaming NodeResources to NodeManager, adding an underscore to allocate_slot, and removing some init semantics. Did I miss something? You wrote that you find it more intuitive to use. What makes it more intuitive in your opinion?
the most important point though: you mentioned that the implementation failed your tests. Could you please provide those tests? I would like to check how / why those failed.
Thanks, Andre.
As far as the changes, I did try to keep them minimal.
1) Using dataclasses removed the need for some of the initialization code, which I then removed.
2) The comparison semantics were simplified.
3) Since dataclasses are not really meant to be used as classes with methods, I implemented the methods on NodeResources in NodeManager. The original call signatures could be regained by renaming NodeManager -> NodeResources and also coming up with a new name for that dataclass.
As far as where I found it non-intuitive, I think I recall (it was a few weeks ago, before I went on vacation, so I am not 100% sure) that it had to do with initialization and conversion between string and int. The ru.TypedDict class seems to lack the natural conversion of strings to ints?
Also, having classes instead of dicts makes the initialization a little simpler and feel more pythonic (to me at least).
nodes = [
    NodeManager(
        NodeResources(
            index=i,
            name=self.node_names[i],
            cores=[
                ResourceOccupation(index=core_idx, occupation=FREE)
                for core_idx in range(self.tasks_per_node)
            ],
            gpus=[
                ResourceOccupation(index=gpu_idx, occupation=FREE)
                for gpu_idx in range(self.gpus_per_node)
            ],
            mem=self.mem_per_node,
        )
    )
    for i in range(self.num_nodes)
]
self.nodes_list = NodeList(nodes=nodes)
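For reference, the string-to-int conversion mentioned above is something a dataclass can provide with a small __post_init__ hook. This is a generic sketch under an assumed class name (NodeSpec is not an RP class), showing how string inputs, e.g. values read from a config file, can be coerced to the annotated integer types:

```python
from dataclasses import dataclass

@dataclass
class NodeSpec:
    # hypothetical example class, not part of radical.pilot
    name: str
    cores: int
    mem: int

    def __post_init__(self):
        # coerce string inputs into the annotated integer types
        self.cores = int(self.cores)
        self.mem = int(self.mem)
```

With this, NodeSpec('node-0000', '128', '256') yields integer cores and mem fields without any caller-side conversion.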
As far as the tests, if I run pytest in radical.pilot/tests/unit_tests/test_scheduler/ I get the following.
FAILED test_base.py::TestBaseScheduling::test_change_slot_states - TypeError: list indices must be integers or slices, not str
FAILED test_base.py::TestBaseScheduling::test_try_allocation - AssertionError: Lists differ: [[{'node_name': 'node-0000', 'node_index':[69 chars]None] != [{'node_name': 'node-0000', 'node_index': [61 chars]128}]
FAILED test_continuous.py::TestContinuous::test_find_resources - TypeError: Continuous._find_resources() got an unexpected keyword argument 'n_slots'
FAILED test_continuous.py::TestContinuous::test_schedule_task - KeyError: 'node_id'
FAILED test_continuous.py::TestContinuous::test_scheduling - KeyError: 'resources'
FAILED test_continuous.py::TestContinuous::test_unschedule_task - AttributeError: module 'radical.pilot.utils' has no attribute 'convert_slots'
============================================================================================== 6 failed, 7 passed, 1 warning in 0.24s
Tests: that is really strange, as the test suite passes for this PR. Is there any difference in the RCT stack used? If not, do you see a way how I can reproduce this?
About intuitive creation: the nodelist and node resources are not really expected to be created by the user; instead you would obtain that from a pilot via pilot.node_list. How would you otherwise use a nodelist created manually?
the nodelist and node resources are not really expected to be created by the user, but instead you would obtain that from a pilot via pilot.node_list. How would you otherwise use a nodelist created manually?
I decided to manually create the node_list because I found that easier than trying to understand how to create a ResourceManager or its Slurm variant. I think the tests are not catching the fact that Slurm(cfg=None, log=None, prof=None) does not work, because it seems that the Slurm class is only mocked. It needs both a log (trivial) and a cfg (this is where I gave up, after spending maybe half an hour trying to figure out how to create a cfg). It was simply easier to implement my own version of ResourceManager and continue trying to make progress with that. Indeed, I think I spent about as much time writing the code I posted above as I have trying to figure out how to use ResourceManager.
The tests failing turned out to be my mistake. I had not rebuilt radical.pilot and editable install does not work for me for some reason.
Indeed, I think I spent about as much time writing the code I posted above as I have trying to figure out how to use ResourceManager.
I am not surprised by that, really - the ResourceManager is a support class which lives deep within the pilot agent and is thus very far removed from what we expose to the end user and to the application level - it is indeed not expected to be used directly at all. Its unit tests are tailored to ensure the RM is working in the environment it is being used in (the pilot agent), and isolating the RM on that level would be a non-trivial exercise indeed.
I continue to be confused about why you attach code, examples or tests to internal classes of RP though. Not that I mind - it's open source and I appreciate the feedback - but I would fully expect that to be a time-consuming exercise (as you realized) with little to show for it in terms of integration with airflow or facts - those classes are just not part of the API surface I would expect to be useful for that integration. For airflow specifically I would expect that the API being used is on the level of the PilotManager and TaskManager, and AFAICT that is still what is being used for the radical_local_executor in airflow. How is the rp.ResourceManager entering the picture here?
For airflow specifically I would expect that the API being used is on the level of the PilotManager and TaskManager, and AFAICT that is still what is being used for the radical_local_executor in airflow. How is the rp.ResourceManager entering the picture here?
Here is the MR I am working on that uses the code I posted above: https://github.com/ejjordan/airflowHPC/pull/17 I have not attempted to use the PilotManager or TaskManager because all I really need is a way to keep track of which resources are occupied. For this purpose, it seems like something at the scope of the ResourceManager is the right tool for the job, but I am happy to hear suggestions to the contrary.
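As a toy illustration of the kind of bookkeeping described here ("keep track of which resources are occupied"), a minimal tracker could look like the sketch below. All names are hypothetical and unrelated to rp.ResourceManager; it only models free cores per node:

```python
class CoreTracker:
    '''Toy bookkeeping of free cores per node; not an RP class.'''

    def __init__(self, cores_per_node, num_nodes):
        # map node index -> number of currently free cores
        self._free = {i: cores_per_node for i in range(num_nodes)}

    def acquire(self, cores):
        '''Return the index of a node with enough free cores, or None.'''
        for node, free in self._free.items():
            if free >= cores:
                self._free[node] -= cores
                return node
        return None

    def release(self, node, cores):
        '''Return cores to the given node's free pool.'''
        self._free[node] += cores
```

A caller would acquire() before launching a task and release() when it completes; acquire() returning None plays the role of "no slots available".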
Here is the MR I am working on that uses the code I posted above: ejjordan/airflowHPC#17
So you are not using the rp.PilotManager. That implies that you will not launch a pilot. Looking at the code, it seems like you intend to create one worker per node and then use popen to run the tasks (which presumably are all single core then?) - is that correct?
Well, in this case, nothing I said is of value - if you are not really using RP but just want to pluck some ideas from its codebase, sure, by all means, you are free to do so, and if the classes you create are serving that purpose, that's great. But in that case the resulting code should not have an impact on this RP pull request.
Having a usable ResourceManager class would be helpful for me, especially since I am only testing on one system that uses slurm. It would be very nice to automatically manage resources on different kinds of systems.
It was not my intention to ruffle any feathers with my suggestion, but to try to upstream some code that I considered slightly improved, or at the very least to spark a discussion about mental/programming models. I am sorry if I caused any upset.
Oh no worries, I am not upset at all! :-) It is sometimes difficult for me to convey that kind of information via chat, so please accept my apologies, I did not mean to be hostile. The discussion is interesting! It is just that I have some trouble scoping it: is it feedback for this PR, how do the changes relate to what radical.pilot needs, is it a discussion of code which lives outside of RCT, what is the use case, etc.? That's all a bit convoluted to me.
And I am fairly protective of the RCT code base, that too ;-) For example, the change of ru.TypedDict to dataclasses has so many implications for the whole stack that I would really need to see benefits for the whole stack (safety, performance, simplicity) before entertaining the possibility that this is something we would want to invest time in. If that code lives in ScaleMS or airflow, that discussion is very different...
@andre-merzky I have spotted some misses (an important one in the scheduler module).
p.s., btw, there are a lot of logs - when the log level is set to DEBUG, messages from log.debug_N are still collected.
Hmm, that is strange:
>>> import radical.utils as ru
>>> log = ru.Logger('radical.utils', level='DEBUG', targets='-')
>>> log.info('info')
1719820132.595 : radical.utils : 3407964 : 140205897220224 : INFO : info
>>> log.debug('debug')
1719820140.305 : radical.utils : 3407964 : 140205897220224 : DEBUG : debug
>>> log.debug_2('debug_2')
>>>
Any idea what's happening with the logs on your end? Do you have an example of a log message which slips through?
Any idea what's happening with the logs on your end? Do you have an example of a log message which slips through?
(*) while I was describing the "issue", I found the source of it :)
It seems like you've set it temporarily, just don't forget to revert it back.
while I was describing the "issue", I found the source of it :)
Oops, sorry for that - fixed!
colocation: I would like to leave this alone in this PR and open an issue - is that ok for you?
@mtitov: I addressed the comments and tests are passing again, so this should be ready for the next iteration :-) Thanks for your patience!
LGTM!!
Wohoo!
This implements #3108, #2974 and #2973