mesos / hadoop

Hadoop on Mesos

Idle TaskTrackers #33

Closed tarnfeld closed 9 years ago

tarnfeld commented 10 years ago

This pull request introduces the ability to revoke slots from a running TaskTracker once it becomes idle. It contributes to solving #32 (map/reduce slot deadlock), as the cluster can remove idle slots and launch more when needed, avoiding a deadlock situation (when resources are available).

TLDR; The JobTracker / Mesos Framework is able to launch and kill map and reduce slots in the cluster as they become idle, to make better use of those resources.

Essentially, what we've done here is separate the TaskTracker process from the slots. This means launching two Mesos tasks attached to the same executor: one for the TaskTracker (a task with potentially no resources) and another that can be killed to free up Mesos resources while keeping the TaskTracker itself alive. We attach the resources for the "revocable" slots to the second task, reserving the ability to free up those resources later on.
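
To make the two-tasks-on-one-executor idea concrete, here's a rough sketch of what the launch might look like with the Mesos Java API. The task names, resource figures, and helper method are illustrative assumptions, not the actual code in this branch.

```java
import java.util.Arrays;
import org.apache.mesos.Protos.ExecutorInfo;
import org.apache.mesos.Protos.Offer;
import org.apache.mesos.Protos.Resource;
import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskInfo;
import org.apache.mesos.Protos.Value;
import org.apache.mesos.SchedulerDriver;

public final class TwoTaskLaunchSketch {

    static Resource scalar(String name, double value) {
        return Resource.newBuilder()
            .setName(name)
            .setType(Value.Type.SCALAR)
            .setScalar(Value.Scalar.newBuilder().setValue(value))
            .build();
    }

    static void launchTracker(SchedulerDriver driver, Offer offer, ExecutorInfo executor,
                              double slotCpus, double slotMem) {
        // Task 1: the TaskTracker itself, with only a token amount of resources.
        TaskInfo tracker = TaskInfo.newBuilder()
            .setName("tasktracker")
            .setTaskId(TaskID.newBuilder().setValue("tracker_" + offer.getHostname()))
            .setSlaveId(offer.getSlaveId())
            .setExecutor(executor)                       // both tasks share one executor
            .addResources(scalar("cpus", 0.25))
            .addResources(scalar("mem", 512))
            .build();

        // Task 2: carries the resources for the "revocable" slots. Killing this
        // task later frees those resources while the executor (and with it the
        // TaskTracker process) stays alive.
        TaskInfo slots = TaskInfo.newBuilder()
            .setName("tasktracker_slots")
            .setTaskId(TaskID.newBuilder().setValue("slots_" + offer.getHostname()))
            .setSlaveId(offer.getSlaveId())
            .setExecutor(executor)
            .addResources(scalar("cpus", slotCpus))
            .addResources(scalar("mem", slotMem))
            .build();

        driver.launchTasks(offer.getId(), Arrays.asList(tracker, slots));
    }
}
```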

Note: This is kind of hacky, and is nowhere near ready for testing in production yet. Work in progress!

How does it work?

Given the use case, I am only dealing with the situation where a running TaskTracker is completely idle. For example, if we launch 10 task trackers with only map slots and 5 with only reduce slots, then while the reduce phase is running the map slots (and the resources associated with them) can become completely idle. These slots can be killed, as long as the TaskTracker stays alive to serve map output to the reducers. Hadoop also seems to cope perfectly fine with TaskTrackers that have zero slots.

If we kill all map slots, we introduce a potential failure case where a node serving map data fails and there are no map slots left to re-compute its output. This is mitigated by only revoking a percentage of the map slots from each TaskTracker (remaining = max(slots - (slots * 0.9), 1) by default).
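
For reference, that default works out roughly like this (a tiny sketch; the exact rounding is my shorthand, not necessarily what the branch does):

```java
public final class SlotRevocationSketch {
    /** Map slots to keep on an idle TaskTracker; everything else is revoked. */
    static int slotsToKeep(int slots) {
        // Keep roughly 10% of the configured slots, but always at least one,
        // so some capacity remains to re-compute lost map output.
        return Math.max(slots - (int) (slots * 0.9), 1);
    }

    public static void main(String[] args) {
        System.out.println(slotsToKeep(20)); // 2 kept, 18 revoked
        System.out.println(slotsToKeep(5));  // 1 kept, 4 revoked
    }
}
```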

Once a TaskTracker is alive, we check the "idleness" of its slots every 5 seconds; if the whole TaskTracker has had no occupied slots for 5 consecutive checks, we revoke its slots on the next round. In other words, the whole task tracker currently has to be idle for about 30 seconds before its slots are revoked.
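
Conceptually the idle check behaves like the sketch below. TrackerState, the method names, and the killTask comment are hypothetical stand-ins; the branch may structure this differently.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

interface TrackerState {
    int occupiedSlots();      // slots currently running map/reduce tasks
    void revokeIdleSlots();   // e.g. kill the "slots" task via driver.killTask(...)
}

class IdleCheck implements Runnable {
    private final TrackerState tracker;
    private int idleChecks = 0;

    IdleCheck(TrackerState tracker) { this.tracker = tracker; }

    @Override
    public void run() {
        if (tracker.occupiedSlots() > 0) {
            idleChecks = 0;                // any activity resets the counter
            return;
        }
        if (++idleChecks > 5) {            // ~30 seconds of 5-second idle checks
            tracker.revokeIdleSlots();
            idleChecks = 0;
        }
    }

    static void schedule(TrackerState tracker) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(new IdleCheck(tracker), 5, 5, TimeUnit.SECONDS);
    }
}
```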

I'd be very interested to hear what the community thinks of this solution. There's no doubt something obvious I've missed, but the idea is worth discussing.

brndnmtthws commented 10 years ago

Looks like a good start!

tarnfeld commented 10 years ago

So this is working pretty well now, after various tweaks and fixes. I ran a small job on a small cluster to help force Hadoop into the situations highlighted, and the new behaviour is very apparent.

I first launched a map heavy job, then a reduce heavy job, together resulting in the entire cluster being allocated to Hadoop with 58 map slots and 2 reduce slots.

[Screenshot: map-phase-slots]

Once the first job finished and the map phase of the second job completed, we reached a point where Hadoop had 58 slots' worth of compute power it wasn't putting to good use. The "idle tracker" code kicks in quite quickly and after about 30 seconds revokes the slots from those task trackers (while keeping each tracker itself alive to serve map output), releasing those resources back to Mesos and keeping only a little behind to run the task tracker.

When those resources became available, Mesos offered them back to Hadoop, which chose to launch reducers with them (since there was now only demand for reduce slots), resulting in a very fluid map and reduce slot allocation.

[Screenshot: reduce-phase-slots]

brndnmtthws commented 10 years ago

This is great Tom, looks brilliant!

tarnfeld commented 10 years ago

Thanks @brndnmtthws. When I started rolling this out in our Docker environment (which now enforces the resource limits; I wasn't using isolation in the tests above, whoops) I ran into an issue that I'd overlooked. Depending on various timing factors, not all of the resources may have been given to the executor by the time it needs them (when the task tracker starts up). By this I mainly mean memory resources.

In some situations, a lot of executors were being OOM killed (which led me to this kernel bug). Spark solves this problem by assigning all memory to the executor and all CPU to the tasks: even when there are no tasks, the executor maintains its memory limit and isn't OOM killed. This also sidesteps the problem of not being able to change the JVM memory limits at runtime, since you never need to.
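
If we adopted that split here, it might look like the sketch below, where all memory is pinned to the executor so killing the slots task only returns CPU. Again, the names and figures are illustrative assumptions rather than what this branch currently does.

```java
import org.apache.mesos.Protos.ExecutorInfo;
import org.apache.mesos.Protos.Resource;
import org.apache.mesos.Protos.SlaveID;
import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskInfo;
import org.apache.mesos.Protos.Value;

public final class ResourceSplitSketch {
    static Resource scalar(String name, double value) {
        return Resource.newBuilder()
            .setName(name)
            .setType(Value.Type.SCALAR)
            .setScalar(Value.Scalar.newBuilder().setValue(value))
            .build();
    }

    // All memory goes to the executor, so the JVM keeps its cgroup limit even
    // after the slot tasks are killed (no OOM kills, no JVM resize needed).
    static ExecutorInfo withTrackerMemory(ExecutorInfo base, double memMb) {
        return base.toBuilder().addResources(scalar("mem", memMb)).build();
    }

    // The revocable "slots" task carries only CPU; killing it frees CPU alone.
    static TaskInfo cpuOnlySlotsTask(String id, SlaveID slaveId, ExecutorInfo executor, double cpus) {
        return TaskInfo.newBuilder()
            .setName("tasktracker_slots")
            .setTaskId(TaskID.newBuilder().setValue(id))
            .setSlaveId(slaveId)
            .setExecutor(executor)
            .addResources(scalar("cpus", cpus))
            .build();
    }
}
```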

Given the CPU/RAM ratio on our cluster, and the actual memory usage of our jobs, this feature will still be very beneficial. Even if task trackers with zero slots are allocated many GBs of memory they're not using, there is still plenty of free memory on the cluster to launch more task trackers, so we still get the behaviour I described above with the map/reduce job.

Note: This problem isn't really noticeable when limits aren't enforced with cgroups, because the JVM processes free up memory over time and simply share the entire host's memory space.

Thoughts?

brndnmtthws commented 10 years ago

Yeah, that could certainly be done. I'd take it a step further and suggest setting the CPU/mem separately for the executor and the tasks. Since the TT treats all the tasks as a pool, you'd have to treat all the tasks as one giant task, with the TT separate.

tarnfeld commented 10 years ago

> Yeah, that could certainly be done. I'd take it a step further and suggest setting the CPU/mem separately for the executor and the tasks. Since the TT treats all the tasks as a pool, you'd have to treat all the tasks as one giant task, with the TT separate.

So that's kind of what's going on, though I think disk needs to also move over to the executor.

It's annoying that there's currently no reliable way to terminate an executor in Mesos. Having that ability would allow us to not use a task for the TaskTracker itself, just an executor. I think it should be adjusted to look like the following...

brndnmtthws commented 10 years ago

Hey Tom, how's this stuff going? We're thinking of doing something similar, too. Thoughts? Is there anything I can do to help out?

tarnfeld commented 10 years ago

Hey Brenden! That's great news. Let me just note down the current state of things; my time has been sucked up by some other stuff recently, so I haven't had a chance to finish this off.

TLDR; Some testing of what exists here would be really great, I hope to spend some time on it today or tomorrow, and implement the above ideas at least in a basic way.

brndnmtthws commented 10 years ago

Do you have some time to chat about it? Maybe tomorrow morning? (I guess that's afternoon your time?)

On Wed, Oct 1, 2014 at 12:23 AM, Tom Arnfeld notifications@github.com wrote:

Hey Brenden! That's great news. Let me just note down the current state of things, my time has been sucked up by some other stuff recently so not had a chance to finish this off.

  • I made the changes we discussed around memory / CPU. We now free up CPUs when task trackers become idle, but not memory, due to issues with the OOM killer and not being able to resize a running JVM.
  • Currently we rely on the ordering of the tasks being launched: we assume the task tracker is launched first and the slots second, but due to MESOS-1812 (https://issues.apache.org/jira/browse/MESOS-1812) that ordering isn't guaranteed. I started a thread on the mailing list about some kind of shutdownExecutor method, and it has been suggested to do the following...
    • I think we should watch for when the "slots" task launches and is picked up by the executor, and not send any tasks to the task tracker until this happens. This will guarantee we don't try to run tasks on an executor that doesn't have the right amount of resources.
    • There are various timing issues with the current implementation, and I think we need to make some adjustments to the task structure.

TLDR; Some testing of what exists here would be really great, I hope to spend some time on it today or tomorrow, and implement the above ideas at least in a basic way.


tarnfeld commented 9 years ago

@brndnmtthws Hey. I'm getting ready to merge this branch and push out a new release now. We've been running the tip of this branch for a month and have seen zero issues; it's also pretty memory efficient now.

I need to make sure the version numbers and docs are all updated first, but any objections?

brndnmtthws commented 9 years ago

No objections from me.

:ship: :it: