tensorflow / ecosystem

Integration of TensorFlow with other open-source frameworks
Apache License 2.0

Marathon/Mesos and docker #25

Closed bhack closed 7 years ago

bhack commented 7 years ago

Why is Docker still mandatory? See https://github.com/douban/tfmesos/issues/12

bhack commented 7 years ago

/cc @windreamer

windreamer commented 7 years ago

Hmm, it is a wild idea... maybe the most difficult part is how to map GPUs into different tasks?

bhack commented 7 years ago

/cc @yuefengz

bhack commented 7 years ago

Ping

windreamer commented 7 years ago

It seems Docker is not a must; we can just use the GPU allocator of Mesos to allocate resources. But experiments are still ongoing.

yuefengz commented 7 years ago

Sorry, I just got back from vacation last week and didn't see your issue. I will take a look shortly.

bhack commented 7 years ago

@yuefengz Thank you. @windreamer has already experimented with making the Docker dependency optional in tfmesos.

bhack commented 7 years ago

Obviously, I think that TPU resources are available only in the managed Google Cloud and not in DC/OS on Google Compute Engine, right?

bhack commented 7 years ago

/cc @klueska

klueska commented 7 years ago

It seems that the question is about GPU support in DC/OS? Is that correct?

GPUs will be supported in the upcoming DC/OS 1.9 release with the limitations outlined in this pull request: https://github.com/dcos/dcos/pull/766

The full documentation is not yet complete (it will be ready by 1.9 EA release coming out in early February though). Here is a preview: https://github.com/dcos/dcos-docs/blob/a20548c343a75258ea70799efb9d98c9c6aeeaf7/1.8/usage/gpu-support.md

bhack commented 7 years ago

@klueska Yes, because Docker is currently mandatory.

klueska commented 7 years ago

What do you mean by "Docker is mandatory"? Can you give a little more detail on the context?

DC/OS actually supports two different container runtimes (The Docker Container Runtime and the Universal Container Runtime), both of which are able to run docker containers. The Docker Container Runtime does not yet support GPUs, but the Universal Container Runtime does.

bhack commented 7 years ago

Yes, I know, and tfmesos supports both solutions, but see how it is integrated here: https://github.com/tensorflow/ecosystem/blob/master/marathon/README.md

klueska commented 7 years ago

I see. Looking at https://github.com/tensorflow/ecosystem/blob/master/marathon/template.json.jinja I see no reason that this can't be modified to run with the Universal Container Runtime instead of the Docker Container Runtime (a.k.a. the docker containerizer).

The only change would be to update the container type to "MESOS" and allocate it some "gpus: xxx" if desired.

In fact, that would be the recommended way of running it going forward.
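
For illustration, a minimal sketch of that change (the app id, command, image, and Marathon URL are all assumptions), expressed as a Marathon app definition submitted through Marathon's REST API:

```python
import requests

# Sketch of the modified app definition: container type "MESOS" selects the
# Universal Container Runtime, and GPUs are requested via the top-level
# "gpus" field (the two changes described above).
app = {
    "id": "/tensorflow/worker-0",        # hypothetical app id
    "cmd": "python /opt/worker.py",      # hypothetical entry point
    "cpus": 1.0,
    "mem": 4096,
    "gpus": 1,                           # optional GPU allocation
    "instances": 1,
    "container": {
        "type": "MESOS",                 # was "DOCKER" in template.json.jinja
        "docker": {"image": "tensorflow/tensorflow:latest-gpu"},
    },
}

# The Marathon endpoint below is an assumption; point it at your own master.
resp = requests.post("http://marathon.mesos:8080/v2/apps", json=app)
resp.raise_for_status()
```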

bhack commented 7 years ago

Yes, and that prompted my original question. I'm also curious about how new hardware devices like TPUs or other kinds of accelerators could be exposed in the Universal Container Runtime.

klueska commented 7 years ago

@windreamer Regarding: Hmm, it is a wild idea... maybe the most difficult part is how to map GPUs into different tasks?

You could consider launching all of the tasks as part of a task-group (aka Pod), in which case they would all share access to the same set of GPUs instead of having to allocate a different GPU to each of them.

klueska commented 7 years ago

@bhack Regarding:

Yes, and that prompted my original question. I'm also curious about how new hardware devices like TPUs or other kinds of accelerators could be exposed in the Universal Container Runtime.

That sounds like a great question for the mesos development list: http://mesos.apache.org/community/

There is nothing fundamental about how we do GPU allocation that couldn't be extended to TPUs.

bhack commented 7 years ago

@klueska Yes, but I don't have a TPU :) Only Google has this ASIC, and I think it will only be available in the managed cloud, so we cannot see it as a Mesos resource even on Google Cloud.

yuefengz commented 7 years ago

Thanks @klueska! When DC/OS 1.9 is available I will try it: update the container type to "MESOS" and allocate it some "gpus: xxx" if desired.

yuefengz commented 7 years ago

@bhack We will give more details about TPUs later, but for now TPUs are out of scope. We don't have to worry about supporting TPUs, especially in Mesos, in the near future.

windreamer commented 7 years ago

@bhack I am quite positive Docker is not necessary for a tfmesos cluster. Mesos uses the cgroup device whitelist to isolate GPUs between tasks, so when a GPU is allocated, only the task it is assigned to can use the device.

My assumption that Docker was a must was based on a guess: TensorFlow uses GPUs starting with id 0, and with nvidia-docker GPUs are automatically renumbered inside the container. But based on my experiments, it seems TensorFlow also does EnumerateDevices to discover all available GPUs, so renumbering GPU devices is not necessary.
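
A minimal sketch of the check I mean (TF 1.x API; nothing tfmesos-specific assumed), which lists the GPU devices TensorFlow enumerates in this task so you can verify what the task can actually use under the Mesos cgroup device whitelist:

```python
# List the devices TensorFlow enumerates for this process (TF 1.x API).
from tensorflow.python.client import device_lib

devices = device_lib.list_local_devices()
gpus = [d.name for d in devices if d.device_type == "GPU"]
print("Visible GPUs:", gpus)  # e.g. ['/gpu:0'] when a single GPU is allocated
```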

windreamer commented 7 years ago

But please note that most of our nodes have only one GPU, so my experiment may have been flawed.

bhack commented 7 years ago

@yuefengz Thank you. This means that TPUs will be available only on the managed Google Cloud for now, right?

jhseu commented 7 years ago

Note that we can't comment on any TPU plans :)

bhack commented 7 years ago

@jhseu OK, we will wait until you are ready to release more info. On another note, will XLA have any impact on Mesos?

jhseu commented 7 years ago

It's unlikely XLA will directly affect Mesos. The primary benefits of XLA are:

bhack commented 7 years ago

Hmm, probably on task migration XLA could need to emit new native code or use another op-fusion strategy. What do you think?

jhseu commented 7 years ago

XLA shouldn't affect task migration because each worker will JIT compile native code when ops are executed for the first time.
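
For example, with the TF 1.x session API the JIT is enabled per process, so each worker compiles its own kernels locally (a minimal sketch; nothing cluster-specific is assumed):

```python
import tensorflow as tf

# Enable XLA JIT for this worker's session (TF 1.x API); compilation happens
# locally, the first time the ops run, so no compiled code moves between tasks.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

with tf.Session(config=config) as sess:
    x = tf.constant([1.0, 2.0])
    print(sess.run(x * 2.0))  # kernels are JIT-compiled on first execution
```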

bhack commented 7 years ago

OK, so the JIT will run on each worker as required.

bhack commented 7 years ago

@klueska DC/OS 1.9 EA is released now, right? Could we update the container type in this repository's example and the related docs?

klueska commented 7 years ago

@bhack Yes, DC/OS 1.9 EA was released last night. Give it a shot and let me know if there are any problems.

bhack commented 7 years ago

@jhseu @yuefengz?

bhack commented 7 years ago

The patch is really trivial, but I want to ask the maintainers whether they have a cluster to test it on.

jhseu commented 7 years ago

Yeah, we can set up a cluster to test it, so feel free to send a pull request.

klueska commented 7 years ago

I can test it independently as well.

bhack commented 7 years ago

Ok please comment https://github.com/tensorflow/ecosystem/pull/38

bhack commented 7 years ago

@windreamer @klueska Any plans to push TensorFlow to the DC/OS Universe?

klueska commented 7 years ago

Ideally, yes. I am still trying to understand what exactly the architecture of this would look like though.

My current thinking is that we would have a package that gets a TensorFlow server up and running on every node in the cluster (or some subset of nodes if specified at package install time). A user would first install this package from the DC/OS Universe to get the servers up and running. They would then deploy separate Marathon apps to submit jobs against the TensorFlow servers.

I personally haven't played around enough with Tensorflow to know if this is possible (or even makes sense architecturally), but that is my current thinking.

Any feedback is welcome (and encouraged).

bhack commented 7 years ago

@klueska Take a look at TF distributed
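
For context, a rough sketch of what the TF 1.x distributed runtime expects on each node (host names, port, and job layout here are just assumptions): every node keeps a long-running tf.train.Server that Marathon-deployed clients could then submit graphs against.

```python
import tensorflow as tf

# Every node runs a long-lived TensorFlow server; the cluster spec below uses
# hypothetical host names and a conventional port.
cluster = tf.train.ClusterSpec({
    "ps": ["node1:2222"],                    # assumed parameter-server host
    "worker": ["node2:2222", "node3:2222"],  # assumed worker hosts
})

# Each node passes its own role and index, e.g. the first worker:
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()  # block and serve graph execution requests from clients
```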

bhack commented 7 years ago

This is also relevant: https://github.com/tensorflow/tensorflow/issues/2126

bhack commented 7 years ago

https://github.com/tensorflow/ecosystem/pull/38 is merged. I've opened https://github.com/mesosphere/universe/issues/1026 for the universe topic.

bhack commented 7 years ago

I want to ask everyone whether it makes sense to integrate more pymesos and tfmesos features here. What do you think?

bhack commented 7 years ago

Another possible alternative is https://github.com/daskos/mentos/ /cc @daskos

bhack commented 7 years ago

DC/OS 1.9 stable release is generally available

bhack commented 7 years ago

I see no activity here. Any further ideas or actions? Do we want to close this?

bhack commented 7 years ago

@klueska Is there any other feedback on how to improve the DC/OS/Mesos experience?

bhack commented 7 years ago

How can these new hints improve the Mesos demo? Could another example be created on a "real" dataset like ImageNet?

bhack commented 7 years ago

This thread is quite dead. Let's see if there is any interest from the @dcos team.

bhack commented 7 years ago

In the meantime https://dcos.io/blog/2017/tutorial-deep-learning-with-tensorflow-nvidia-and-apache-mesos-dc-os-part-1/index.html /cc @sascala

bhack commented 7 years ago

@jhseu After Google I/O, I suppose that TPUs are no longer related to Ecosystem because they are fully managed. Is this conclusion correct? If you adopt ecosystem/mesos or DC/OS, you are out of the TPU device offering.