scisprints / 2018_05_sklearn_skimage_dask

BSD 3-Clause "New" or "Revised" License

Algorithms for distributed training #4

Open mrocklin opened 6 years ago

mrocklin commented 6 years ago

If others have the time, I'm inclined to experiment a bit with algorithms for distributed training. I think this would be an interesting stress test of the technology, and would also raise some interest from external groups.

I'm inclined to try something like a parameter-server based SGD system. To do this I think that Dask needs to grow something like a low-latency inter-worker pub-sub system, which I'm happy to build beforehand.

Is this of sufficient interest that others would likely engage? If so, I welcome recommendations on papers to read, architectures to be aware of, and obstacles to avoid.
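To make that concrete, here is a very rough sketch of the shape I have in mind, written with plain Dask futures and a synchronous update round standing in for the asynchronous pub-sub piece that doesn't exist yet (the least-squares objective and all names below are just placeholders):

```python
# Very rough sketch: a synchronous stand-in for a parameter-server SGD loop,
# using plain Dask futures. A real system would push/pull updates
# asynchronously over the pub-sub mechanism discussed above.
import numpy as np
from dask.distributed import Client

def shard_gradient(shard, w):
    # Least-squares gradient on one worker's shard of the data.
    X, y = shard
    return X.T @ (X @ w - y) / len(y)

client = Client(processes=False)  # local cluster, just for illustration

X = np.random.randn(10_000, 20)
y = X @ np.random.randn(20)
shards = client.scatter([(X[i::4], y[i::4]) for i in range(4)])  # one future per shard

w = np.zeros(20)  # the "parameter server" state, held on the client here
lr = 0.1
for step in range(20):
    grads = client.gather([client.submit(shard_gradient, s, w) for s in shards])
    w -= lr * np.mean(grads, axis=0)
```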

mrocklin commented 6 years ago

cc @ogrisel @GaelVaroquaux @stsievert @amueller . Also, feel free to ping others who may be more involved in this topic or have more time to engage.

GaelVaroquaux commented 6 years ago

Well, as I mentioned IRL, my priority is to use scikit-learn in distributed settings on simple problems (for instance distributed grid-search of random forests) and to run benchmarks. I want to get a feel for what the bottlenecks are, and maybe address them.

The blocker for this task is getting access to distributed hardware where I can run benchmarks.
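To make the target concrete, the kind of benchmark I have in mind looks roughly like this (just a sketch using scikit-learn's joblib integration with a Dask backend; the dataset, grid, and scheduler address are placeholders):

```python
# Sketch: distributed grid-search of random forests, fanning the CV fits out
# to a Dask cluster through joblib (all values below are placeholders).
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

client = Client("scheduler-address:8786")  # or Client() for a local test cluster

X, y = make_classification(n_samples=50_000, n_features=50, random_state=0)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 30]},
    cv=5,
)

with joblib.parallel_backend("dask"):  # run the individual fits on the cluster
    grid.fit(X, y)
```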

NelleV commented 6 years ago

I'm working on the distributed hardware aspect. I think it might be worth me calling you sometime next week, with @yuvipanda in the room, to see if we can set something up. Do you think what Yuvi proposed in #3 would work? I can explore other options as well.

GaelVaroquaux commented 6 years ago

I'll be on vacation, offline, next week.

NelleV commented 6 years ago

Would a kubernetes cluster with ssh access work? If you tell me your ideal computing environment, I can try to set up something as close as possible to it.

GaelVaroquaux commented 6 years ago

> Would a kubernetes cluster with ssh access work?

I believe so. Using dask-kubernetes.

NelleV commented 6 years ago

OK. I'll try to make sure that we have something set up before you arrive, and I'll try to see whether Yuvi can join us the first day of the sprint (there's a jupyter-dev meeting the same week).

mrocklin commented 6 years ago

> Well, as I mentioned IRL, my priority is to use scikit-learn in distributed settings on simple problems (for instance distributed grid-search of random forests) and to run benchmarks. I want to get a feel for what the bottlenecks are, and maybe address them.

Agreed. I have this same goal. I think that we'll have enough people for enough time that we can address a few issues at the same time.

> The blocker for this task is getting access to distributed hardware where I can run benchmarks.

I don't anticipate that this will be a problem. I think the JupyterHub + Dask-Kubernetes setup we have now for http://pangeo.pydata.org/ will suffice for this group, and we can fairly easily set up something similar with whatever software environment we like. You could use the current pangeo deployment today if you install sklearn in your local environment and also add it to the EXTRA_PIP_PACKAGES environment variable in your worker-template.yaml file.
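For example, something roughly like this should work (a sketch only; the template path and package list are illustrative):

```python
# Sketch: start Dask workers on Kubernetes from a pod template that adds
# scikit-learn through EXTRA_PIP_PACKAGES (paths/values here are illustrative).
from dask.distributed import Client
from dask_kubernetes import KubeCluster

# worker-template.yaml would contain, among other things:
#   env:
#     - name: EXTRA_PIP_PACKAGES
#       value: scikit-learn
cluster = KubeCluster.from_yaml("worker-template.yaml")
cluster.scale(20)          # ask for 20 worker pods
client = Client(cluster)   # sklearn + dask now run against the cluster
```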

stsievert commented 6 years ago

> I'm inclined to try something like a parameter-server based SGD system.

There's interest here. I'd engage, and I'm interested in extensions. Vanilla SGD would definitely work, and there are a lot of useful SGD variants that build on it but rely on some very specific features (async reads/writes to the model vector, coding).

A good paper that walks through the design and implementation of a high-performance parameter server is "Scaling Distributed Machine Learning with the Parameter Server". Some of the features required for more particular algorithms are mentioned in sections 2 and 3 of "Communication Efficient Distributed Machine Learning with the Parameter Server", an extension of the previous work.
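To illustrate what "async reads/writes to the model vector" means in the simplest possible setting, here is a single-machine, Hogwild!-style toy (threads sharing one NumPy array without locks); a parameter server does the analogous thing across machines:

```python
# Toy illustration of lock-free ("Hogwild!"-style) SGD: several threads read
# and write one shared model vector without synchronization.
import threading
import numpy as np

X = np.random.randn(5000, 10)
y = X @ np.random.randn(10)

w = np.zeros(10)  # shared model vector, read and written without locks
lr = 0.01

def async_sgd(rows):
    for i in rows:
        xi, yi = X[i], y[i]
        g = (xi @ w - yi) * xi   # the read of w may already be stale
        w[:] -= lr * g           # unsynchronized in-place write

threads = [threading.Thread(target=async_sgd, args=(range(k, len(X), 4),))
           for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```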

> distributed grid-search of random forests

Adaptive hyperparameter tuning is related: https://github.com/dask/dask-ml/issues/161

amueller commented 6 years ago

It's probably relatively straightforward to ask for cloud credits if that's your blocker, @GaelVaroquaux. Google is usually generous, and I have some connections at Microsoft.

I don't have time to spend on this before SciPy. Also, I will be offline for two weeks starting tomorrow because I'm getting my tonsils removed.

amueller commented 6 years ago

also ping @jnothman I guess?

fabianp commented 6 years ago

Cc me. Interested and happy to help with whatever I can during the sprint.

yuvipanda commented 6 years ago

It looks like @mrocklin thinks pangeo.pydata.org is good enough here. LMK if that changes :)

mrocklin commented 6 years ago

> Cc me. Interested and happy to help with whatever I can during the sprint.

@fabianp are there topics in particular that you'd like to pursue together?

fabianp commented 6 years ago

Parameter-server based SGD sounds like a good starting point, but I'm open to other ideas that might come up. I've done some work on distributed/async methods, but I'm quite new to dask.

mrocklin commented 6 years ago

I've opened up a longer-term issue on the dask-ml tracker: https://github.com/dask/dask-ml/issues/171