mrocklin opened this issue 6 years ago
cc @ogrisel @GaelVaroquaux @stsievert @amueller . Also, feel free to ping others who may be more involved in this topic or have more time to engage.
Well, as I mentioned IRL, my priority is to use scikit-learn in distributed settings on simple problems (for instance distributed grid search of random forests) and to run benchmarks. I want to get a feel for where the bottlenecks are, and maybe address them.
The blocker for this task is getting access to distributed hardware where I can run benchmarks.
I'm working on the distributed hardware aspect. I think it might be worth me calling you sometime next week, with @yuvipanda in the room, to see if we can set something up. Do you think what Yuvi proposed in #3 would work? I can explore other options as well.
I'll be on vacation, offline, next week.
Would a Kubernetes cluster with SSH access work? If you tell me your ideal computing environment, I can try to set up something as close to it as possible.
> Would a Kubernetes cluster with SSH access work?
I believe so. Using dask-kubernetes.
OK. I'll try to make sure we have something set up before you arrive, and I'll see whether Yuvi can join us on the first day of the sprint (there's a jupyter-dev meeting the same week).
> Well, as I mentioned IRL, my priority is to use scikit-learn in distributed settings on simple problems (for instance distributed grid search of random forests) and to run benchmarks. I want to get a feel for where the bottlenecks are, and maybe address them.
Agreed. I have this same goal. I think that we'll have enough people for enough time that we can address a few issues at the same time.
> The blocker for this task is getting access to distributed hardware where I can run benchmarks.
I don't anticipate that this will be a problem. I think that the JupyterHub + Dask-Kubernetes setup we have now for http://pangeo.pydata.org/ will suffice for this group, and we can fairly easily set up something similar with a software environment that we like. You could use the current pangeo deployment today if you install sklearn in your local environment and also add it to the EXTRA_PIP_PACKAGES environment variable in your worker-template.yaml file.
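For reference, the relevant part of a worker template looks roughly like this (a sketch only; the exact fields in the pangeo deployment's worker-template.yaml may differ):

```yaml
# worker-template.yaml (sketch): the worker image's entrypoint pip-installs
# whatever is listed in EXTRA_PIP_PACKAGES at container startup.
kind: Pod
spec:
  containers:
    - name: dask-worker
      image: daskdev/dask:latest
      env:
        - name: EXTRA_PIP_PACKAGES
          value: scikit-learn
```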
> I'm inclined to try something like a parameter-server based SGD system.
There's interest here. I'd engage, and I'm interested in extensions. Vanilla SGD would definitely work, and there are a lot of useful SGD variants built on it that rely on some very specific features (async reads/writes to the model vector, coding).
A good paper that walks through the design and implementation of a high-performance parameter server is "Scaling Distributed Machine Learning with the Parameter Server". Some of the features required for the more particular algorithms are covered in sections 2 and 3 of "Communication Efficient Distributed Machine Learning with the Parameter Server", an extension of the previous work.
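As a strawman for what such a system computes, here is a toy single-process sketch of the pull/push cycle from that design (plain Python; the names and the synchronous round-robin loop are illustrative stand-ins, not Dask or parameter-server API):

```python
import random


class ParameterServer:
    """Toy in-process stand-in for the server role: workers pull() the
    current weights and push() gradients computed on their own shard."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.lr = lr

    def pull(self):
        return list(self.w)

    def push(self, grad):
        # In a real parameter server, pushes arrive asynchronously and
        # per-key; here we simply apply the whole gradient at once.
        for j, g in enumerate(grad):
            self.w[j] -= self.lr * g


def shard_gradient(w, shard):
    """Mean-squared-error gradient for y ~ w . x over one worker's shard."""
    grad = [0.0] * len(w)
    for x, y in shard:
        err = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / len(shard)
    return grad


def run(n_workers=4, steps=50, seed=0):
    # Synthetic noiseless linear data, split round-robin across workers.
    random.seed(seed)
    true_w = [3.0, -2.0]
    data = []
    for _ in range(200):
        x = [random.uniform(-1, 1) for _ in true_w]
        data.append((x, sum(a * b for a, b in zip(true_w, x))))
    shards = [data[i::n_workers] for i in range(n_workers)]

    ps = ParameterServer(len(true_w))
    for _ in range(steps):
        # Round-robin over workers stands in for asynchronous pushes.
        for shard in shards:
            ps.push(shard_gradient(ps.pull(), shard))
    return ps.w
```

The interesting distributed-systems work is exactly what this sketch elides: asynchrony, per-key communication, and consistency models for the reads/writes to the model vector.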
> distributed grid-search of random forests
Adaptive hyperparameter tuning is related: https://github.com/dask/dask-ml/issues/161
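For context, adaptive approaches like Hyperband are built around successive halving: evaluate many configurations cheaply, then repeatedly keep the best fraction with more budget each round. A toy sketch of that core loop, with an illustrative `evaluate(config, budget)` callback (not a dask-ml API):

```python
def successive_halving(configs, evaluate, budget=1, eta=2):
    """Keep the best 1/eta fraction of configurations each round,
    giving each survivor eta-times more budget in the next round."""
    survivors = list(configs)
    while len(survivors) > 1:
        # Score every surviving configuration at the current budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget),
                        reverse=True)          # higher score = better
        survivors = ranked[:max(1, len(survivors) // eta)]
        budget *= eta
    return survivors[0]
```

In a distributed setting the per-round evaluations are embarrassingly parallel, which is what makes this family of methods a natural fit for Dask.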
It's probably relatively straightforward to ask for cloud credits if that's your blocker, @GaelVaroquaux. Google is usually generous, and I have some connections at Microsoft.
I don't have time to spend on this before SciPy. Also, I will be offline for two weeks starting tomorrow because I'm getting my tonsils removed.
also ping @jnothman I guess?
Cc me. Interested and happy to help however I can during the sprint.
It looks like @mrocklin thinks pangeo.pydata.org is good enough here. LMK if that changes :)
> Cc me. Interested and happy to help however I can during the sprint.
@fabianp are there topics in particular that you'd like to pursue together?
Parameter-server based SGD sounds like a good starting point, but I'm open to other ideas that might come up. I've done some work on distributed/async methods, but I'm quite new to Dask.
I've opened up a longer-term issue on the dask-ml tracker: https://github.com/dask/dask-ml/issues/171
If others have the time, I'm inclined to experiment a bit with algorithms for distributed training. I think that this would be an interesting stress test of the technology, and would also, I think, raise some interest from external groups.
I'm inclined to try something like a parameter-server based SGD system. To do this I think that Dask needs to grow something like a low-latency inter-worker pub-sub system, which I'm happy to build beforehand.
Is this of sufficient interest that others would likely engage? If so, I welcome recommendations on papers to read, architectures to be aware of, and obstacles to avoid.
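The fan-out semantics of such a pub-sub system are small; here is a minimal in-process sketch using stdlib queues (illustrative only — the hard part Dask would need to solve is doing this with low latency across worker processes, not the fan-out logic itself):

```python
import queue
import threading


class PubSub:
    """Minimal in-process pub-sub: each subscriber gets its own queue,
    and publish() fans a message out to every current subscriber."""

    def __init__(self):
        self._topics = {}
        self._lock = threading.Lock()

    def subscribe(self, topic):
        q = queue.Queue()
        with self._lock:
            self._topics.setdefault(topic, []).append(q)
        return q

    def publish(self, topic, msg):
        with self._lock:
            subscribers = list(self._topics.get(topic, ()))
        for q in subscribers:
            q.put(msg)
```

In the SGD setting, workers would subscribe to a "parameters" topic and publish gradient or model updates to it; the queue-per-subscriber shape is one plausible design, not a description of any existing Dask feature.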