modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

How to achieve clustering in Modin and distributed machine learning #583

Closed Shellcat-Zero closed 2 years ago

Shellcat-Zero commented 5 years ago

A common scenario for my clustering operations is to have each node in my cluster (whether Dask workers or standalone Pandas nodes) operate on a subset of a very large dataframe. When I saw issue https://github.com/modin-project/modin/issues/418, I assumed the intended solution was to read chunks of a query result into each of the cluster nodes, but I then discovered that clustering in Modin is still under active development.

If there is a way to achieve clustering in Modin, even if it's a bit complicated, I was wondering if someone could explain how. It may be something I would be interested in contributing to. My most common clustering scenarios are on AWS EC2 and on local area networks.

In addition to clustering, I was wondering if there were any plans to extend Modin to scalable machine learning, similar to what Dask-ML has done alongside popular machine learning libraries like Scikit-Learn. Dask-ML has made great strides but still needs a lot of work; I almost always run into issues even running their examples, which require extensive modifications to work correctly (if at all). I was hoping it might be feasible to develop this in a way that benefits both Modin and Dask-ML. Leveraging parallelized or distributed machine learning with Modin is also something I would be interested in contributing to.

devin-petersohn commented 5 years ago

Great questions @Shellcat-Zero!

If there is a way to achieve clustering in Modin...

The best way to use Modin in a cluster is to start a Ray cluster: see the Ray cluster setup documentation.

Once you've started a Ray cluster, you can open a Python interpreter on one of the nodes and run:

import ray
ray.init(redis_address="<redis-address>")  # Information on redis address is in the docs linked above
import modin.pandas as pd

Modin will use the existing Ray cluster instead of starting a new local Ray instance, as long as you call ray.init with the cluster's Redis address before importing modin.pandas.
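As a quick sanity check (a minimal sketch; the Redis address is a placeholder and the tiny DataFrame is just for illustration), you can confirm that Modin picked up the existing cluster rather than a fresh local instance:

import ray
ray.init(redis_address="<redis-address>")

# cluster_resources() aggregates CPUs and memory across all nodes, so a
# CPU count larger than the head node's confirms the cluster connection.
print(ray.cluster_resources())

import modin.pandas as pd

# Any Modin operation from here on is scheduled on the Ray cluster.
df = pd.DataFrame({"a": range(1_000_000)})
print(df["a"].sum())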

The reason we say it is under active development is because we are still working to optimize some of the common use-cases (e.g. distributing a CSV file that exists in one of the machines in the cluster).

In addition to clustering, I was wondering if there were any plans to extend Modin to scalable machine learning...

We do have a strong interest in supporting distributed machine learning applications. At the moment, there are some exploratory efforts going on at UC Berkeley, but nothing is concrete or production ready.

Leveraging parallelized or distributed machine learning with Modin is also something I would be interested in contributing to.

We would love to have you contribute! With Modin, the way we started was with a general design and a simple proof of concept. In this case, it would make sense to pick an algorithm to serve as the proof of concept and start there.
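For illustration only (a hedged sketch; the column names and random data are hypothetical), a simple algorithm like closed-form linear regression can already be expressed with plain Modin reductions, which run in parallel across partitions:

import numpy as np
import modin.pandas as pd

# Hypothetical data: one feature column "x" and one target column "y".
df = pd.DataFrame({"x": np.random.rand(10_000),
                   "y": np.random.rand(10_000)})

# Ordinary least squares for y = a*x + b via the closed-form solution.
# Each reduction (sum) executes in parallel over Modin's partitions.
n = len(df)
sx, sy = df["x"].sum(), df["y"].sum()
sxx = (df["x"] * df["x"]).sum()
sxy = (df["x"] * df["y"]).sum()
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n
print("slope:", a, "intercept:", b)

A real proof of concept would wrap something like this behind a Scikit-Learn-style fit/predict interface, but the pattern of expressing an algorithm in terms of parallel DataFrame reductions is the same.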

Thanks again for the questions, let me know if you have any others!

Shellcat-Zero commented 5 years ago

Thanks @devin-petersohn, I've gotten tacit approval from my organization to work on this. AWS is my primary clustering platform, and I'm working on a proof of concept to achieve a Modin cluster using Boto3 and BundleWrap. I am a really big fan of BundleWrap for configuration management and provisioning.

I've created a convenience wrapper combining the functionality of Boto3 with BundleWrap, and so far it makes deploying analytics environments very easy. I'm about to extend it to Modin clustering, and I may share it afterwards in case anyone else finds it useful.

devin-petersohn commented 5 years ago

That sounds great @Shellcat-Zero! Please let us know how it goes or if you run into any issues.

We can certainly pull something like that in if you have your company's permission and can share the code. We will need a place for it, but we can revisit that question when the time comes.

Shellcat-Zero commented 5 years ago

@devin-petersohn can you tell me where in the code Ray does the EC2 deployment from the YAML config? I was curious to see how this is implemented. It looks like you're using boto3/botocore, but I haven't found the relevant commands in the code yet.

devin-petersohn commented 5 years ago

It is in the autoscaler submodule of Ray:

Autoscaler directory: https://github.com/ray-project/ray/tree/master/python/ray/autoscaler
Commands (I think you want this): https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/commands.py

Let me know if this doesn't help!

Shellcat-Zero commented 5 years ago

Thanks @devin-petersohn, I was able to find what I was looking for.

The reason we say it is under active development is because we are still working to optimize some of the common use-cases (e.g. distributing a CSV file that exists in one of the machines in the cluster).

Can you elaborate on what the shortcomings are at this time and where the development stands with regard to this? I tried looking up discussions about this in the Google Group, but it gives me a permissions error. For example, is there a capability to replicate a CSV from the head node across all child nodes in a cluster, or can a very large CSV be chunked, with individual pieces sent to the nodes and the results of the computations aggregated on the head node?

devin-petersohn commented 5 years ago

For example, is there a capability to replicate a CSV from the head node across all child nodes in a cluster, or can a very large CSV be chunked, with individual pieces sent to the nodes and the results of the computations aggregated on the head node?

This is one of the current limitations.

If the entire file lives on the head node, we cannot broadcast it yet. Ray has a feature that places computation where the data is, and we are working on a general solution to this problem in Modin. Because of this Ray feature, we cannot force data to be shuffled without deep knowledge of the available cluster resources. If the dataset is entirely on S3, there is no such issue.
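For example (the bucket and key below are placeholders), an S3-backed read lets every node fetch its portion of the file directly, so nothing has to be broadcast from the head node:

import modin.pandas as pd

# Hypothetical S3 path; with the data in S3, workers read their own
# chunks directly rather than receiving them from the head node.
df = pd.read_csv("s3://my-bucket/very-large-file.csv")
print(df.shape)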

The other limitation that we are actively working on (cc @williamma12) is partitioning, both the number of partitions and how to partition the data. Thus far, we have hard-coded the number of partitions in cluster environments for testing purposes. If you want to run in a cluster, pd.DEFAULT_NPARTITIONS will have to be set higher than the default (the number of processors on the head node). As I mentioned, we are actively working on this.
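Concretely, that override looks like the following sketch (using the cluster-wide CPU count is an assumption of a reasonable starting point, not an official recommendation):

import ray
ray.init(redis_address="<redis-address>")

import modin.pandas as pd

# Raise the partition count from the default (the head node's processor
# count) to roughly the total number of CPUs across the whole cluster.
pd.DEFAULT_NPARTITIONS = int(ray.cluster_resources()["CPU"])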

Other than that, the only issue tends to be the out-of-memory errors that Ray throws if the dataset is too large. There are some settings in Ray that cannot be set at the same time, which limits the ability to do out-of-core processing in a cluster, but we are working on that as well.
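As a partial mitigation (a sketch; the 4 GB value is purely illustrative, and the exact knobs depend on your Ray version), you can bound the object store's footprint when starting Ray on a single node:

import ray

# object_store_memory is given in bytes; this caps the Plasma store at
# roughly 4 GB. On a multi-node cluster, the equivalent option is passed
# to `ray start` on each node rather than to ray.init.
ray.init(object_store_memory=4 * 10**9)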

aregm commented 4 years ago

@Shellcat-Zero please check out the new modin.experimental.cloud API, which is intended to address such cases. Feedback is welcome!

mvashishtha commented 2 years ago

@Shellcat-Zero I'm closing this issue because I don't see any open questions. Please comment here if you have any more concerns.