ml-explore / mlx-examples


Distributed Processing in any way? #374

Open LeaveNhA opened 5 months ago

LeaveNhA commented 5 months ago

Hello, as you might know, I admire your work (all of you, all the contributors) and love this community.

With that out of the way, here is my simple question:

Is there any plan to make MLX distributed, or is there an existing library/framework that I can use for this purpose?

Thank you, and good luck!

awni commented 5 months ago

Could you say more about what you are looking for? Distributed is a pretty generic term.

What exactly would you like to distribute? Training / inference? At what granularity? Any specific example?

LeaveNhA commented 5 months ago

Let's assume I have a couple of high-end Apple devices at home. I want to use them together for more processing power, both for training and inference.

danilopeixoto commented 5 months ago

We could be referring to solutions like Spark or Dask (with local, Kubernetes, MPI, and other backends) for distributed data processing, which could eventually grow ML-specific features such as Dask-ML, or to frameworks designed for ML from the start, like the distributed training features of PyTorch and TensorFlow.

Implementation examples

Dask

Setup

The setup depends on the backend. In this example, we define a local cluster.

Start the manager (scheduler):

dask scheduler

Start workers:

dask worker <dask-scheduler-address>

The cluster can also be created and managed using Python:

from dask.distributed import LocalCluster

# The cluster object allows scaling the number of workers at runtime
cluster = LocalCluster(n_workers=1)
cluster.scale(2)
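
The same Client API shown in the usage examples below can also attach to this cluster object directly; a minimal sketch using only the Dask API already in this example:

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=1)
client = Client(cluster)  # computations now run on this local cluster
cluster.scale(2)          # the client picks up the new workers automatically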

The workers need to be able to discover the scheduler on the network and share access to resources such as files and data sources.

Usage

Standalone:

import dask.dataframe as dd

df = dd.read_csv(...)
df.x.sum().compute()

Local cluster:

from dask.distributed import Client
import dask.dataframe as dd

client = Client('<dask-scheduler-address>')

# Once created, the client becomes the default scheduler for compute() calls
df = dd.read_csv(...)
df.x.sum().compute()

ML usage

Local cluster:

from dask.distributed import Client
import dask.dataframe as dd

from dask_ml.cluster import KMeans

client = Client('<dask-scheduler-address>')

# Once created, the client becomes the default scheduler for compute() calls
df = dd.read_csv(...)

kmeans = KMeans(n_clusters=3, init_max_iter=2, oversampling_factor=10)
kmeans.fit(df.to_dask_array(lengths=True))
kmeans.predict(df.to_dask_array(lengths=True)).compute()

We could implement an MLX collection backend (mlx-dask) for Dask: https://docs.dask.org/en/latest/how-to/selecting-the-collection-backend.html#defining-a-new-collection-backend
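
To make that concrete, here is a very rough sketch of where such a backend might start, going by the linked page (the MLXBackendEntrypoint class and its method bodies are hypothetical, not a working implementation):

import dask.array as da
import numpy as np
import mlx.core as mx
from dask.array.backends import ArrayBackendEntrypoint

class MLXBackendEntrypoint(ArrayBackendEntrypoint):
    # Hypothetical: route dask.array creation routines through mlx.core.
    # The np.array() round trip is just a placeholder; a real backend would
    # keep chunks as MLX arrays instead of copying through NumPy.
    def ones(self, shape, **kwargs):
        return da.from_array(np.array(mx.ones(shape)))

    def zeros(self, shape, **kwargs):
        return da.from_array(np.array(mx.zeros(shape)))

Per the linked docs, the entrypoint would then be registered under the dask.array.backends group so that dask.config.set({"array.backend": "mlx"}) could select it.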

LeaveNhA commented 5 months ago

Man, you've really lit the path. Yes, we can. Give me a couple of days to read the documentation and implementation details.

I have a question in advance: how deep do we have to dive in order to implement such a backend? Are there best practices, or can we take ideas from other backends (I assume there are other backends)?

Thank you, truly. Sincerely.

danilopeixoto commented 4 months ago

@LeaveNhA, I encountered challenges while working on a prototype with the Dask Backend Entrypoint API.

We could explore alternative methods such as Dask Custom Collections, Dask Delayed, and Dask Futures to implement distributed computations; a rough sketch of the Delayed route follows.
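
For instance, a minimal sketch of the Dask Delayed route, assuming mlx.core is importable on each worker and using the local threaded scheduler (MLX arrays may not serialize cleanly across processes); the matmul_chunk helper is illustrative only:

import dask
import mlx.core as mx
from dask import delayed

@delayed
def matmul_chunk(a, b):
    # Each task is an independent MLX computation on the local device
    return mx.matmul(a, b)

# Four independent tasks; the threaded scheduler keeps everything in-process
tasks = [
    matmul_chunk(mx.random.normal((256, 256)), mx.random.normal((256, 256)))
    for _ in range(4)
]
results = dask.compute(*tasks, scheduler="threads")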

Ray is also an interesting option to explore.
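
A comparable sketch with Ray tasks (ray.init() here starts a throwaway local cluster; the mlx_task function is illustrative only):

import ray
import mlx.core as mx

ray.init()  # local cluster; ray.init(address="auto") would join an existing one

@ray.remote
def mlx_task(n):
    # Each Ray worker runs an MLX computation on its local device
    a = mx.random.normal((n, n))
    return mx.sum(a @ a).item()

print(ray.get([mlx_task.remote(256) for _ in range(4)]))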