jcrist opened this issue 2 months ago
I like the idea of a general `to_batches`, and then if needed/asked we could create specific APIs like `to_dask_df` / `to_spark_df`.
I think starting with `to_dask` makes sense. Supporting a general batching API doesn't (yet) seem worth the effort.
Hi @jcrist, thank you for creating this. Transferring data directly between the compute backend and another cluster, bypassing the client, is crucial for efficient ML training.
Starting with `to_dask` and then adding a general `to_batches` would be perfect. We could connect the compute backend to different kinds of training clusters, such as PyTorch and TensorFlow. Please let me know if you need any help from me.
Not seeing exactly what `to_batches` is getting us here. Is this motivated by an ibis-ml use case?
`to_batches` could be more general: we could convert the batches to different formats, e.g. dask, TensorFlow tensors, or torch tensors. It would be great if we could pass the data from some other backends to the training cluster without going through the client. One direct use case for IbisML: we could demo large-scale training using Spark or BigQuery + XGBoost/torch.
Opening this mostly for discussion.
Say all your data lives in `<big cloud provider db>`. After doing some selecting/filtering/transforming, you want to export your data out of the DB and into a different distributed system like `dask` (or spark or others) to do some operations (ML training, for example) that can't as easily be executed purely in the database backend.

Some of our backends provide efficient means for distributed batch retrieval. By this I mean a way to fetch query results in parallel (perhaps across a distributed system) rather than streaming them back through the client. In these cases, conversion of a result set to a distributed object (like a `dask.dataframe`) could be done fairly efficiently, and in a way that the user can't easily compose using existing API methods.

Systems that support this natively:
- dask
- spark
- bigquery
- snowflake
We could support this as a general method for systems where this is inefficient, but I'm not sure if we'd want to do that. Better to error than accidentally slowly pipe data through the client and back out to a cluster (a user can fairly easily write this code themselves too).
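To make the "better to error" idea concrete, here's a rough sketch of how a backend could gate the method behind a capability flag. Everything here is hypothetical — the flag name, method name, and message are made up for illustration, not existing ibis APIs:

```python
class Backend:
    # Hypothetical flag; only backends with native parallel result
    # retrieval (dask, spark, bigquery, snowflake) would set this True.
    supports_batch_export = False

    def to_batches(self, expr):
        if not self.supports_batch_export:
            # Fail loudly rather than silently streaming everything
            # through the client and back out to the cluster.
            raise NotImplementedError(
                f"{type(self).__name__} does not support distributed "
                "batch retrieval"
            )
        ...  # backend-specific parallel fetch would go here
```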
We could expose this as a `to_dask` method on an expression that does all the fiddly bits and returns a `dask.dataframe` object.

Alternatively (or additionally), we could generalize this to a `to_batches` (or better name) method that returns a list of `Batch` objects, each of which has `to_pandas`/`to_arrow`/`to_polars` methods for fetching the partition as a specified type. These could be pickleable and distributed to any distributed system (dask/spark/ray/...). Conversion to a dask dataframe would then be something like:
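A minimal sketch, assuming the proposed (hypothetical) `Batch` API — here the partition contents are inlined for illustration; a real backend would fetch each partition directly from the database:

```python
import pandas as pd

# Hypothetical sketch of the proposed Batch API; none of these names
# exist in ibis today.
class Batch:
    """One pickleable partition of a query result."""

    def __init__(self, records):
        self._records = records

    def to_pandas(self):
        # A real backend would fetch this partition in parallel straight
        # from the database (e.g. a BigQuery read stream or a Snowflake
        # result chunk), never routing the rows through the client.
        return pd.DataFrame(self._records)


def to_dask(batches):
    """Assemble a dask dataframe from a list of Batch objects."""
    import dask
    import dask.dataframe as dd

    # Each partition is fetched lazily on whatever worker runs it,
    # so the data never passes through the client process.
    parts = [dask.delayed(batch.to_pandas)() for batch in batches]
    return dd.from_delayed(parts)
```

Since each `Batch` is pickleable, the same objects could be scattered to spark or ray workers just as easily as wrapped in `dask.delayed`.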