jcrist opened this issue 2 months ago
I like the idea of a general `to_batches`, and then if needed/asked we could create specific APIs like `to_dask_df` / `to_spark_df`.
I think starting with `to_dask` makes sense. Supporting a general batching API doesn't (yet) seem worth the effort.
Hi @jcrist, thank you for creating this. Transferring data directly between the compute backend and another cluster, bypassing the client, is crucial for efficient ML training.
Starting with `to_dask` and then adding a general `to_batches` would be perfect. We could connect the compute backend to different kinds of training clusters, such as PyTorch and TensorFlow. Please let me know if you need any help from me.
Not seeing exactly what `to_batches` is getting us here. Is this motivated by an ibis-ml use case?
`to_batches` could be more general: we could convert the batches to different formats, e.g. dask, TensorFlow tensors, or torch tensors. It would be great if we could pass the data from some other backends to the training cluster without going through the client. One direct use case for IbisML: we could demo large-scale training using Spark or BigQuery + XGBoost/torch.
Opening this mostly for discussion.
Say all your data lives in `<big cloud provider db>`. After doing some selecting/filtering/transforming, you want to export your data out of the DB and into a different distributed system like `dask` (or spark or others) to do some operations (ML training, for example) that can't as easily be executed purely in the database backend.

Some of our backends provide efficient means for distributed batch retrieval. By this I mean a way to fetch query results in parallel (perhaps across a distributed system) rather than streaming them back through the client. In these cases, conversion of a result set to a distributed object (like a `dask.dataframe`) could be done fairly efficiently, and in a way that the user can't easily compose using existing API methods.

Systems that support this natively:
- dask
- spark
- bigquery
- snowflake
We could support this as a general method for systems where this is inefficient, but I'm not sure if we'd want to do that. Better to error than accidentally slowly pipe data through the client and back out to a cluster (a user can fairly easily write this code themselves too).
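To make the "better to error" idea concrete, here's a rough sketch of how a backend could gate the method behind a capability flag. Everything here is hypothetical — the flag name, method name, and message are made up for illustration, not existing ibis APIs:

```python
class Backend:
    # Hypothetical flag; only backends with native parallel result
    # retrieval (dask, spark, bigquery, snowflake) would set this True.
    supports_batch_export = False

    def to_batches(self, expr):
        if not self.supports_batch_export:
            # Fail loudly rather than silently streaming everything
            # through the client and back out to the cluster.
            raise NotImplementedError(
                f"{type(self).__name__} does not support distributed "
                "batch retrieval"
            )
        ...  # backend-specific parallel fetch would go here
```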
We could expose this as a `to_dask` method on an expression that does all the fiddly bits and returns a `dask.dataframe` object.

Alternatively (or additionally), we could generalize this to a `to_batches` (or better name) method that returns a list of `Batch` objects, each of which has `to_pandas`/`to_arrow`/`to_polars` methods for fetching the partition as a specified type. These could be pickleable and distributed to any distributed system (dask/spark/ray/...). Conversion to a dask dataframe would then be something like:
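A minimal sketch, assuming the proposed (hypothetical) `Batch` API — here the partition contents are inlined for illustration; a real backend would fetch each partition directly from the database:

```python
import pandas as pd

# Hypothetical sketch of the proposed Batch API; none of these names
# exist in ibis today.
class Batch:
    """One pickleable partition of a query result."""

    def __init__(self, records):
        self._records = records

    def to_pandas(self):
        # A real backend would fetch this partition in parallel straight
        # from the database (e.g. a BigQuery read stream or a Snowflake
        # result chunk), never routing the rows through the client.
        return pd.DataFrame(self._records)


def to_dask(batches):
    """Assemble a dask dataframe from a list of Batch objects."""
    import dask
    import dask.dataframe as dd

    # Each partition is fetched lazily on whatever worker runs it,
    # so the data never passes through the client process.
    parts = [dask.delayed(batch.to_pandas)() for batch in batches]
    return dd.from_delayed(parts)
```

Since each `Batch` is pickleable, the same objects could be scattered to spark or ray workers just as easily as wrapped in `dask.delayed`.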