ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.62k stars, 564 forks

feat: Consideration for Batch Data Retrieval Support? #8105

Open stereoF opened 5 months ago

stereoF commented 5 months ago

Is your feature request related to a problem?

I would like to propose a feature request for your consideration: is there any plan to support data retrieval in batches?

We currently face the following scenario:

1. We are ETLing data from Trino to ClickHouse. This ETL process may involve a series of data manipulations, with the resultant data being stored in ClickHouse.
2. We read data from ClickHouse for machine learning training purposes. If the dataset is large, we might need to read the data in batches for training and updating the model.

In both of these processes, attempting to read all the data at once could exceed the memory capacity of a single machine. Retrieving the data in batches would keep memory consumption bounded.

Is there a plan to support batch data retrieval, or perhaps there is a better solution already available?

Describe the solution you'd like

I would like to suggest adding support for data retrieval in batches, or alternatively, providing better solutions, such as dedicated ETL components.

What version of ibis are you running?

'7.1.0'

What backend(s) are you using, if any?

trino, clickhouse


lostmygithubaccount commented 5 months ago

hi @stereoF, thanks for opening! table.to_pandas_batches() and table.to_pyarrow_batches() are already supported, would that be sufficient for your use case?
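for illustration, here is a rough sketch of how the batch methods could feed an incremental training loop. The helper names (train_in_batches, stream_table, demo), the connection details, and the table name are all hypothetical; only to_pyarrow_batches() and its chunk_size argument come from Ibis itself, so treat this as a starting point under those assumptions rather than a recommended pattern.

```python
from typing import Any, Callable, Iterable, Iterator


def train_in_batches(batches: Iterable[Any], update: Callable[[Any], None]) -> int:
    """Pass each batch to `update` one at a time and return the batch count.

    Only one batch is held by this loop at any moment, so peak memory is
    bounded by the batch size rather than by the full result set.
    """
    count = 0
    for batch in batches:
        update(batch)
        count += 1
    return count


def stream_table(table: Any, chunk_size: int = 100_000) -> Iterator[Any]:
    """Yield pandas DataFrames from an Ibis table expression, chunk by chunk.

    to_pyarrow_batches() returns a pyarrow RecordBatchReader; iterating it
    yields RecordBatch objects, which are converted to pandas here.
    """
    for record_batch in table.to_pyarrow_batches(chunk_size=chunk_size):
        yield record_batch.to_pandas()


def demo() -> None:
    """Hypothetical end-to-end usage; host and table name are placeholders."""
    import ibis  # imported lazily so the helpers above stay dependency-free

    con = ibis.clickhouse.connect(host="localhost")  # assumed connection details
    table = con.table("training_data")  # assumed table name

    rows_seen = 0

    def update(df: Any) -> None:
        # Stand-in for an incremental model update (e.g. a partial fit).
        nonlocal rows_seen
        rows_seen += len(df)

    n_batches = train_in_batches(stream_table(table, chunk_size=50_000), update)
    print(f"processed {rows_seen} rows in {n_batches} batches")
```

table.to_pandas_batches() could replace stream_table if you want DataFrames directly. The same loop shape should also work for the ETL case, with update writing each chunk to the target instead of updating a model.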

we're also thinking about efficient handoff to ML training from Ibis in the IbisML project (https://github.com/ibis-project/ibisml)