man-group / ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
http://arcticdb.io

Feature request: read a large DataFrame in chunks #1709

Open Crypto7816 opened 1 month ago

Crypto7816 commented 1 month ago

Is your feature request related to a problem? Please describe. If we have a really large DataFrame that exceeds memory and we need to process each part of it, Parquet supports reading in batches via batch_size. I'm wondering if lib.read has similar functionality.

import pyarrow.parquet as pq

def read_parquet_in_batches(file_path, batch_size=10000):
    # Stream the file as record batches so only one batch is in memory at a time
    parquet_file = pq.ParquetFile(file_path)
    for batch in parquet_file.iter_batches(batch_size=batch_size):
        yield batch.to_pandas()
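A typical way to consume the generator, one chunk at a time (the file path is hypothetical):

for chunk in read_parquet_in_batches('trades.parquet', batch_size=50_000):
    process(chunk)  # 'process' stands in for whatever per-chunk work is needed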
DrNickClarke commented 1 month ago

Hi. You can do this using the row_range argument of the read function.

We are planning various future improvements that will make this easier to use and faster.
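For a concrete pattern, here is a minimal sketch of chunked reads via row_range. It assumes get_description(symbol).row_count reports the symbol's total row count and that row_range is end-exclusive, like Python slicing:

def read_in_row_batches(lib, symbol, batch_size=100_000):
    # lib is an arcticdb Library, e.g. adb.Arctic(uri)[library]
    # Total stored rows, taken from the symbol's description
    total_rows = lib.get_description(symbol).row_count
    for start in range(0, total_rows, batch_size):
        # Read only rows [start, start + batch_size) into memory
        yield lib.read(symbol, row_range=(start, start + batch_size)).data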

Crypto7816 commented 1 month ago

> Hi. You can do this using the row_range argument of the read function.
>
> We are planning various future improvements that will make this easier to use and faster.

Yes, it can indeed be done using row_range or date_range. Could ArcticDB implement a function like the one below, making it easier for users to operate on each batch instead of the entire DataFrame? That would save a lot of memory.

from datetime import timedelta

import arcticdb as adb
import pandas as pd

def fetch_batch_from_arcticdb(
    symbol: str,
    start: str,
    end: str,
    batch_size: int = 1440,
    uri: str = 'lmdb://crypto_database.lmdb',
    library: str = 'binance',
):
    ac = adb.Arctic(uri)
    lib = ac[library]

    start_date = pd.Timestamp(start)
    end_date = pd.Timestamp(end)

    while start_date < end_date:
        # Each batch covers batch_size minutes (default: one day of 1-minute bars)
        batch_end = min(start_date + timedelta(minutes=batch_size), end_date)

        # Only this slice of the symbol is loaded into memory.
        # Note: date_range is inclusive at both ends, so a row falling exactly
        # on batch_end may appear in two consecutive batches.
        df = lib.read(symbol, date_range=(start_date, batch_end)).data

        yield df

        start_date = batch_end
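A usage sketch (the symbol and dates are hypothetical):

for df in fetch_batch_from_arcticdb('BTCUSDT', '2024-01-01', '2024-01-07'):
    print(df.shape)  # handle one day of 1-minute bars at a time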
Crypto7816 commented 1 month ago

> Hi. You can do this using the row_range argument of the read function.
>
> We are planning various future improvements that will make this easier to use and faster.

Hi, any new updates?

vasil-pashov commented 1 month ago

> Hi. You can do this using the row_range argument of the read function. We are planning various future improvements that will make this easier to use and faster.
>
> Hi, any new updates?

Hi, the roadmap is not completely sorted out. @DrNickClarke will get back to you with more info.

DrNickClarke commented 3 weeks ago

Hi. Sorry for the delay coming back on this. Thank you for your suggestion. We definitely have plans to make chunking easier going forward. It has not reached the top of the priority list at this time but I hope you will be pleased to see the announcements we will be making in the near future.

Crypto7816 commented 3 weeks ago

> Hi. Sorry for the delay coming back on this. Thank you for your suggestion. We definitely have plans to make chunking easier going forward. It has not reached the top of the priority list at this time but I hope you will be pleased to see the announcements we will be making in the near future.

I've found some great features in Dask. It would be great if ArcticDB could implement something similar; one example is sketched below.
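One such Dask feature, presumably the relevant one here, is dask.dataframe's lazy, partitioned reading, where each partition is materialised independently; a minimal sketch (the file path is hypothetical):

import dask.dataframe as dd

# Lazily open a partitioned dataset; no data is loaded yet
ddf = dd.read_parquet('trades.parquet')

for i in range(ddf.npartitions):
    # Materialise one partition at a time, keeping memory bounded
    df = ddf.get_partition(i).compute()
    print(df.shape)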