snowflakedb / snowflake-connector-python

Snowflake Connector for Python
https://pypi.python.org/pypi/snowflake-connector-python/
Apache License 2.0

SNOW-899773: Allow specification of batch_size for batch-generating functions #1712

Open willsthompson opened 1 year ago

willsthompson commented 1 year ago

What is the current behavior?

Batch size is not controllable from the client when using the batch-generating functions, e.g. get_result_batches(), fetch_arrow_batches(), and fetch_pandas_batches().
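For context, a minimal sketch of the current API (connection parameters are placeholders): the server decides how many rows land in each batch, and nothing in the call influences it.

```python
import snowflake.connector

# Placeholder credentials -- fill in for your account.
conn = snowflake.connector.connect(user="...", password="...", account="...")
cur = conn.cursor()
cur.execute("SELECT * FROM some_large_table")

# fetch_arrow_batches() yields pyarrow.Table objects whose row counts
# are chosen by the server; there is no parameter to control them.
for table in cur.fetch_arrow_batches():
    print(table.num_rows)  # varies from batch to batch
```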

What is the desired behavior?

Allow a batch_size parameter on the batch-generating functions that determines the number of records returned in each batch, along the lines of the sketch below.
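Conceptually something like this hypothetical call (batch_size is the proposed parameter; it does not exist in the connector today):

```python
# Hypothetical API -- batch_size is the requested feature, not an existing argument.
for table in cur.fetch_arrow_batches(batch_size=100_000):
    process(table)  # each table would hold at most 100,000 rows
```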

How would this improve snowflake-connector-python?

Many applications require tight control over memory usage to operate reliably. This applies to essentially any service running on a remote server, i.e. not on a user's laptop. Our application provides connections to multiple databases and cloud storage providers, and every other database and storage provider we support exposes this capability in its connector. The only way we can offer the same level of reliability for Snowflake is for the connector to let us control the size of responses to large requests.

References and other background

kylejcaron commented 8 months ago

> It would still solve our problem (and I think most similar problems) if the returned batches are only close, but not exactly, the requested size. Generally, we only want to ensure we do not receive batches so big that a worker process runs out of memory, or so many small batches that it takes too long to iterate and merge them.

Hope y'all don't mind me chiming in here - a lot of ML frameworks require tightly controlled batch sizes. An example pattern that's compatible with a lot of these frameworks is to_torch_datapipe from snowflake.ml.

It might be worth finding an expert who uses TensorFlow/PyTorch/JAX at scale to weigh in.
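For the exact-size case, one client-side stopgap is to re-chunk the stream with pyarrow (a sketch assuming the existing fetch_arrow_batches() API; it fixes batch sizes for the ML frameworks above, but it cannot reduce the peak memory of a too-large server batch, since that batch has already been materialized):

```python
from typing import Iterator, List
import pyarrow as pa

def rebatch(tables: Iterator[pa.Table], batch_size: int) -> Iterator[pa.Table]:
    """Re-chunk a stream of variably sized Arrow tables into exact-size
    batches; only the final batch may be smaller."""
    buffer: List[pa.Table] = []
    buffered_rows = 0
    for table in tables:
        buffer.append(table)
        buffered_rows += table.num_rows
        while buffered_rows >= batch_size:
            combined = pa.concat_tables(buffer)
            yield combined.slice(0, batch_size)
            leftover = combined.slice(batch_size)  # rows beyond the batch boundary
            buffer = [leftover]
            buffered_rows = leftover.num_rows
    if buffered_rows > 0:
        yield pa.concat_tables(buffer)

# Usage against the connector's existing API:
# for batch in rebatch(cur.fetch_arrow_batches(), 4096):
#     train_step(batch)  # every batch except possibly the last has 4096 rows
```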

bitshop commented 8 months ago

I found this issue while searching around, thinking I must be misreading the docs or just not finding another command in the connector that fetches up to a specific size. NOTE: I suggest this may be worth splitting into two requests:

  1. Specify a MAX batch size - for memory-constrained apps this is fine.
  2. Specify an EXACT batch size - this potentially requires more compute on Snowflake's side to borrow rows from other batches so that each fetched batch comes out exactly the requested size. Assuming a highly distributed query, I would think each producer of rows publishes only its own row count, hence this request seems harder.

To be clear, the difference is that #1 is purely about memory constraints: if some batches have 1 row and others the max, that's acceptable for that use case.
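That asymmetry shows up client-side too: a MAX cap (#1) is a stateless split of each oversized batch, whereas an EXACT size (#2) needs buffering across batches, as in the rebatch sketch earlier in this thread. A minimal sketch of the cap, again assuming pyarrow tables from fetch_arrow_batches():

```python
from typing import Iterator
import pyarrow as pa

def cap_batch_size(tables: Iterator[pa.Table], max_rows: int) -> Iterator[pa.Table]:
    """Enforce only a maximum batch size: split oversized tables into
    max_rows-sized slices, pass smaller ones through unchanged.
    No state is carried across batches."""
    for table in tables:
        for start in range(0, table.num_rows, max_rows):
            yield table.slice(start, max_rows)
```

Note this only limits what downstream code sees; the full server batch is still downloaded and decoded first, which is why server-side control is the actual ask here.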

sfc-gh-dszmolka commented 5 months ago

Thank you for opening this request with us - we'll consider it for a possible future improvement in the connector.

JHuangg commented 5 months ago

Is there a way to understand how batch sizes are currently being generated?
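For anyone else wondering: not an explanation of how the sizes are chosen, but they can at least be observed. get_result_batches() returns ResultBatch objects that report their row counts before any data is downloaded (a sketch assuming a live cursor cur):

```python
cur.execute("SELECT * FROM some_large_table")
batches = cur.get_result_batches()  # metadata only; no result data fetched yet
for batch in batches:
    print(batch.rowcount)  # server-chosen row count for this batch
```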

jordantshaw commented 1 month ago

This would be a very helpful feature.