snowflakedb / snowflake-connector-python

Snowflake Connector for Python
https://pypi.python.org/pypi/snowflake-connector-python/
Apache License 2.0

SNOW-590042: Support reading string data as dictionary-encoded arrays via fetch_arrow_batches #1140

Open njriasan opened 2 years ago

njriasan commented 2 years ago

What is the current behavior?

Currently, when reading string data, the Snowflake connector always returns it as a regular string array.

What is the desired behavior?

Snowflake should support returning data using Arrow's dictionary-encoded arrays. A dictionary-encoded array represents a string array as a dictionary of unique values plus an index array that maps each entry of the array to a position in the dictionary.
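For reference, PyArrow already exposes this representation directly on plain string arrays; a minimal sketch (the values here are made up for illustration):

```python
import pyarrow as pa

# A plain string array stores every value in full.
plain = pa.array(["abc", "df", "abc", "df"])

# dictionary_encode() rewrites it as a dictionary of unique values
# plus a small integer index array pointing into that dictionary.
encoded = plain.dictionary_encode()

print(encoded.type)        # dictionary<values=string, indices=int32, ordered=0>
print(encoded.dictionary)  # unique values: "abc", "df"
print(encoded.indices)     # index values: 0, 1, 0, 1
```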

To simplify the implementation, I would make it a requirement that users request specific columns be read with dictionary encoding.
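One possible shape for that opt-in, purely as a hypothetical sketch (the parameter name below does not exist in the connector today):

```python
# Hypothetical: an explicit, per-column opt-in on the existing
# fetch_arrow_batches() call; the argument name is illustrative only.
cur.execute("SELECT str_col FROM some_table")
for table in cur.fetch_arrow_batches(dictionary_encode_columns=["STR_COL"]):
    ...  # the listed string columns would arrive as Arrow DictionaryArrays
```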

How would this improve snowflake-connector-python?

Dictionary-encoded arrays can significantly reduce memory usage when loading data that contains many duplicate values. Compute engines such as Spark and Bodo already try to keep string data dictionary-encoded for as long as possible when reading from Parquet. However, because the Snowflake connector cannot return data as dictionary-encoded arrays, applications fetching data from Snowflake have significantly larger memory requirements.
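Callers can re-encode after the fetch with PyArrow today, but the fully materialized string buffers still have to be allocated first, so the peak memory is unchanged; a small sketch of the size difference:

```python
import pyarrow as pa

# 200,000 strings drawn from only two distinct values.
plain = pa.chunked_array([pa.array(["abc", "df"] * 100_000)])

# Re-encoding shrinks the resting footprint, but only after the full
# string array has already been built in memory.
encoded = plain.dictionary_encode()

print(plain.nbytes)    # offsets + character data for every repeated string
print(encoded.nbytes)  # small dictionary + int32 indices
```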

References, Other Background

njriasan commented 2 years ago

Here is an example of the impact of dictionary-encoded string arrays, using Bodo.

In [1]: df = pd.DataFrame({"A": ["abc", "df"] * 100000})

In [2]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   A       200000 non-null  object
dtypes: object(1)
memory usage: 1.5+ MB

In [3]: df.to_parquet("dict_str_test.pq")

In [4]: @bodo.jit
     ...: def f():
     ...:     df = pd.read_parquet("dict_str_test.pq")
     ...:     print(df.info())

In [5]: f()
<class 'DataFrameType'>
RangeIndexType(none): 200000 entries, 0 to 199999
Data columns (total 1 columns):
#   Column  Non-Null Count  Dtype
--- ------  --------------  -----
 0  A       200000 non-null      unicode_type
dtypes: unicode_type(1)
memory usage: 805.7 KB

The input Parquet file has a lot of repetition, so it is compressed using a dictionary on disk. In this example you can see that even though each string is only two or three characters long, keeping the data dictionary-encoded cuts the memory usage roughly in half (805.7 KB vs. 1.5+ MB).
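For comparison, PyArrow's Parquet reader already offers exactly this kind of per-column opt-in via its read_dictionary argument; using the file and column from the example above:

```python
import pyarrow.parquet as pq

# Read column "A" directly as a DictionaryArray instead of a plain string array.
table = pq.read_table("dict_str_test.pq", read_dictionary=["A"])
print(table.schema.field("A").type)  # dictionary<values=string, indices=int32, ordered=0>
```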

sfc-gh-madkins commented 2 years ago

@sfc-gh-zpeng is this the same issue?