njriasan opened 2 years ago:
Here is an example of the impact of dictionary encoded string arrays using Bodo.
```
In [1]: df = pd.DataFrame({"A": ["abc", "df"] * 100000})

In [2]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   A       200000 non-null  object
dtypes: object(1)
memory usage: 1.5+ MB

In [3]: df.to_parquet("dict_str_test.pq")

In [4]: @bodo.jit
   ...: def f():
   ...:     df = pd.read_parquet("dict_str_test.pq")
   ...:     print(df.info())

In [5]: f()
<class 'DataFrameType'>
RangeIndexType(none): 200000 entries, 0 to 199999
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   A       200000 non-null  unicode_type
dtypes: unicode_type(1)
memory usage: 805.7 KB
```
The input Parquet file has a lot of repetition, so Parquet stores it with dictionary encoding. In this example you can see that even though each string is only two or three characters long, keeping the data dictionary encoded cuts the memory usage roughly in half.
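As a standalone illustration of the same effect, here is a minimal pyarrow sketch (the array contents mirror the DataFrame above; pyarrow is assumed to be available, since the connector already uses it for Arrow result sets):

```python
import pyarrow as pa

# Plain string array vs. its dictionary-encoded form.
plain = pa.array(["abc", "df"] * 100000)
encoded = plain.dictionary_encode()

# The encoded array stores only 2 unique strings plus an index array,
# so its buffers are substantially smaller than the plain array's.
print(plain.nbytes, encoded.nbytes)
```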
@sfc-gh-zpeng is this the same issue?
What is the current behavior?
Currently, when reading string data, the Snowflake connector always returns the data as a regular string array.
What is the desired behavior?
Snowflake should support returning data using Arrow's dictionary-encoded arrays. A dictionary-encoded array represents a string array using a dictionary of its unique values and an index array that maps each entry to a location in the dictionary.
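For concreteness, this is what the two pieces look like in pyarrow (a small sketch with made-up values):

```python
import pyarrow as pa

arr = pa.array(["abc", "df", "abc", "abc"]).dictionary_encode()

print(arr.dictionary)  # unique values: ["abc", "df"]
print(arr.indices)     # per-row positions into the dictionary: [0, 1, 0, 0]
print(arr.type)        # dictionary<values=string, indices=int32, ordered=0>
```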
To simplify the implementation, I would require users to explicitly request which columns should be read with dictionary encoding (see the sketch below for one possible shape of such an API).
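One possible opt-in shape, purely hypothetical: the `dictionary_encoded_columns` parameter below does not exist in the connector today (`fetch_arrow_all` does), and the table and column names are placeholders.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    user="...", password="...", account="..."  # placeholder credentials
)
cur = conn.cursor()
cur.execute("SELECT A FROM my_table")

# Hypothetical opt-in: ask for specific columns as Arrow
# dictionary-encoded arrays instead of plain string arrays.
table = cur.fetch_arrow_all(dictionary_encoded_columns=["A"])
```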
How would this improve snowflake-connector-python?
Dictionary-encoded arrays can significantly reduce the memory needed to load data containing many duplicate values. Compute engines such as Spark and Bodo currently try to keep string data dictionary encoded for as long as possible when reading from Parquet (see the example below). However, since the Snowflake connector cannot return data as dictionary-encoded arrays, applications fetching data from Snowflake will have significantly larger memory requirements.
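As a point of comparison, pyarrow's Parquet reader already exposes exactly this kind of per-column opt-in (`read_dictionary` is a real parameter of `pyarrow.parquet.read_table`); the file name below is the one from the example above:

```python
import pyarrow.parquet as pq

# Ask the Parquet reader to keep column "A" dictionary encoded
# instead of decoding it into a plain string array.
table = pq.read_table("dict_str_test.pq", read_dictionary=["A"])
print(table.schema.field("A").type)  # dictionary<values=string, indices=int32, ordered=0>
```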
References, Other Background