Open adampinky85 opened 7 months ago
The reason is how the categorical indices are generated in memory; in your case it is exactly the difference between int32 and int8.
You can check it:
print(df1['fruit'].dtype) # dictionary<values=string, indices=int32, ordered=0>[pyarrow]
print(df2['fruit'].cat.codes.dtype) # int8
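A quick back-of-the-envelope check of where a ~4x factor comes from (illustrative: 100 million rows; index/code buffers only, dictionary values excluded):
n_rows = 100_000_000
print(n_rows * 4 / 1e9)  # ~0.4 GB of int32 indices
print(n_rows * 1 / 1e9)  # ~0.1 GB of int8 codes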
However, it seems to be generated at a lower level (I'm not a pandas expert). It is reproducible using pyarrow alone:
import pyarrow.parquet as pq
import pandas as pd
import pyarrow as pa
def pyarrow_categorical_dtype(vals):
    """
    Build a pandas ArrowDtype from the dictionary-encoded version of a list of values
    https://arrow.apache.org/docs/python/generated/pyarrow.DictionaryArray.html
    """
    as_dict_vals = pa.array(vals).dictionary_encode()
    return pd.ArrowDtype(as_dict_vals.type)
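# (Not part of the original snippet: a minimal stand-in for the `df` and
# `temp_file` used below, assuming a small low-cardinality category column.)
import os, tempfile
df = pd.DataFrame({'fruit': ['apple', 'banana', 'cherry'] * 1000}).astype({'fruit': 'category'})
temp_file = os.path.join(tempfile.gettempdir(), 'fruit_repro.parquet')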
df.dtypes # fruit category dtype: object
dfa = df.astype(pyarrow_categorical_dtype(df['fruit']))
dfa.dtypes # dictionary<values=string, indices=int8, ordere...
# It is using an int8 (the optimal type) as index
dfa.to_parquet(temp_file)
Now, reloading the file using pandas or pyarrow, it no longer seems able to understand the type:
df4 = pd.read_parquet(temp_file)
df4 = pq.read_table(temp_file).to_pandas(ignore_metadata=False, types_mapper=pd.ArrowDtype)
# Both give the same error
# ValueError: format number 1 of "dictionary<values=string, indices=int8, ordered=0>[pyarrow]" is not recognized
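The dtype string in that error appears to come from the pandas metadata that to_parquet stored in the file; a small sketch to inspect it (reusing temp_file from above):
import json
schema = pq.read_schema(temp_file)
# 'columns' records the serialized dtypes that read_parquet tries to reconstruct;
# presumably the dictionary<...>[pyarrow] string is what it fails to parse back.
print(json.dumps(schema.pandas_metadata['columns'], indent=2))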
This makes me think that the problem occurs during loading. This can be tested by ignoring the pandas metadata, which sidesteps the error:
pq.read_table(temp_file).to_pandas(ignore_metadata=True, types_mapper=pd.ArrowDtype)
# fruit dictionary<values=string, indices=int32,
This works, but the indices are back to the default int32, no longer the optimal type.
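Until that's addressed, a possible workaround is to re-encode after loading with an explicitly chosen int8 index type; just a sketch, assuming your pyarrow version supports casting to a dictionary type:
df5 = pq.read_table(temp_file).to_pandas(ignore_metadata=True, types_mapper=pd.ArrowDtype)
# Decode to plain strings, then re-encode with int8 indices
int8_dict = pd.ArrowDtype(pa.dictionary(pa.int8(), pa.string()))
df5['fruit'] = df5['fruit'].astype(pd.ArrowDtype(pa.string())).astype(int8_dict)
df5.dtypes  # dictionary<values=string, indices=int8, ...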
Tested with Python 3.10 and pandas 2.2.1.

Glauco
Thanks, I would assume it's a common case for categorical data to have low cardinality that fits within an int8.
On disk, Parquet appears to store the category data as logical type String, which is compressed with snappy and encoded: https://parquet.apache.org/docs/file-format/data-pages/encodings/
It's really important for our use case, given the large volume of data (billions of rows), to ensure our in-memory representation is optimal. If the team could review and provide any advice, that would be much appreciated! Thanks.
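For reference, output like the dump below can be produced along these lines (a sketch; the path is a placeholder):
import pyarrow.parquet as pq
pf = pq.ParquetFile('fruit.parquet')        # placeholder path
print(pf.metadata)                          # FileMetaData
print(pf.schema)                            # ParquetSchema
print(pf.metadata.row_group(0).column(0))   # ColumnChunkMetaData + Statistics
print(pf.schema.column(0))                  # ParquetColumnSchema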
Metadata
<pyarrow._parquet.FileMetaData object at 0x7f0c6f204630>
  created_by: parquet-cpp-arrow version 15.0.2
  num_columns: 1
  num_rows: 100000000
  num_row_groups: 96
  format_version: 2.6
  serialized_size: 11706
Schema
<pyarrow._parquet.ParquetSchema object at 0x7f0c6ec57e80>
required group field_id=-1 schema {
  optional binary field_id=-1 fruit (String);
}
Column Metadata
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f0c6eccf650>
  file_offset: 264371
  file_path:
  physical_type: BYTE_ARRAY
  num_values: 1048576
  path_in_schema: fruit
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7f0c6eccf5b0>
      has_min_max: True
      min: apple
      max: cherry
      null_count: 0
      distinct_count: None
      num_values: 1048576
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: SNAPPY
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 49
  total_compressed_size: 264367
  total_uncompressed_size: 264347
Column Schema
<ParquetColumnSchema>
  name: fruit
  path: fruit
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Categorical columns loaded using the PyArrow dtype backend require 4x the memory of the NumPy nullable backend.
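The original comparison isn't reproduced here; a minimal sketch of the kind of script behind this report (column name and row count are illustrative):
import numpy as np
import pandas as pd

pd.DataFrame(
    {'fruit': np.random.choice(['apple', 'banana', 'cherry'], size=10_000_000)}
).astype({'fruit': 'category'}).to_parquet('fruit.parquet')

df1 = pd.read_parquet('fruit.parquet', dtype_backend='pyarrow')         # dictionary<string>, int32 indices
df2 = pd.read_parquet('fruit.parquet', dtype_backend='numpy_nullable')  # category, int8 codes

# Roughly a 4x difference, driven by the index/code width
print(df1.memory_usage(deep=True))
print(df2.memory_usage(deep=True))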
Expected Behavior
The memory consumption of categorical fields should be the same across both backend types.
Installed Versions