pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Categorical columns using the PyArrow backend require 4x more memory #58062

Open adampinky85 opened 7 months ago

adampinky85 commented 7 months ago

Pandas version checks

Reproducible Example

import os
import tempfile

import numpy as np
import pandas as pd

# create dataframe of categorical data
select = ("apple", "banana", "cherry")
data = np.random.choice(select, size=10**8)
df = pd.DataFrame(data=data, columns=("fruit",), dtype="category")
df.shape  # (100000000, 1)

temp_fd, temp_file = tempfile.mkstemp()

# saves dataframe to parquet file
df.to_parquet(temp_file)

# open dataframe with pyarrow backend
df1 = pd.read_parquet(temp_file, dtype_backend="pyarrow")
df1.fruit.dtype  # dictionary<values=string, indices=int32, ordered=0>[pyarrow]
df1.memory_usage(deep=True).fruit / 1024**2  # 381.4908285140991

# open dataframe with numpy nullable backend
df2 = pd.read_parquet(temp_file, dtype_backend="numpy_nullable")
df2.fruit.dtype  # CategoricalDtype(categories=['apple', 'banana', 'cherry'], ordered=False, categories_dtype=object)
df2.memory_usage(deep=True).fruit / 1024**2  # 95.36769104003906

os.close(temp_fd)

Issue Description

Categorical columns loaded using the PyArrow dtype backend require 4x the memory of the NumPy nullable backend.
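The 4x ratio lines up with the index widths alone. A back-of-the-envelope check (my own sketch, assuming one index entry per row and ignoring the small dictionary of category values):

```python
rows = 10**8

# 32-bit dictionary indices: 4 bytes per row
int32_mib = rows * 4 / 1024**2
print(int32_mib)  # ~381.47 MiB, matching the pyarrow backend figure above

# 8-bit Categorical codes (enough for 3 categories): 1 byte per row
int8_mib = rows * 1 / 1024**2
print(int8_mib)   # ~95.37 MiB, matching the numpy_nullable backend figure
```

Both numbers match the `memory_usage(deep=True)` output in the reproducible example to within rounding, so the gap is entirely explained by index width.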

Expected Behavior

Memory consumption for categorical columns should be the same across both backend types.

Installed Versions

INSTALLED VERSIONS
------------------
commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python : 3.12.2.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.336-257.566.amzn2.x86_64
Version : #1 SMP Sat Mar 9 09:49:51 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.1
numpy : 1.26.4
pytz : 2023.3
dateutil : 2.9.0.post0
setuptools : 69.2.0
pip : 24.0
Cython : 3.0.9
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.2.0
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.18.0
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.8
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.6.0
gcsfs : None
matplotlib : 3.8.3
numba : 0.59.1
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 15.0.2
pyreadstat : None
python-calamine : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.12.0
sqlalchemy : None
tables : None
tabulate : None
xarray : 2024.2.0
xlrd : 2.0.1
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
glaucouri commented 6 months ago

The reason is how the categorical indices are generated in memory; in your case it is exactly the difference between int32 and int8.

You can check it:

print(df1['fruit'].dtype)  # dictionary<values=string, indices=int32, ordered=0>[pyarrow]
print(df2['fruit'].cat.codes.dtype) # int8

However, this seems to happen at a lower level (I'm not a pandas expert). It is reproducible using pyarrow alone:

import pyarrow.parquet as pq
import pandas as pd
import pyarrow as pa

def pyarrow_categorical_dtype(vals):
    """
    Build an ArrowDtype from the dictionary-encoded values.
    https://arrow.apache.org/docs/python/generated/pyarrow.DictionaryArray.html
    """
    as_dict_vals = pa.array(vals).dictionary_encode()
    return pd.ArrowDtype(as_dict_vals.type)

# df and temp_file are from the original reproducible example
df.dtypes  # fruit    category
dfa = df.astype(pyarrow_categorical_dtype(df['fruit']))
dfa.dtypes  # fruit    dictionary<values=string, indices=int8, ordered=0>[pyarrow]
# It is using int8 (the optimal type) for the indices
dfa.to_parquet(temp_file)

Now, reloading the file with either pandas or pyarrow, the type is no longer recognized:

df4 = pd.read_parquet(temp_file)
df4 = pq.read_table(temp_file).to_pandas(ignore_metadata=False, types_mapper=pd.ArrowDtype)
# Both raise the same error:
# ValueError: format number 1 of "dictionary<values=string, indices=int8, ordered=0>[pyarrow]" is not recognized

This makes me think the problem occurs during loading. It can be tested by ignoring the metadata:

pq.read_table(temp_file).to_pandas(ignore_metadata=True, types_mapper=pd.ArrowDtype)
# fruit    dictionary<values=string, indices=int32, ordered=0>[pyarrow]

This works, but the indices fall back to the default int32, no longer the optimal type.

Tested with Python 3.10 and pandas 2.2.1.

Glauco

adampinky85 commented 6 months ago

Thanks. I would assume it's a common case for categorical data to have low cardinality that fits within an int8.

On disk, Parquet appears to store the category data with logical type String, Snappy-compressed and dictionary-encoded: https://parquet.apache.org/docs/file-format/data-pages/encodings/

It's really important for our use case, given the large volume of data (billions of rows), to ensure our in-memory representation is optimal. If the team could review and provide any advice, that would be much appreciated! Thanks.

Metadata

<pyarrow._parquet.FileMetaData object at 0x7f0c6f204630>
  created_by: parquet-cpp-arrow version 15.0.2
  num_columns: 1
  num_rows: 100000000
  num_row_groups: 96
  format_version: 2.6
  serialized_size: 11706

Schema

<pyarrow._parquet.ParquetSchema object at 0x7f0c6ec57e80>
required group field_id=-1 schema {
  optional binary field_id=-1 fruit (String);
}

Column Metadata

<pyarrow._parquet.ColumnChunkMetaData object at 0x7f0c6eccf650>
  file_offset: 264371
  file_path: 
  physical_type: BYTE_ARRAY
  num_values: 1048576
  path_in_schema: fruit
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7f0c6eccf5b0>
      has_min_max: True
      min: apple
      max: cherry
      null_count: 0
      distinct_count: None
      num_values: 1048576
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: SNAPPY
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 49
  total_compressed_size: 264367
  total_uncompressed_size: 264347

Column Schema

<ParquetColumnSchema>
  name: fruit
  path: fruit
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8