pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: pyarrow stripping leading zeros with dtype=str #57666

Open dadrake3 opened 6 months ago

dadrake3 commented 6 months ago

Pandas version checks

Reproducible Example

import pandas as pd
from io import StringIO

x = """
AB|000388907|abc|0150
AB|101044572|abc|0150
AB|000023607|abc|0205
AB|100102040|abc|0205
"""

df_arrow = pd.read_csv(
    StringIO(x),
    delimiter="|",
    header=None,
    dtype=str,
    engine="pyarrow",
    keep_default_na=False,
)

df_python = pd.read_csv(
    StringIO(x),
    delimiter="|",
    header=None,
    dtype=str,
    engine="python",
    keep_default_na=False,
)

df_arrow
        0          1    2    3
0      AB     388907  abc  150
1      AB  101044572  abc  150
2      AB      23607  abc  205
3      AB  100102040  abc  205

df_python
        0          1    2     3
0      AB  000388907  abc  0150
1      AB  101044572  abc  0150
2      AB  000023607  abc  0205
3      AB  100102040  abc  0205

Issue Description

When I use engine="pyarrow" and set dtype=str, the leading zeros in my numeric columns are stripped, even though the resulting column dtype is 'O'. When I use the python engine, the leading zeros are preserved as expected.

Expected Behavior

I would expect that, when all columns are treated as strings, the leading zeros are retained and the data is left unmodified.

Installed Versions

INSTALLED VERSIONS
------------------
commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python : 3.11.8.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-17-generic
Version : #17-Ubuntu SMP PREEMPT_DYNAMIC Thu Jan 11 14:20:13 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 2.2.1
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.1.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.27
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
benjaminbauer commented 6 months ago

can confirm this in 2.2.1

mroeschke commented 6 months ago

Do you get the same result when you use the pyarrow.csv.read_csv method directly? https://arrow.apache.org/docs/python/generated/pyarrow.csv.read_csv.html

benjaminbauer commented 6 months ago

This seems to be a problem with the pyarrow engine in pandas (pandas 2.2.1, pyarrow 15.0.1). Output of the script below:

pandas engine pandas dtype: 01
pyarrow engine pandas dtype: 1
pandas engine pyarrow dtype: 01
pyarrow engine pyarrow dtype: 1
pyarrow native: 01
import io

import pandas as pd
import pyarrow as pa
from pyarrow import csv

csv_file = io.BytesIO(
    """a
01""".encode()
)

print(f"pandas engine pandas dtype: {pd.read_csv(csv_file, dtype=str).iloc[0,0]}")

csv_file.seek(0)
print(
    f"pyarrow engine pandas dtype: {pd.read_csv(csv_file, dtype=str, engine='pyarrow').iloc[0,0]}"
)

csv_file.seek(0)
print(
    f"pandas engine pyarrow dtype: {pd.read_csv(csv_file, dtype='str[pyarrow]').iloc[0,0]}"
)

csv_file.seek(0)
print(
    f"pyarrow engine pyarrow dtype: {pd.read_csv(csv_file, dtype='str[pyarrow]', engine='pyarrow').iloc[0,0]}"
)

csv_file.seek(0)
convert_options = csv.ConvertOptions(column_types={"a": pa.string()})
print(
    f"pyarrow native: {csv.read_csv(csv_file, convert_options=convert_options).column(0).to_pylist()[0]}"
)
kristinburg commented 6 months ago

take

jorisvandenbossche commented 6 months ago

For context, this is happening because currently the dtype argument is only handled as a post-processing step after pyarrow has read the CSV file. So, right now, we let pyarrow read the CSV with type inference (which, in this case, infers numerical types), and afterwards we cast the result to the specified dtype (in this case str). That explains why the leading zeros are lost.

https://github.com/pandas-dev/pandas/blob/e51039afe3cbdedbf5ffd5cefb5dea98c2050b88/pandas/io/parsers/arrow_parser_wrapper.py#L216
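A minimal illustration of that post-read cast (a sketch, assuming the column has already been inferred as int64; the values are taken from the example above):

import pyarrow as pa

# pyarrow has already inferred int64 for the column, so the leading
# zeros are gone before the cast back to string ever happens.
inferred = pa.array([388907, 101044572])       # parsed from "000388907", "101044572"
print(inferred.cast(pa.string()).to_pylist())  # ['388907', '101044572']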

PyArrow does provide a column_types keyword to specify the dtype while reading: https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions
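For illustration, a short sketch of column_types with the data from the report (the column names "0" through "3" and the specific options chosen here are assumptions):

import io

import pyarrow as pa
from pyarrow import csv

data = io.BytesIO(b"AB|000388907|abc|0150\n")
table = csv.read_csv(
    data,
    read_options=csv.ReadOptions(column_names=["0", "1", "2", "3"]),
    parse_options=csv.ParseOptions(delimiter="|"),
    convert_options=csv.ConvertOptions(column_types={"1": pa.string(), "3": pa.string()}),
)
print(table.column("1").to_pylist())  # ['000388907'] -- zeros preserved when read as string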

So what we need to do is translate the dtype keyword in read_csv to the column_types argument for pyarrow, somewhere here:

https://github.com/pandas-dev/pandas/blob/e51039afe3cbdedbf5ffd5cefb5dea98c2050b88/pandas/io/parsers/arrow_parser_wrapper.py#L120-L134

As a first step, I would just try to enable specifying it for a specific column, like dtype={"a": str}. For something like dtype=str, which applies to all columns, you would need to know the column names up front with the current pyarrow API (pyarrow only accepts specifying types per column, not for all columns at once).
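A rough sketch of what that translation could look like for the per-column case (this is not the actual pandas implementation; the helper name _translate_dtype_mapping and the special-casing of str are assumptions for illustration):

import io

import pandas as pd
import pyarrow as pa
from pyarrow import csv


def _translate_dtype_mapping(dtype_mapping):
    # Map a pandas-style {column: dtype} dict to {column: pyarrow DataType}.
    column_types = {}
    for col, dtype in dtype_mapping.items():
        if dtype in (str, "str", "string"):
            column_types[col] = pa.string()
        else:
            column_types[col] = pa.from_numpy_dtype(pd.api.types.pandas_dtype(dtype))
    return column_types


csv_file = io.BytesIO(b"a,b\n01,2\n")
convert_options = csv.ConvertOptions(column_types=_translate_dtype_mapping({"a": str}))
table = csv.read_csv(csv_file, convert_options=convert_options)
print(table.column("a").to_pylist())  # ['01'] -- leading zero preserved

For the dtype=str case that applies to all columns, the column names would still need to be determined first (e.g. from the header) before such a mapping could be built, as noted above.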