I have encountered an issue with the read_csv() function in pandas when using the pyarrow engine. Even when dtype=str is specified, purely numeric strings are converted to a numeric type. In addition, purely numeric strings that start with one or more zeros lose their leading zeros in the resulting DataFrame. This behavior is unexpected, as I would like to preserve the original format of the numeric strings as text.
Expected Behavior
The example reads the same CSV file with each engine. With dtype=str, the pyarrow engine is expected to produce the same DataFrame as the c engine and the python engine. It should output the following text.
# This is read by pyarrow engine.
  sample_name  case_id  sample_id sample_type
0    sample_1        1       1001           T
1    sample_2        2       1002           T
# This is read by c engine.
  sample_name case_id sample_id sample_type
0    sample_1   00001  00001001           T
1    sample_2   00002  00001002           T
# This is read by python engine.
  sample_name case_id sample_id sample_type
0    sample_1   00001  00001001           T
1    sample_2   00002  00001002           T
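The expectation can also be written as a check: with dtype=str, every cell should equal the raw text in the file. The sketch below is only an illustration. The CSV content is an assumption reconstructed from the c engine output above (comma-delimited), and the names `CSV_TEXT`, `raw_cells` and `preserves_text` are hypothetical helpers, not part of pandas.

```python
import csv
import io

import pandas as pd

# Assumed file content, reconstructed from the c-engine output in this report.
CSV_TEXT = (
    "sample_name,case_id,sample_id,sample_type\n"
    "sample_1,00001,00001001,T\n"
    "sample_2,00002,00001002,T\n"
)


def raw_cells(csv_text: str) -> list:
    """Return the data rows exactly as written in the file (header dropped)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return rows[1:]


def preserves_text(df: pd.DataFrame, csv_text: str) -> bool:
    """True if every cell of df is the same string as in the raw CSV text."""
    return df.values.tolist() == raw_cells(csv_text)


if __name__ == "__main__":
    # Default c engine with dtype=str: leading zeros should survive.
    as_text = pd.read_csv(io.StringIO(CSV_TEXT), dtype=str)
    print(preserves_text(as_text, CSV_TEXT))
```

A DataFrame whose case_id and sample_id columns were parsed as integers fails this check, which is the behavior reported here for the pyarrow engine.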
Pandas version checks
[ ] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
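The code for this section is missing from the text. A minimal sketch of what such a reproduction might look like follows. It is not the original snippet: the CSV content is an assumption reconstructed from the c engine output above (comma-delimited), and `CSV_BYTES` and `read_with_engine` are hypothetical names used only for illustration.

```python
import io

import pandas as pd

# Assumed CSV content, reconstructed from the c-engine output shown in this report.
CSV_BYTES = (
    b"sample_name,case_id,sample_id,sample_type\n"
    b"sample_1,00001,00001001,T\n"
    b"sample_2,00002,00001002,T\n"
)


def read_with_engine(engine: str) -> pd.DataFrame:
    """Read the sample CSV with the given parser engine, requesting str for every column."""
    return pd.read_csv(io.BytesIO(CSV_BYTES), engine=engine, dtype=str)


if __name__ == "__main__":
    for engine in ("pyarrow", "c", "python"):
        try:
            df = read_with_engine(engine)
        except ImportError:
            # The pyarrow engine needs the optional pyarrow package.
            print(f"# {engine} engine skipped: pyarrow is not installed.")
            continue
        print(f"# This is read by {engine} engine.")
        print(df)
        print(df.dtypes)
```

Printing `df.dtypes` alongside each frame makes it easy to see whether case_id and sample_id came back as text or as integers on the installed pandas and pyarrow versions.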
Installed Versions