pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.91k stars 18.03k forks

BUG: support decimal keyword for Float64Dtype in read_csv #52086

Open RobbertDM opened 1 year ago

RobbertDM commented 1 year ago

Pandas version checks

Reproducible Example

import pandas as pd
import io
pd.read_csv(io.StringIO('id\n"1,5"\n'), dtype={'id':pd.Float64Dtype()}, sep=';', decimal=',')

or

import pandas as pd
pd.read_csv("./test.csv", dtype={'id':pd.Float64Dtype()}, sep=';', decimal=',')

where test.csv looks like

id
1,5

Issue Description

I have a semicolon-separated CSV file containing floats that use a comma as the decimal separator. With that setup, when I specify the dtype for that column as Float64Dtype(), read_csv fails.

When I specify Python's built-in float as the dtype instead, it works. When the file uses . as the decimal separator and I specify that, it also works.
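Building on that observation, here is a minimal sketch of a possible workaround (not a fix for the underlying bug): parse the column with the built-in float dtype, which respects decimal=',', and convert to the nullable Float64 dtype afterwards. The inline csv_text stands in for test.csv and is only illustrative.

```python
import io

import pandas as pd

# Stand-in for test.csv: one column, comma as the decimal separator.
csv_text = "id\n1,5\n"

# Parsing with the built-in float dtype honours decimal=','.
df = pd.read_csv(io.StringIO(csv_text), dtype={"id": float}, sep=";", decimal=",")

# Convert to the nullable extension dtype after parsing.
df["id"] = df["id"].astype("Float64")

print(df["id"].dtype)  # Float64
```

This costs an extra conversion step, so it is only a stopgap until read_csv handles decimal for Float64Dtype directly.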

Expected Behavior

The listed example should parse and produce the float 1.5.

Installed Versions

/home/robbert/.local/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit : 2e218d10984e9919f0296931d92ea851c6a6faf5
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-35-generic
Version : #36~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb 17 15:17:25 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.5.3
numpy : 1.23.5
pytz : 2022.1
dateutil : 2.8.1
setuptools : 65.5.1
pip : 23.0.1
Cython : None
pytest : 7.2.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : 8.6.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.1
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : 1.4.46
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : 2022.7

honzavalusek commented 1 year ago

I can confirm that this bug exists on the main branch of pandas.

hxy450 commented 1 year ago

take