BUG: reading long csv with high numeric values displays a "mixed types" message

MayeulC-ST commented 3 weeks ago

Pandas version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# Load csv attached to this issue
import pandas as pd
df = pd.read_csv("pandas-bug-reproducer.csv", header=0, index_col=False)

Issue Description

The read_csv command results in the following message (this is ipython output, but it also happens non-interactively)

<ipython-input-61-2957767dea3a>:1: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.

Column 7 is then imported as strings, not floats.

I can work around this by using the methods in the hint, but this smells like a bug, as if I remove any line in the CSV, the issue disappears. If I replace the last line by a copy-paste of the one before, the bug also goes away.

It is quite tricky to create a small reproducer, so I am attaching the file here. Replacing all text with "a" and values with "1" kept the issue, while making the data anonymous and very compressible: pandas-bug-reproducer.zip

Expected Behavior

This message should not appear, and the data in column 7 should be imported as floating point values.

Moreover, changing the input csv by adding or removing random lines should not affect pandas's behavior.

Installed Versions

First version I tried

INSTALLED VERSIONS ------------------ commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140 python : 3.10.4.final.0 python-bits : 64 OS : Linux OS-release : 4.18.0-553.16.1.el8_10.x86_64 Version : #1 SMP Thu Aug 1 04:16:12 EDT 2024 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 2.2.2 numpy : 1.26.4 pytz : 2024.1 dateutil : 2.9.0.post0 setuptools : 69.2.0 pip : 24.2 Cython : 3.0.2 pytest : 8.2.2 hypothesis : None sphinx : 7.3.7 blosc : None feather : None xlsxwriter : 3.1.9 lxml.etree : 4.9.3 html5lib : None pymysql : 1.0.2 psycopg2 : None jinja2 : 3.1.4 IPython : 8.4.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.3 bottleneck : 1.4.0 dataframe-api-compat : None fastparquet : None fsspec : 2024.3.1 gcsfs : None matplotlib : 3.7.3 numba : 0.60.0 numexpr : 2.10.0 odfpy : None openpyxl : 3.1.2 pandas_gbq : None pyarrow : 17.0.0 pyreadstat : None python-calamine : None pyxlsb : None s3fs : None scipy : 1.14.1 sqlalchemy : 2.0.15 tables : None tabulate : 0.9.0 xarray : None xlrd : 2.0.1 zstandard : None tzdata : 2024.1 qtpy : None pyqt5 : None

Second version I tried

INSTALLED VERSIONS ------------------ commit : 0691c5cf90477d3503834d983f69350f250a6ff7 python : 3.12.2 python-bits : 64 OS : Linux OS-release : 4.18.0-553.16.1.el8_10.x86_64 Version : #1 SMP Thu Aug 1 04:16:12 EDT 2024 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 2.2.3 numpy : 2.1.2 pytz : 2024.2 dateutil : 2.9.0.post0 pip : 24.0 Cython : None sphinx : 8.1.3 IPython : 8.28.0 adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : None blosc : None bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : None html5lib : None hypothesis : None gcsfs : None jinja2 : 3.1.4 lxml.etree : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None psycopg2 : None pymysql : None pyarrow : None pyreadstat : None pytest : None python-calamine : None pyxlsb : None s3fs : None scipy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlsxwriter : None zstandard : None tzdata : 2024.2 qtpy : None pyqt5 : None

Liam3851 commented 3 weeks ago

I can work around this by using the methods in the hint, but this smells like a bug, as if I remove any line in the CSV, the issue disappears. If I replace the last line by a copy-paste of the one before, the bug also goes away.

This is a red herring: your file is exactly 65,537 lines long. Because your file has 8 columns, the chunk size of the CSV parser is 2 ** 16 or 65,536 lines.

When you delete a row, you are parsing it in one chunk, and do not receive the warning. When you add the row back, you are parsing it in two chunks, raising the warning about the ambiguous dtypes.

You should not see the dtype changing with or without the extra row added, but I indeed do not in your reproducer.

This message should not appear, and the data in column 7 should be imported as floating point values.

It appears your floating point values are greater than 10**20 but not in scientific notation. I can reproduce your parsing as string for strings with such high values:

In [225]: pd.read_csv(io.StringIO('1' * 20 + '.1'), header=None).dtypes
Out[225]:
0    float64
dtype: object

In [226]: pd.read_csv(io.StringIO('1' * 21 + '.1'), header=None).dtypes
Out[226]:
0    object
dtype: object

My guess is here this has something to do with the value represented being > 2 ** 64, which would be the highest possible integer.

In [233]: pd.read_csv(io.StringIO('1' * 19), header=None).dtypes
Out[233]:
0    int64
dtype: object

In [234]: pd.read_csv(io.StringIO('1' * 20), header=None).dtypes
Out[234]:
0    uint64
dtype: object

In [235]: pd.read_csv(io.StringIO('1' * 21), header=None).dtypes
Out[235]:
0    object
dtype: object

Now obviously you can represent floats of greater value than this in non-scientific notation. So then the question is whether read_csv should be able to parse floats with values this high as floats. Given you're specifying many more significant figures than can be represented in a 64-bit floating point value, though, there's an argument to be made that pandas is doing you a favor-- if you really need to retain 32 digits of data, you can't do that in an np.float64, you need a string. And if you want a float, you can always specify the dtype and get exactly what you want:

In [238]: pd.read_csv(io.StringIO('1' * 21 + '.1'), header=None, dtype=np.float64).dtypes
Out[238]:
0    float64
dtype: object

In [239]: pd.read_csv(io.StringIO('1' * 21 + '.1'), header=None, dtype=np.float64)
Out[239]:
              0
0  1.111111e+20

MayeulC-ST commented 3 weeks ago

All right, thank you for your answer, that makes sense. I had noticed the very suspicious 65538 lines, and forgot to mention it in the ticket, but could not figure how it was related to the number of columns. The number of columns affecting the chunk size makes sense.

If I understand this right, the warning is being displayed as the last column in the first chunk is parsed as str, and it is parsed as float in the second chunk.

Then, here is a smaller, self-contained reproducer based on your approach:

>>> import pandas as pd
>>> import io
>>> inputstr = ('0,'*7 + '1' * 21+'\n')*65536 + ('0,'*7+'1\n')
>>> df = pd.read_csv(io.StringIO(inputstr), header=None)
<stdin>:1: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.

Notice how the message goes away if you change 65536 to 65535 (now parsed as a single block), or 65537 (now the second block is also parsed as str).

I agree that this is a delicate issue to address. The message could be improved, a line could be added in the documentation, or perhaps people will find this issue thread when looking up the issue online. I will let you decide on the best way to move forward, do not hesitate to close this issue.

I could help with the documentation on my spare time, if that's the best way forward.

As an aside, for this specific processing script, values should not have been that high in the first place, this is an issue on the generation side, so I do not care much about accuracy (and was able to parse them as fp64 by specifying the datatype). The script I initially wrote was working fine until a csv file with these huge values came up -- and then crashed due to numeric operation failing on strings.

rhshadrach commented 2 weeks ago

The message could be improved, a line could be added in the documentation

It's not clear to me how this could be improved in a concise way. The OP mentioned that they were able to resolve with the current message and it seemed to be without difficulty. Suggestions here are welcome, but I'm going to mark this as a closing candidate for now, and if none materialize, I think this can be closed.

pandas-dev / pandas