pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.86k stars 18.01k forks source link

BUG: read_csv not guessing delimiter #47024

Open Gabriel-p opened 2 years ago

Gabriel-p commented 2 years ago

Pandas version checks

Reproducible Example

import pandas as pd
file = "pandas_test.dat"

# Fails
df = pd.read_csv(file, delimiter=None)

# Works
from astropy.io import ascii
df = ascii.read(file)

Issue Description

pandas fails when trying to guess the delimiter (space) in the test file. astropy on the other hand is able to guess it correctly

The test file is here pandas_test.dat.zip

Expected Behavior

The delimiter should be correctly guessed

Installed Versions

``` pd.__version__ '1.4.2' ```

The line pd.show_versions() throws the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gabriel/miniconda3/envs/asteca/lib/python3.9/site-packages/pandas/util/_print_versions.py", line 109, in show_versions
    deps = _get_dependency_info()
  File "/home/gabriel/miniconda3/envs/asteca/lib/python3.9/site-packages/pandas/util/_print_versions.py", line 88, in _get_dependency_info
    mod = import_optional_dependency(modname, errors="ignore")
  File "/home/gabriel/miniconda3/envs/asteca/lib/python3.9/site-packages/pandas/compat/_optional.py", line 138, in import_optional_dependency
    module = importlib.import_module(name)
  File "/home/gabriel/miniconda3/envs/asteca/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/home/gabriel/miniconda3/envs/asteca/lib/python3.9/site-packages/setuptools/__init__.py", line 8, in <module>
    import _distutils_hack.override  # noqa: F401
  File "/home/gabriel/miniconda3/envs/asteca/lib/python3.9/site-packages/_distutils_hack/override.py", line 1, in <module>
    __import__('_distutils_hack').do_override()
  File "/home/gabriel/miniconda3/envs/asteca/lib/python3.9/site-packages/_distutils_hack/__init__.py", line 72, in do_override
    ensure_local_distutils()
  File "/home/gabriel/miniconda3/envs/asteca/lib/python3.9/site-packages/_distutils_hack/__init__.py", line 59, in ensure_local_distutils
    assert '_distutils' in core.__file__, core.__file__
AssertionError: /home/gabriel/miniconda3/envs/asteca/lib/python3.9/distutils/core.py
MarcoGorelli commented 2 years ago

could you write the example using a string and io.StringIO?

E.g. something like

import io
import pandas as pd

src = """
<contents of file go here>
"""
pd.read_csv(io.StringIO(src))
Gabriel-p commented 2 years ago

Sure, here you go

import io
import pandas as pd

src = """_x _y EDR3Name RA_ICRS DE_ICRS Plx e_Plx PM pmRA e_pmRA pmDE e_pmDE Gmag e_Gmag FG e_FG BPmag e_BPmag FBP e_FBP RPmag e_RPmag FRP e_FRP BP-RP RVDR2 e_RVDR2 BPmag-RPmag e_BPmag-RPmag probs_final
0.389828 -1.7222 "Gaia EDR3 4313229050268323840" 288.9435028448 12.66948178083 0.622 0.3531 6.271 6.232 0.259 0.698 0.269 19.366621 0.003786 337.5 0.8072 20.64046 0.052443 75.72 3.652 18.15438 0.013832 433.9 5.318 2.486078 "" "" 2.486087064793338 0.05423600870281611 0.0
0.386342 -1.721347 "Gaia EDR3 4313229054569379968" 288.93993161452 12.67034020214 -0.7574 0.2662 5.582 -1.403 0.212 -5.402 0.225 18.75413 0.003484 593.3 1.166 20.23801 0.059871 109.7 6.043 17.24599 0.026239 1002.0 23.96 2.99202 "" "" 2.9920827310213873 0.06536844753253333 0.0
0.386393 -1.72086 "Gaia EDR3 4313229054569380096" 288.93998501887 12.67082695456 0.1142 0.2587 5.249 -2.006 0.199 -4.851 0.203 18.952446 0.0035 494.3 0.9827 "" "" "" "" "" "" "" "" "" "" "" 25.3385422158 "" 0.0
0.386892 -1.721164 "Gaia EDR3 4313229054594576640" 288.94049524587 12.67052273154 0.5615 0.2662 7.077 -3.153 0.205 -6.336 0.221 18.92568 0.004866 506.6 1.872 20.339006 0.196341 99.96 18.07 17.430586 0.093665 845.1 72.85 2.908421 "" "" 2.908422475756282 0.21753767734677248 0.0
0.297794 -1.723861 "Gaia EDR3 4313229600024134912" 288.84918410399 12.66795522815 0.5447 0.2703 3.89 -1.867 0.215 -3.413 0.231 19.09285 0.003524 434.3 0.8788 20.287828 0.050999 104.8 4.914 18.010916 0.014288 495.2 6.285 2.276913 "" "" 2.2769242382494923 0.05296348366959039 0.0
0.299739 -1.720999 "Gaia EDR3 4313229600024149376" 288.85118143463 12.67081511593 -0.2938 0.4134 4.444 -0.967 0.325 -4.338 0.361 19.6194 0.004592 267.4 0.9047 21.138195 0.096252 47.88 4.243 18.113098 0.023124 450.7 9.471 3.025097 "" "" 3.0250979402018423 0.09899161859210488 0.0
0.298897 -1.723922 "Gaia EDR3 4313229600033543040" 288.85031435496 12.66789321072 -0.3439 0.5041 6.33 -2.343 0.414 -5.881 0.443 19.860334 0.00445 214.2 0.6895 21.70903 0.176578 28.3 4.602 18.377232 0.016402 353.4 5.195 3.331799 "" "" 3.331808582575391 0.17734043187478374 0.0
0.306587 -1.724078 "Gaia EDR3 4313229600033549056" 288.8581950959 12.66772714342 -0.1717 0.1787 8.312 -3.293 0.152 -7.632 0.155 18.482162 0.003595 762.2 1.621 20.764688 0.326762 67.54 20.33 16.844822 0.006603 1450.0 7.228 3.919867 "" "" 3.919837809303125 0.32682674624416913 0.0
0.306377 -1.72343 "Gaia EDR3 4313229604367380864" 288.8579806796 12.66837511571 0.5513 0.0527 5.849 -0.638 0.046 -5.814 0.047 16.594873 0.002798 4335.0 1.952 17.620792 0.005602 1222.0 5.468 15.554028 0.005777 4759.0 19.15 2.066765 "" "" 2.0667375344565713 0.008047315894240077 0.0
0.296911 -1.723959 "Gaia EDR3 4313229604367383168" 288.84827932335 12.66785882531 0.9563 0.3876 52.88 -34.179 0.293 -40.349 0.317 19.495169 0.003945 299.8 0.7797 21.306465 0.138088 41.0 5.214 18.179111 0.019083 424.1 7.307 3.127354 "" "" 3.1273542272666113 0.13940097910851343 -1.0
0.343844 -1.722793 "Gaia EDR3 4313229737463093248" 288.89637755691 12.66896041473 -0.1131 0.1686 4.585 -2.203 0.133 -4.021 0.138 18.321716 0.003034 883.6 1.033 19.79877 0.046416 164.4 7.016 17.11465 0.006921 1131.0 6.038 2.68412 "" "" 2.684164651263945 0.04692910721558392 0.0
"""
file = io.StringIO(src)

# Fails
df = pd.read_csv(file, delimiter='None')

# Works
df = pd.read_csv(file, delimiter=' ')

The error can also be reproduced with an input as simple as this one:

src = """aaa bbb
0.622 0.3531
0.622 0.3531
0.622 0.3531
0.622 0.3531
"""
simonjayhawkins commented 2 years ago

Thanks @Gabriel-p for the report.

it appears that pd.read_csv(io.StringIO(src), sep=None) raises Error: Could not determine delimiter whereas pd.read_csv(io.StringIO(src), delimiter=None) produces a DataFrame with one column.

According to the docs, delimiter is an alias for sep. The default for delimiter is documented as None (which is strange), whereas the default for sep is ','.

I suspect that explicitly specifying delimiter=None is being ignored and treated as not passed and the default for sep is being used (not checked this though)

Expected Behavior

The delimiter should be correctly guessed

In any case, if the seperator/delimitor cannot be determined with sep=None, the same should be true for delimiter=None, so the expected behavior is to raise Error: Could not determine delimiter

contributions and PRs welcome.

Gabriel-p commented 2 years ago

@simonjayhawkins sorry, the src variable in the comment above was badly defined (it had an extra line at the top). I've fixed it now and the results are:

src = """aaa bbb
0.622 0.3531
0.622 0.3531
0.622 0.3531
0.622 0.3531
"""
file = io.StringIO(src)

# Fails
df = pd.read_csv(file, delimiter=None)

# Works
df = pd.read_csv(file, sep=None)

So sep=None actually works, the problem is just with delimiter=None.

dimitra-karadima commented 2 years ago

@Gabriel-p @simonjayhawkins Hi! I would like to try to fix this issue. But as I checked sep=None raises Could not determine delimiterwhile the delimeter=None produces

       aaa bbb
0  0.622 0.3531
1  0.622 0.3531
2  0.622 0.3531
3  0.622 0.3531

which is the complete opposite of the above comment.

dannyi96 commented 2 years ago

Observations

Case 1 - when delimiter is specified as None ( no sep specified )

https://github.com/pandas-dev/pandas/blob/b0237ed682fd226e6c76327e6524b54ea7ace9a3/pandas/io/parsers/readers.py#L2080-L2084

delimiter is set to a default value ( , ) and hence we don't get the error of - Could not determine delimiter

Case 2 - when sep is specified as None ( no delimiter specified )

https://github.com/pandas-dev/pandas/blob/b0237ed682fd226e6c76327e6524b54ea7ace9a3/pandas/io/parsers/readers.py#L2063-L2065

delimiter is set to None in this case ( & not default , ) and hence we get the error Could not determine delimiter

Next steps

We may need to streamline these two cases either - A] use default , for both cases B] use None for both cases so error Could not determine delimiter is thrown.

@simonjayhawkins : any suggestions here ? Thanks in advance

dannyi96 commented 2 years ago

astropy on the other hand is able to guess it correctly

just looked into this ( was wondering if pandas could implement something similar ) but seems like astropy package also doesn't guess but rather it uses default delimiter space ( ' ')

https://github.com/astropy/astropy/blob/8262a315917ca51a01e4cbe63fcf9000048309dd/astropy/io/ascii/basic.py#L48-L56

dimitra-karadima commented 2 years ago

take

dimitra-karadima commented 2 years ago

Right now, tests are failing and the behavior is not as I described above, so I guess I have to wait until it is fixed in the main branch. My question is that it is not very clear to me which test file is responsible for this change, so I do not know where to add the test cases for this bugfix.