Open Gabriel-p opened 2 years ago
could you write the example using a string and io.StringIO
?
E.g. something like
import io
import pandas as pd
src = """
<contents of file go here>
"""
pd.read_csv(io.StringIO(src))
Sure, here you go
import io
import pandas as pd
src = """_x _y EDR3Name RA_ICRS DE_ICRS Plx e_Plx PM pmRA e_pmRA pmDE e_pmDE Gmag e_Gmag FG e_FG BPmag e_BPmag FBP e_FBP RPmag e_RPmag FRP e_FRP BP-RP RVDR2 e_RVDR2 BPmag-RPmag e_BPmag-RPmag probs_final
0.389828 -1.7222 "Gaia EDR3 4313229050268323840" 288.9435028448 12.66948178083 0.622 0.3531 6.271 6.232 0.259 0.698 0.269 19.366621 0.003786 337.5 0.8072 20.64046 0.052443 75.72 3.652 18.15438 0.013832 433.9 5.318 2.486078 "" "" 2.486087064793338 0.05423600870281611 0.0
0.386342 -1.721347 "Gaia EDR3 4313229054569379968" 288.93993161452 12.67034020214 -0.7574 0.2662 5.582 -1.403 0.212 -5.402 0.225 18.75413 0.003484 593.3 1.166 20.23801 0.059871 109.7 6.043 17.24599 0.026239 1002.0 23.96 2.99202 "" "" 2.9920827310213873 0.06536844753253333 0.0
0.386393 -1.72086 "Gaia EDR3 4313229054569380096" 288.93998501887 12.67082695456 0.1142 0.2587 5.249 -2.006 0.199 -4.851 0.203 18.952446 0.0035 494.3 0.9827 "" "" "" "" "" "" "" "" "" "" "" 25.3385422158 "" 0.0
0.386892 -1.721164 "Gaia EDR3 4313229054594576640" 288.94049524587 12.67052273154 0.5615 0.2662 7.077 -3.153 0.205 -6.336 0.221 18.92568 0.004866 506.6 1.872 20.339006 0.196341 99.96 18.07 17.430586 0.093665 845.1 72.85 2.908421 "" "" 2.908422475756282 0.21753767734677248 0.0
0.297794 -1.723861 "Gaia EDR3 4313229600024134912" 288.84918410399 12.66795522815 0.5447 0.2703 3.89 -1.867 0.215 -3.413 0.231 19.09285 0.003524 434.3 0.8788 20.287828 0.050999 104.8 4.914 18.010916 0.014288 495.2 6.285 2.276913 "" "" 2.2769242382494923 0.05296348366959039 0.0
0.299739 -1.720999 "Gaia EDR3 4313229600024149376" 288.85118143463 12.67081511593 -0.2938 0.4134 4.444 -0.967 0.325 -4.338 0.361 19.6194 0.004592 267.4 0.9047 21.138195 0.096252 47.88 4.243 18.113098 0.023124 450.7 9.471 3.025097 "" "" 3.0250979402018423 0.09899161859210488 0.0
0.298897 -1.723922 "Gaia EDR3 4313229600033543040" 288.85031435496 12.66789321072 -0.3439 0.5041 6.33 -2.343 0.414 -5.881 0.443 19.860334 0.00445 214.2 0.6895 21.70903 0.176578 28.3 4.602 18.377232 0.016402 353.4 5.195 3.331799 "" "" 3.331808582575391 0.17734043187478374 0.0
0.306587 -1.724078 "Gaia EDR3 4313229600033549056" 288.8581950959 12.66772714342 -0.1717 0.1787 8.312 -3.293 0.152 -7.632 0.155 18.482162 0.003595 762.2 1.621 20.764688 0.326762 67.54 20.33 16.844822 0.006603 1450.0 7.228 3.919867 "" "" 3.919837809303125 0.32682674624416913 0.0
0.306377 -1.72343 "Gaia EDR3 4313229604367380864" 288.8579806796 12.66837511571 0.5513 0.0527 5.849 -0.638 0.046 -5.814 0.047 16.594873 0.002798 4335.0 1.952 17.620792 0.005602 1222.0 5.468 15.554028 0.005777 4759.0 19.15 2.066765 "" "" 2.0667375344565713 0.008047315894240077 0.0
0.296911 -1.723959 "Gaia EDR3 4313229604367383168" 288.84827932335 12.66785882531 0.9563 0.3876 52.88 -34.179 0.293 -40.349 0.317 19.495169 0.003945 299.8 0.7797 21.306465 0.138088 41.0 5.214 18.179111 0.019083 424.1 7.307 3.127354 "" "" 3.1273542272666113 0.13940097910851343 -1.0
0.343844 -1.722793 "Gaia EDR3 4313229737463093248" 288.89637755691 12.66896041473 -0.1131 0.1686 4.585 -2.203 0.133 -4.021 0.138 18.321716 0.003034 883.6 1.033 19.79877 0.046416 164.4 7.016 17.11465 0.006921 1131.0 6.038 2.68412 "" "" 2.684164651263945 0.04692910721558392 0.0
"""
file = io.StringIO(src)
# Fails
df = pd.read_csv(file, delimiter='None')
# Works
df = pd.read_csv(file, delimiter=' ')
The error can also be reproduced with an input as simple as this one:
src = """aaa bbb
0.622 0.3531
0.622 0.3531
0.622 0.3531
0.622 0.3531
"""
Thanks @Gabriel-p for the report.
it appears that pd.read_csv(io.StringIO(src), sep=None)
raises Error: Could not determine delimiter
whereas pd.read_csv(io.StringIO(src), delimiter=None)
produces a DataFrame with one column.
According to the docs, delimiter
is an alias for sep
. The default for delimiter
is documented as None
(which is strange), whereas the default for sep
is ','
.
I suspect that explicitly specifying delimiter=None
is being ignored and treated as not passed and the default for sep
is being used (not checked this though)
Expected Behavior
The delimiter should be correctly guessed
In any case, if the seperator/delimitor cannot be determined with sep=None
, the same should be true for delimiter=None
, so the expected behavior is to raise Error: Could not determine delimiter
contributions and PRs welcome.
@simonjayhawkins sorry, the src
variable in the comment above was badly defined (it had an extra line at the top). I've fixed it now and the results are:
src = """aaa bbb
0.622 0.3531
0.622 0.3531
0.622 0.3531
0.622 0.3531
"""
file = io.StringIO(src)
# Fails
df = pd.read_csv(file, delimiter=None)
# Works
df = pd.read_csv(file, sep=None)
So sep=None
actually works, the problem is just with delimiter=None
.
@Gabriel-p @simonjayhawkins Hi! I would like to try to fix this issue.
But as I checked sep=None
raises Could not determine delimiter
while the delimeter=None
produces
aaa bbb
0 0.622 0.3531
1 0.622 0.3531
2 0.622 0.3531
3 0.622 0.3531
which is the complete opposite of the above comment.
Observations
Case 1 - when delimiter is specified as None ( no sep specified )
delimiter is set to a default value ( ,
) and hence we don't get the error of - Could not determine delimiter
Case 2 - when sep is specified as None ( no delimiter specified )
delimiter is set to None in this case ( & not default ,
) and hence we get the error Could not determine delimiter
Next steps
We may need to streamline these two cases either -
A] use default ,
for both cases
B] use None for both cases so error Could not determine delimiter
is thrown.
@simonjayhawkins : any suggestions here ? Thanks in advance
astropy
on the other hand is able to guess it correctly
just looked into this ( was wondering if pandas could implement something similar )
but seems like astropy package also doesn't guess but rather it uses default delimiter space ( ' '
)
take
Right now, tests are failing and the behavior is not as I described above, so I guess I have to wait until it is fixed in the main branch. My question is that it is not very clear to me which test file is responsible for this change, so I do not know where to add the test cases for this bugfix.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
pandas
fails when trying to guess the delimiter (space) in the test file.astropy
on the other hand is able to guess it correctlyThe test file is here pandas_test.dat.zip
Expected Behavior
The delimiter should be correctly guessed
Installed Versions
The line
pd.show_versions()
throws the following error: