Closed jackasser closed 6 years ago
this does look like a bug.
@gfyoung ?
Agreed. I can also replicate this on 0.21.0
(@jackasser : I noticed that you were on 0.20.3
). Investigation and PR are welcome!
I am working on this. This seems not the encoding issue.
df.to_csv("df.csv",header=None,index=None)
[pandas] cat df.csv 14:54:23 ☁ add-tuple-list-factorize-test ☂ ✭
""
1.0
2.0
Also, this returns the right result.
df = pd.DataFrame([1,None,1,2])
@gfyoung Seems that the bug or the default spec. in csv
library.
Input:
import csv
fp = open('test.csv', 'w')
w = csv.writer(fp, dialect=csv.excel)
w.writerow(['1'])
w.writerow([''])
fp.close()
Output:
1
Input:
import csv
fp = open('test.csv', 'w')
w = csv.writer(fp, dialect=csv.excel)
w.writerow([''])
w.writerow(['1'])
fp.close()
Output:
""
1
But this works.
import csv
fp = open('test.csv', 'w')
w = csv.writer(fp, dialect=None)
w.writerow(['', '1'])
w.writerow(['3', '2'])
fp.close()
,1
3,2
The csv
library in CPython says the single empty field is quoted.
https://github.com/python/cpython/blob/0b3ec192259a65971001ce8f0de85a9c1e71d9c7/Modules/_csv.c#L1244
I don't know why this is needed.
@Licht-T : Ah! That's very good to know. Okay, this means that this issue is out of the control of the pandas
library since we do wrap Python csv
library when doing this.
@jackasser : Looks like you may have hit upon a point of contention in Python's CSV library. I would raise this issue in their library by submitting an issue on their python.org website.
Closing because this is out of pandas
control. @Licht-T : Thanks for the investigation!
@gfyoung this can be fixed on pandas side just because the csv library doesn’t do this correctly we shouldn’t be inconsistent
doesn’t passing dialect=None work
@jreback : to_csv
uses Python's csv.writer
class to write to CSV. The dialect
parameter goes straight to Python's csv
writer, and that AFAICT has no impact on the output in the example @Licht-T provided.
Here's the code where we initialize our writer:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/formats/format.py#L1644-L1656
UnicodeWriter
utilizes the csv.writer
class behind the scenes. Thus, we are always using Python's csv
library to get the job done, meaning our CSV writing is vulnerable to any issues in Python's CSV module for the time being.
I suppose we could hack our away around this by checking for an empty first row before writing it to CSV and replace it with a space for example, though again as I said, very hackish IMO.
try passing dialect=None
when we pass to the csvwriter
@jreback I already tried, but the result is same.
import csv
fp = open('test.csv', 'w')
w = csv.writer(fp, dialect=None)
w.writerow([''])
w.writerow(['1'])
fp.close()
""
1
The only parameter that makes some impact is quoting
.
import csv
fp = open('test.csv', 'w')
w = csv.writer(fp, dialect=None, quoting=csv.QUOTE_NONE)
w.writerow([''])
w.writerow(['1'])
fp.close()
---------------------------------------------------------------------------
Error Traceback (most recent call last)
<ipython-input-56-97051671206d> in <module>()
2 fp = open('test.csv', 'w')
3 w = csv.writer(fp, dialect=csv.excel,quoting=csv.QUOTE_NONE)
----> 4 w.writerow([''])
5 w.writerow(['1'])
6 fp.close()
Error: single empty field record must be quoted
Well..., this csv
library behavior is very inconsistent and should be fixed in CPython.
Or, is there any reason to do such behavior...?
https://github.com/python/cpython/blame/0b3ec192259a65971001ce8f0de85a9c1e71d9c7/Modules/_csv.c#L1244
@Licht-T certainly can file a bug report there.
ok I guess no easy way to fix this here, however, maybe we should add a small note in the code about this?
@jreback @Licht-T @gfyoung thank you to considering about it! I understand that this problem should solve in Python's CSV library. But I think we should add a small note in the code.
@jreback Created the issue on CPython. https://bugs.python.org/issue32255
I'll add the small note on pandas. Where should I add the note?
@Licht-T thank you for creating issue!
https://github.com/pandas-dev/pandas/blob/ba3a442556e3c04e316f790335427e6d983db01d/pandas/core/frame.py#L1522 ↑ I think here. ↓ and this page http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html?highlight=to_csv#pandas.DataFrame.to_csv
@jackasser @jreback @gfyoung Actually, the double quoted blank field is the default spec. when writing single column CSV. IOW, this issue is the correct behavior, but the "Case 1" in https://github.com/pandas-dev/pandas/issues/18676#issuecomment-350205767 was wrong behavior. This is now fixed in CPython and the patch is backported to CPython 3.6. https://github.com/python/cpython/pull/4769
Please note that this bug does not exist in CPython 2.7.
@Licht-T ok can you add a test for >= 3.6 only, and xfail it for now (as not sure which release its on, though maybe its actually out?).
@jreback Okay! (That fix is not released yet, will be included in the next release of CPython 3.6.)
to_csv with Nan value at top row, unexpected "" in the csv file
Versions: