pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.71k stars 17.92k forks source link

to_csv writes wrong with NaN value #18676

Closed jackasser closed 6 years ago

jackasser commented 6 years ago

to_csv with Nan value at top row, unexpected "" in the csv file

import pandas as pd
df = pd.DataFrame([None,1,2])
df.to_csv("df.csv",header=None,index=None,encoding ='utf-8')

# df.csv
#""
#1.0
#2.0

# I want to make df.csv
#
#1.0
#2.0

Versions:

pandas.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
​
jreback commented 6 years ago

this does look like a bug.

@gfyoung ?

gfyoung commented 6 years ago

Agreed. I can also replicate this on 0.21.0 (@jackasser : I noticed that you were on 0.20.3). Investigation and PR are welcome!

Licht-T commented 6 years ago

I am working on this. This seems not the encoding issue.

df.to_csv("df.csv",header=None,index=None)
[pandas] cat df.csv                                                                                                       14:54:23  ☁  add-tuple-list-factorize-test ☂ ✭
""
1.0
2.0
Licht-T commented 6 years ago

Also, this returns the right result.

df = pd.DataFrame([1,None,1,2])
Licht-T commented 6 years ago

@gfyoung Seems that the bug or the default spec. in csv library.

Case 1

Input:

import csv
fp = open('test.csv', 'w')
w = csv.writer(fp, dialect=csv.excel)
w.writerow(['1'])
w.writerow([''])
fp.close()

Output:

1
 

Case 2

Input:

import csv
fp = open('test.csv', 'w')
w = csv.writer(fp, dialect=csv.excel)
w.writerow([''])
w.writerow(['1'])
fp.close()

Output:

""
1
Licht-T commented 6 years ago

But this works.

import csv
fp = open('test.csv', 'w')
w = csv.writer(fp, dialect=None)
w.writerow(['', '1'])
w.writerow(['3', '2'])
fp.close()
,1
3,2
Licht-T commented 6 years ago

The csv library in CPython says the single empty field is quoted. https://github.com/python/cpython/blob/0b3ec192259a65971001ce8f0de85a9c1e71d9c7/Modules/_csv.c#L1244

I don't know why this is needed.

gfyoung commented 6 years ago

@Licht-T : Ah! That's very good to know. Okay, this means that this issue is out of the control of the pandas library since we do wrap Python csv library when doing this.

@jackasser : Looks like you may have hit upon a point of contention in Python's CSV library. I would raise this issue in their library by submitting an issue on their python.org website.

gfyoung commented 6 years ago

Closing because this is out of pandas control. @Licht-T : Thanks for the investigation!

jreback commented 6 years ago

@gfyoung this can be fixed on pandas side just because the csv library doesn’t do this correctly we shouldn’t be inconsistent

doesn’t passing dialect=None work

gfyoung commented 6 years ago

@jreback : to_csv uses Python's csv.writer class to write to CSV. The dialect parameter goes straight to Python's csv writer, and that AFAICT has no impact on the output in the example @Licht-T provided.

Here's the code where we initialize our writer:

https://github.com/pandas-dev/pandas/blob/master/pandas/io/formats/format.py#L1644-L1656

UnicodeWriter utilizes the csv.writer class behind the scenes. Thus, we are always using Python's csv library to get the job done, meaning our CSV writing is vulnerable to any issues in Python's CSV module for the time being.

gfyoung commented 6 years ago

I suppose we could hack our away around this by checking for an empty first row before writing it to CSV and replace it with a space for example, though again as I said, very hackish IMO.

jreback commented 6 years ago

try passing dialect=None when we pass to the csvwriter

Licht-T commented 6 years ago

@jreback I already tried, but the result is same.

import csv
fp = open('test.csv', 'w')
w = csv.writer(fp, dialect=None)
w.writerow([''])
w.writerow(['1'])
fp.close()
""
1
Licht-T commented 6 years ago

The only parameter that makes some impact is quoting.

import csv
fp = open('test.csv', 'w')
w = csv.writer(fp, dialect=None, quoting=csv.QUOTE_NONE)
w.writerow([''])
w.writerow(['1'])
fp.close()
---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
<ipython-input-56-97051671206d> in <module>()
      2 fp = open('test.csv', 'w')
      3 w = csv.writer(fp, dialect=csv.excel,quoting=csv.QUOTE_NONE)
----> 4 w.writerow([''])
      5 w.writerow(['1'])
      6 fp.close()

Error: single empty field record must be quoted
Licht-T commented 6 years ago

Well..., this csv library behavior is very inconsistent and should be fixed in CPython. Or, is there any reason to do such behavior...? https://github.com/python/cpython/blame/0b3ec192259a65971001ce8f0de85a9c1e71d9c7/Modules/_csv.c#L1244

jreback commented 6 years ago

@Licht-T certainly can file a bug report there.

ok I guess no easy way to fix this here, however, maybe we should add a small note in the code about this?

jackasser commented 6 years ago

@jreback @Licht-T @gfyoung thank you to considering about it! I understand that this problem should solve in Python's CSV library. But I think we should add a small note in the code.

Licht-T commented 6 years ago

@jreback Created the issue on CPython. https://bugs.python.org/issue32255

I'll add the small note on pandas. Where should I add the note?

jackasser commented 6 years ago

@Licht-T thank you for creating issue!

https://github.com/pandas-dev/pandas/blob/ba3a442556e3c04e316f790335427e6d983db01d/pandas/core/frame.py#L1522 ↑ I think here. ↓ and this page http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html?highlight=to_csv#pandas.DataFrame.to_csv

Licht-T commented 6 years ago

@jackasser @jreback @gfyoung Actually, the double quoted blank field is the default spec. when writing single column CSV. IOW, this issue is the correct behavior, but the "Case 1" in https://github.com/pandas-dev/pandas/issues/18676#issuecomment-350205767 was wrong behavior. This is now fixed in CPython and the patch is backported to CPython 3.6. https://github.com/python/cpython/pull/4769

Please note that this bug does not exist in CPython 2.7.

jreback commented 6 years ago

@Licht-T ok can you add a test for >= 3.6 only, and xfail it for now (as not sure which release its on, though maybe its actually out?).

Licht-T commented 6 years ago

@jreback Okay! (That fix is not released yet, will be included in the next release of CPython 3.6.)