pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

to_csv writes wrong utf-16 #10755

Open jehuelsm opened 9 years ago

jehuelsm commented 9 years ago

Writing a DataFrame with utf-16 encoding adds garbage characters to the file:

import codecs
import pandas as pd

#This works
enc = 'utf-8'

print '\n\n',enc,'\n\n'

d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print df,'\n\n'

df.to_csv('foo.csv', encoding = enc)

df = pd.read_csv('foo.csv', encoding = enc)
print df,'\n\n'

with codecs.open('foo.csv', encoding=enc, mode='r') as f:
    for line in f:
        print line

# This does not work
enc = 'utf-16'
print '\n\n',enc,'\n\n'

d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print df,'\n\n'

df.to_csv('foo.csv', encoding = enc)

df = pd.read_csv('foo.csv', encoding = enc)
print df,'\n\n'

#prints
#  Unnamed: 0  one  two???????????
#  0          b    2  2.0???????????
#  1          d  NaN            4.0? 

with codecs.open('foo.csv', encoding=enc, mode='r') as f:
    for line in f:
        print line
#prints
#,one,two਍愀Ⰰ㄀⸀ Ⰰ㄀⸀ ഀ
#b,2.0,2.0਍挀Ⰰ㌀⸀ Ⰰ㌀⸀ ഀ
#d,,4.0਍

Versions:

pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: de_DE

pandas: 0.16.1
nose: 1.3.4
Cython: 0.21.1
numpy: 1.8.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.3
pytz: 2014.10
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.4.2
openpyxl: 2.1.3
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.4.1
bs4: 4.3.2
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: None

jreback commented 9 years ago

This prints fine on Mac and on 64-bit Windows. to_csv is just using the standard Python UnicodeWriter here, so try writing the file directly, without pandas, and see what happens.
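
For anyone who wants to try this, here is a minimal sketch of such a direct write and read-back without pandas. It is written for Python 3, where the csv module works on text streams, so it is an adaptation rather than the Python 2.7 setup from the report; the helper names `write_utf16_csv` and `read_utf16_csv` and the temporary file path are made up for illustration.

```python
import csv
import os
import tempfile

# Same shape as the first rows of the DataFrame in the report, as plain strings.
rows = [["", "one", "two"], ["a", "1.0", "1.0"], ["b", "2.0", "2.0"]]


def write_utf16_csv(path, rows):
    # Hypothetical helper: newline="" leaves line endings to the csv module,
    # and the text layer encodes to UTF-16 only after the text is complete,
    # so no newline translation can happen on the encoded bytes.
    with open(path, "w", encoding="utf-16", newline="") as f:
        csv.writer(f).writerows(rows)


def read_utf16_csv(path):
    # Hypothetical helper: read the file back with the same encoding.
    with open(path, "r", encoding="utf-16", newline="") as f:
        return list(csv.reader(f))


path = os.path.join(tempfile.mkdtemp(), "foo.csv")
write_utf16_csv(path, rows)
round_tripped = read_utf16_csv(path)

# Inspect the raw bytes as well: valid UTF-16 here has an even byte count.
with open(path, "rb") as f:
    raw = f.read()

print(round_tripped == rows, len(raw) % 2 == 0)
```

If the round trip succeeds and the byte count is even, the platform's codec is fine and the problem is more likely in how the file is opened during the write.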

jreback commented 9 years ago

Keep in mind that Windows is a bit odd around Unicode anyhow; see e.g. http://stackoverflow.com/questions/13095499/unicode-in-python-just-utf-16
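
One explanation that is consistent with the garbled output in the report, though not confirmed anywhere in this thread, is a byte-level newline translation: if the already-encoded UTF-16 bytes pass through a Windows text-mode file, each 0x0A byte becomes 0x0D 0x0A, which shifts every following two-byte code unit by one byte. The sketch below only simulates that translation with `bytes.replace`, so it runs on any platform; the sample text and variable names are assumptions for illustration.

```python
# Simulate what a Windows text-mode write would do to UTF-16 encoded bytes.
# This is an assumed mechanism, not a confirmed diagnosis of the report.
text = u",one,two\na,1.0,1.0\nb,2.0,2.0\n"

# Little-endian UTF-16 without a BOM, to keep the byte layout easy to follow.
clean = text.encode("utf-16-le")

# Text-mode translation works on bytes: every 0x0A turns into 0x0D 0x0A.
translated = clean.replace(b"\n", b"\r\n")

# Decoding the translated bytes misaligns everything after the first newline.
# errors="replace" is needed because the byte count may now be odd.
garbled = translated.decode("utf-16-le", errors="replace")

print(repr(clean.decode("utf-16-le")))
print(repr(garbled))
```

In this simulation the header line is followed by U+0A0D and then by the characters U+6100 and U+2C00, which are the same characters that appear after `,one,two` in the reported output. The second inserted byte shifts the stream back into alignment, which would also explain why only every other line in the report is garbled.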