Closed by wesm 12 years ago
I'm hitting this issue right now with the MovieLens 100k dataset, which uses the iso8859_2 encoding (as inferred by chardet).
My issue: when I call df.to_string, I can pass "force_unicode=True". However, I am not sure how to set force_unicode=True for all calls to to_string, e.g. any time repr() is called on df, which occurs when printing the df to the shell.
Gnarly issue. Character encodings in Python are never fun.
@hammer: Is this with the current development version of pandas? Can you post the traceback somewhere?
@hammer I'm able to reproduce the issue on the movielens data. If you pass encoding='iso8859_2' when you use read_csv, everything works fine.
In [8]: df = read_csv('/Users/wesm/code/pandas/pandas/tests/unicode_series.csv', header=None, encoding='iso8859_2')
In [9]: df
Out[9]:
X.1 X.2
0 1582 Invitation, The (Zaproszenie) (1986)
1 1583 Symphonie pastorale, La (1946)
2 1584 American Dream (1990)
3 1585 Lashou shentan (1992)
4 1586 Terror in a Texas Town (1958)
5 1587 Salut cousin! (1996)
6 1588 Schizopolis (1996)
7 1589 To Have, or Not (1995)
8 1590 Duoluo tianshi (1995)
9 1591 Magic Hour, The (1998)
10 1592 Death in Brunswick (1991)
11 1593 Everest (1998)
12 1594 Shopping (1994)
13 1595 Nemesis 2: Nebula (1995)
14 1596 Romper Stomper (1992)
15 1597 City of Industry (1997)
16 1598 Someone Else's America (1995)
17 1599 Guantanamera (1994)
18 1600 Office Killer (1997)
19 1601 Price Above Rubies, A (1998)
20 1602 Angela (1995)
21 1603 He Walked by Night (1948)
22 1604 Love Serenade (1996)
23 1605 Deceiver (1997)
24 1606 Hurricane Streets (1998)
25 1607 Buddy (1997)
26 1608 B*A*P*S (1997)
27 1609 Truth or Consequences, N.M. (1997)
28 1610 Intimate Relations (1996)
29 1611 Leading Man, The (1996)
30 1612 Tokyo Fist (1995)
31 1613 Reluctant Debutante, The (1958)
32 1614 Warriors of Virtue (1997)
33 1615 Desert Winds (1995)
34 1616 Hugo Pool (1997)
35 1617 King of New York (1990)
36 1618 All Things Fair (1996)
37 1619 Sixth Man, The (1997)
38 1620 Butterfly Kiss (1995)
39 1621 Paris, France (1993)
40 1622 Cérémonie, La (1995)
41 1623 Hush (1998)
42 1624 Nightwatch (1997)
43 1625 Nobody Loves Me (Keiner liebt mich) (1994)
44 1626 Wife, The (1995)
45 1627 Lamerica (1994)
46 1628 Nico Icon (1995)
47 1629 Silence of the Palace, The (Saimt el Qusur) (1994)
48 1630 Slingshot, The (1993)
49 1631 Land and Freedom (Tierra y libertad) (1995)
50 1632 Á köldum klaka (Cold Fever) (1994)
51 1633 Etz Hadomim Tafus (Under the Domin Tree) (1994)
52 1634 Two Friends (1986)
53 1635 Brothers in Trouble (1995)
54 1636 Girls Town (1996)
55 1637 Normal Life (1996)
56 1638 Bitter Sugar (Azucar Amargo) (1996)
57 1639 Eighth Day, The (1996)
58 1640 Dadetown (1995)
59 1641 Some Mother's Son (1996)
60 1642 Angel Baby (1995)
61 1643 Sudden Manhattan (1996)
62 1644 Butcher Boy, The (1998)
63 1645 Men With Guns (1997)
64 1646 Hana-bi (1997)
65 1647 Niagara, Niagara (1997)
66 1648 Big One, The (1997)
67 1649 Butcher Boy, The (1998)
68 1650 Spanish Prisoner, The (1997)
69 1651 Temptress Moon (Feng Yue) (1996)
70 1652 Entertaining Angels: The Dorothy Day Story (1996)
71 1653 Chairman of the Board (1998)
72 1654 Favor, The (1994)
73 1655 Little City (1998)
74 1656 Target (1995)
75 1657 Substance of Fire, The (1996)
76 1658 Getting Away With Murder (1996)
77 1659 Small Faces (1995)
78 1660 New Age, The (1994)
79 1661 Rough Magic (1995)
80 1662 Nothing Personal (1995)
81 1663 8 Heads in a Duffel Bag (1997)
82 1664 Brother's Kiss, A (1997)
83 1665 Ripe (1996)
84 1666 Next Step, The (1995)
85 1667 Wedding Bell Blues (1996)
86 1668 MURDER and murder (1996)
87 1669 Tainted (1998)
88 1670 Further Gesture, A (1996)
89 1671 Kika (1993)
90 1672 Mirage (1995)
91 1673 Mamma Roma (1962)
92 1674 Sunchaser, The (1996)
93 1675 War at Home, The (1996)
94 1676 Sweet Nothing (1995)
95 1677 Mat' i syn (1997)
96 1678 B. Monkey (1998)
97 1679 Sliding Doors (1998)
98 1680 You So Crazy (1994)
99 1681 Scream of Stone (Schrei aus Stein) (1991)
I'll see about doing this automatically with chardet, or finding some way to modify the repr code so it does not blow up with a UnicodeError.
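For illustration, one way a repr path can avoid raising on characters the console cannot show is to re-encode with errors="replace". This is only a sketch in Python 3 syntax, not necessarily what pandas does; the helper name safe_console_text and its console_encoding parameter are hypothetical.

```python
def safe_console_text(text, console_encoding="ascii"):
    """Return text with characters the console encoding lacks replaced by '?'."""
    # errors="replace" substitutes '?' on encode instead of raising UnicodeEncodeError.
    return text.encode(console_encoding, errors="replace").decode(console_encoding)


print(safe_console_text("Cérémonie, La (1995)"))  # C?r?monie, La (1995)
```

The trade-off is that the output is lossy: the original characters cannot be recovered from the '?' placeholders.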
OK @hammer, I think I have this sorted out. If you don't specify the encoding, it will not blow up anymore:
In [3]: df = read_csv('pandas/tests/unicode_series.csv', header=None)
In [4]: df
Out[4]:
X.1 X.2
0 1617 King of New York (1990)
1 1618 All Things Fair (1996)
2 1619 Sixth Man, The (1997)
3 1620 Butterfly Kiss (1995)
4 1621 Paris, France (1993)
5 1622 C?r?monie, La (1995)
6 1623 Hush (1998)
7 1624 Nightwatch (1997)
8 1625 Nobody Loves Me (Keiner liebt mich) (1994)
9 1626 Wife, The (1995)
10 1627 Lamerica (1994)
11 1628 Nico Icon (1995)
12 1629 Silence of the Palace, The (Saimt el Qusur) (1994)
13 1630 Slingshot, The (1993)
14 1631 Land and Freedom (Tierra y libertad) (1995)
15 1632 ? k?ldum klaka (Cold Fever) (1994)
16 1633 Etz Hadomim Tafus (Under the Domin Tree) (1994)
17 1634 Two Friends (1986)
but if you do, it will render the Unicode correctly in the console.
In [5]: df = read_csv('pandas/tests/unicode_series.csv', header=None, encoding='iso-8859-2')
In [6]: df
Out[6]:
X.1 X.2
0 1617 King of New York (1990)
1 1618 All Things Fair (1996)
2 1619 Sixth Man, The (1997)
3 1620 Butterfly Kiss (1995)
4 1621 Paris, France (1993)
5 1622 Cérémonie, La (1995)
6 1623 Hush (1998)
7 1624 Nightwatch (1997)
8 1625 Nobody Loves Me (Keiner liebt mich) (1994)
9 1626 Wife, The (1995)
10 1627 Lamerica (1994)
11 1628 Nico Icon (1995)
12 1629 Silence of the Palace, The (Saimt el Qusur) (1994)
13 1630 Slingshot, The (1993)
14 1631 Land and Freedom (Tierra y libertad) (1995)
15 1632 Á köldum klaka (Cold Fever) (1994)
16 1633 Etz Hadomim Tafus (Under the Domin Tree) (1994)
17 1634 Two Friends (1986)
Short of shipping chardet, I don't know if there's a way to automatically infer the encoding.
However, this broke the Python 3 tests, so I'm leaving the issue open.
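As a rough illustration of why inference is hard without something like chardet, here is a stdlib-only fallback sketch. It is not real detection and not pandas code: decode_with_fallback and its candidates list are hypothetical, and because a single-byte codec such as iso8859_2 accepts any byte sequence, the result depends entirely on the order of the candidates.

```python
def decode_with_fallback(data, candidates=("utf-8", "iso8859_2")):
    """Return (text, encoding) for the first candidate codec that decodes cleanly."""
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # No candidate worked: decode lossily rather than raise.
    return data.decode("ascii", errors="replace"), None


# A title from the thread, encoded the way the MovieLens file is.
title = "Á köldum klaka (Cold Fever) (1994)".encode("iso8859_2")
text, used = decode_with_fallback(title)
print(used)  # iso8859_2
```

Strict UTF-8 goes first because it rejects most non-UTF-8 input; the single-byte fallback is then a guess that the caller still has to supply.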
I'll look into it with Python 3.
I get one failure, with reading the newly added CSV file. pandas.core.common._get_handle (https://github.com/pydata/pandas/blob/master/pandas/core/common.py#L651) returns a text-mode file handle in Python 3, and the default behaviour is to throw errors if the file can't be decoded with the (platform dependent) default encoding. Specifying errors="replace" gets the same behaviour as in Python 2 here (unknown characters replaced with �), but I've not looked at where else _get_handle is used.
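The Python 3 behaviour described above can be reproduced with the built-in open() alone. The sketch below does not touch _get_handle; it writes one sample row from the thread to a temporary file and uses "ascii" as a stand-in for a default encoding that cannot decode the file (an assumption, since the real default is platform dependent).

```python
import os
import tempfile

# One row from the thread, encoded the way the CSV file is.
raw = "1622,Cérémonie, La (1995)\n".encode("iso8859_2")

with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Wrong codec with errors="replace": each undecodable byte becomes U+FFFD.
    with open(path, encoding="ascii", errors="replace") as fh:
        replaced = fh.read()

    # Correct codec: the text comes back intact.
    with open(path, encoding="iso8859_2") as fh:
        decoded = fh.read()
finally:
    os.remove(path)

print(ascii(replaced))
print(ascii(decoded))
```

Without errors="replace", the first open() call would raise UnicodeDecodeError on read, which matches the test failure reported here.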
here are lines