pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Unicode repr failure in DataFrame #795

Closed wesm closed 12 years ago

wesm commented 12 years ago
In [9]: df = read_clipboard(header=None, sep='\s+')

In [10]: df
Out[10]: ---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/Users/wesm/<ipython-input-10-7ed0097d7e9e> in <module>()
----> 1 df

/Users/wesm/code/repos/ipython/IPython/core/displayhook.pyc in __call__(self, result)
    236             self.start_displayhook()
    237             self.write_output_prompt()
--> 238             format_dict = self.compute_format_data(result)
    239             self.write_format_data(format_dict)
    240             self.update_user_ns(result)

/Users/wesm/code/repos/ipython/IPython/core/displayhook.pyc in compute_format_data(self, result)
    148             MIME type representation of the object.
    149         """
--> 150         return self.shell.display_formatter.format(result)
    151 
    152     def write_format_data(self, format_dict):

/Users/wesm/code/repos/ipython/IPython/core/formatters.pyc in format(self, obj, include, exclude)
    124                     continue
    125             try:
--> 126                 data = formatter(obj)
    127             except:
    128                 # FIXME: log the exception

/Users/wesm/code/repos/ipython/IPython/core/formatters.pyc in __call__(self, obj)
    445                 type_pprinters=self.type_printers,
    446                 deferred_pprinters=self.deferred_printers)
--> 447             printer.pretty(obj)
    448             printer.flush()
    449             return stream.getvalue()

/Users/wesm/code/repos/ipython/IPython/lib/pretty.pyc in pretty(self, obj)
    349             if hasattr(obj_class, '_repr_pretty_'):
    350                 return obj_class._repr_pretty_(obj, self, cycle)
--> 351             return _default_pprint(obj, self, cycle)
    352         finally:
    353             self.end_group()

/Users/wesm/code/repos/ipython/IPython/lib/pretty.pyc in _default_pprint(obj, p, cycle)
    469     if getattr(klass, '__repr__', None) not in _baseclass_reprs:
    470         # A user-provided repr.
--> 471         p.text(repr(obj))
    472         return
    473     p.begin_group(1, '<')

/Users/wesm/code/pandas/pandas/core/frame.pyc in __repr__(self)
    458                 self.info(buf=buf, verbose=self._verbose_info)
    459             else:
--> 460                 self.to_string(buf=buf)
    461                 value = buf.getvalue()
    462                 if max([len(l) for l in value.split('\n')]) > terminal_width:

/Users/wesm/code/pandas/pandas/core/frame.pyc in to_string(self, buf, columns, col_space, colSpace, header, index, na_rep, formatters, float_format, sparsify, nanRep, index_names, justify, force_unicode)
   1038                                            index_names=index_names,
   1039                                            header=header, index=index)
-> 1040         formatter.to_string(force_unicode=force_unicode)
   1041 
   1042         if buf is None:

/Users/wesm/code/pandas/pandas/core/format.pyc in to_string(self, force_unicode)
    193 
    194             if self.index:
--> 195                 to_write.append(adjoin(1, str_index, *stringified))
    196             else:
    197                 to_write.append(adjoin(1, *stringified))

/Users/wesm/code/pandas/pandas/core/common.pyc in adjoin(space, *lists)
    398     toJoin = zip(*newLists)
    399     for lines in toJoin:
--> 400         outLines.append(''.join(lines))
    401     return '\n'.join(outLines)
    402 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

Here is the value of `lines` when the join fails:

('0  ', u'                        .gitignore ', u'     5 ', ' \xe2\x80\xa2\xe2\x80\xa2\xe2\x80\xa2\xe2\x80\xa2\xe2\x80\xa2')
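A minimal sketch of the failure for readers on Python 3: in Python 2, joining `str` and `unicode` pieces implicitly decodes the byte strings as ASCII, and the last element of `lines` holds UTF-8 bullet characters. The explicit `decode` call below stands in for that implicit step; the helper name `decode_error_message` is hypothetical and not part of pandas.

```python
# Last element of `lines` from the traceback, written as Python 3 bytes.
cell = b" \xe2\x80\xa2\xe2\x80\xa2\xe2\x80\xa2\xe2\x80\xa2\xe2\x80\xa2"


def decode_error_message(raw, codec):
    """Return the UnicodeDecodeError message for `raw` under `codec`, or None."""
    try:
        raw.decode(codec)
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# Python 2 performed this ASCII decode implicitly inside ''.join(lines).
ascii_error = decode_error_message(cell, "ascii")

# Decoding with the codec the bytes were actually written in succeeds.
as_utf8 = cell.decode("utf-8")

print(ascii_error)
print(repr(as_utf8))
```

The first failing byte is at position 1 because the cell starts with a space, which matches the position reported in the traceback.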
hammer commented 12 years ago

I'm hitting this issue right now with the MovieLens 100k dataset, which uses the iso8859_2 encoding (as inferred by chardet).

My issue: when I call df.to_string, I can pass "force_unicode=True". However, I am not sure how to set force_unicode=True for all calls to to_string, e.g. any time repr() is called on df, which occurs when printing the df to the shell.

Gnarly issue. Character encodings in Python are never fun.

takluyver commented 12 years ago

@hammer : Is this with the current development version of pandas? Can you post the traceback somewhere?

wesm commented 12 years ago

@hammer I'm able to reproduce the issue on the movielens data. If you pass encoding='iso8859_2' when you use read_csv everything works fine.

In [8]: df = read_csv('/Users/wesm/code/pandas/pandas/tests/unicode_series.csv', header=None, encoding='iso8859_2')

In [9]: df
Out[9]: 
     X.1                                                 X.2
0   1582                Invitation, The (Zaproszenie) (1986)
1   1583                      Symphonie pastorale, La (1946)
2   1584                               American Dream (1990)
3   1585                               Lashou shentan (1992)
4   1586                       Terror in a Texas Town (1958)
5   1587                                Salut cousin! (1996)
6   1588                                  Schizopolis (1996)
7   1589                              To Have, or Not (1995)
8   1590                               Duoluo tianshi (1995)
9   1591                              Magic Hour, The (1998)
10  1592                           Death in Brunswick (1991)
11  1593                                      Everest (1998)
12  1594                                     Shopping (1994)
13  1595                            Nemesis 2: Nebula (1995)
14  1596                               Romper Stomper (1992)
15  1597                             City of Industry (1997)
16  1598                       Someone Else's America (1995)
17  1599                                 Guantanamera (1994)
18  1600                                Office Killer (1997)
19  1601                        Price Above Rubies, A (1998)
20  1602                                       Angela (1995)
21  1603                           He Walked by Night (1948)
22  1604                                Love Serenade (1996)
23  1605                                     Deceiver (1997)
24  1606                            Hurricane Streets (1998)
25  1607                                        Buddy (1997)
26  1608                                      B*A*P*S (1997)
27  1609                  Truth or Consequences, N.M. (1997)
28  1610                           Intimate Relations (1996)
29  1611                             Leading Man, The (1996)
30  1612                                   Tokyo Fist (1995)
31  1613                     Reluctant Debutante, The (1958)
32  1614                           Warriors of Virtue (1997)
33  1615                                 Desert Winds (1995)
34  1616                                    Hugo Pool (1997)
35  1617                             King of New York (1990)
36  1618                              All Things Fair (1996)
37  1619                               Sixth Man, The (1997)
38  1620                               Butterfly Kiss (1995)
39  1621                                Paris, France (1993)
40  1622                                Cérémonie, La (1995)
41  1623                                         Hush (1998)
42  1624                                   Nightwatch (1997)
43  1625          Nobody Loves Me (Keiner liebt mich) (1994)
44  1626                                    Wife, The (1995)
45  1627                                     Lamerica (1994)
46  1628                                    Nico Icon (1995)
47  1629  Silence of the Palace, The (Saimt el Qusur) (1994)
48  1630                               Slingshot, The (1993)
49  1631         Land and Freedom (Tierra y libertad) (1995)
50  1632                  Á köldum klaka (Cold Fever) (1994)
51  1633     Etz Hadomim Tafus (Under the Domin Tree) (1994)
52  1634                                 Two Friends (1986) 
53  1635                          Brothers in Trouble (1995)
54  1636                                   Girls Town (1996)
55  1637                                  Normal Life (1996)
56  1638                 Bitter Sugar (Azucar Amargo) (1996)
57  1639                              Eighth Day, The (1996)
58  1640                                     Dadetown (1995)
59  1641                            Some Mother's Son (1996)
60  1642                                   Angel Baby (1995)
61  1643                             Sudden Manhattan (1996)
62  1644                             Butcher Boy, The (1998)
63  1645                                Men With Guns (1997)
64  1646                                      Hana-bi (1997)
65  1647                             Niagara, Niagara (1997)
66  1648                                 Big One, The (1997)
67  1649                             Butcher Boy, The (1998)
68  1650                        Spanish Prisoner, The (1997)
69  1651                    Temptress Moon (Feng Yue) (1996)
70  1652   Entertaining Angels: The Dorothy Day Story (1996)
71  1653                        Chairman of the Board (1998)
72  1654                                   Favor, The (1994)
73  1655                                  Little City (1998)
74  1656                                       Target (1995)
75  1657                       Substance of Fire, The (1996)
76  1658                     Getting Away With Murder (1996)
77  1659                                  Small Faces (1995)
78  1660                                 New Age, The (1994)
79  1661                                  Rough Magic (1995)
80  1662                             Nothing Personal (1995)
81  1663                      8 Heads in a Duffel Bag (1997)
82  1664                            Brother's Kiss, A (1997)
83  1665                                         Ripe (1996)
84  1666                               Next Step, The (1995)
85  1667                           Wedding Bell Blues (1996)
86  1668                            MURDER and murder (1996)
87  1669                                      Tainted (1998)
88  1670                           Further Gesture, A (1996)
89  1671                                         Kika (1993)
90  1672                                       Mirage (1995)
91  1673                                   Mamma Roma (1962)
92  1674                               Sunchaser, The (1996)
93  1675                             War at Home, The (1996)
94  1676                                Sweet Nothing (1995)
95  1677                                   Mat' i syn (1997)
96  1678                                    B. Monkey (1998)
97  1679                                Sliding Doors (1998)
98  1680                                 You So Crazy (1994)
99  1681           Scream of Stone (Schrei aus Stein) (1991)

I'll see about doing this automatically with chardet, or finding some way to modify the repr code so it does not blow up with a UnicodeError.

wesm commented 12 years ago

OK @hammer, I think I have this sorted out. If you don't specify the encoding, it will no longer blow up:

In [3]: df = read_csv('pandas/tests/unicode_series.csv', header=None)
In [4]: df
Out[4]: 
     X.1                                                 X.2
0   1617                             King of New York (1990)
1   1618                              All Things Fair (1996)
2   1619                               Sixth Man, The (1997)
3   1620                               Butterfly Kiss (1995)
4   1621                                Paris, France (1993)
5   1622                                C?r?monie, La (1995)
6   1623                                         Hush (1998)
7   1624                                   Nightwatch (1997)
8   1625          Nobody Loves Me (Keiner liebt mich) (1994)
9   1626                                    Wife, The (1995)
10  1627                                     Lamerica (1994)
11  1628                                    Nico Icon (1995)
12  1629  Silence of the Palace, The (Saimt el Qusur) (1994)
13  1630                               Slingshot, The (1993)
14  1631         Land and Freedom (Tierra y libertad) (1995)
15  1632                  ? k?ldum klaka (Cold Fever) (1994)
16  1633     Etz Hadomim Tafus (Under the Domin Tree) (1994)
17  1634                                  Two Friends (1986)

But if you do specify it, the Unicode renders correctly in the console:

In [5]: df = read_csv('pandas/tests/unicode_series.csv', header=None, encoding='iso-8859-2')
In [6]: df
Out[6]: 
     X.1                                                 X.2
0   1617                             King of New York (1990)
1   1618                              All Things Fair (1996)
2   1619                               Sixth Man, The (1997)
3   1620                               Butterfly Kiss (1995)
4   1621                                Paris, France (1993)
5   1622                                Cérémonie, La (1995)
6   1623                                         Hush (1998)
7   1624                                   Nightwatch (1997)
8   1625          Nobody Loves Me (Keiner liebt mich) (1994)
9   1626                                    Wife, The (1995)
10  1627                                     Lamerica (1994)
11  1628                                    Nico Icon (1995)
12  1629  Silence of the Palace, The (Saimt el Qusur) (1994)
13  1630                               Slingshot, The (1993)
14  1631         Land and Freedom (Tierra y libertad) (1995)
15  1632                  Á köldum klaka (Cold Fever) (1994)
16  1633     Etz Hadomim Tafus (Under the Domin Tree) (1994)
17  1634                                  Two Friends (1986)

Short of shipping chardet, I don't know if there's a way to automatically infer the encoding.
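One rough alternative to real detection is to try a short list of candidate codecs in order and keep the first that decodes cleanly. This is only a sketch under assumptions: `guess_decode` and its candidate list are hypothetical, not anything pandas does, and a single-byte codec such as `iso8859_2` accepts nearly any byte sequence, so it cannot tell one Latin-family encoding from another.

```python
def guess_decode(raw, candidates=("utf-8", "iso8859_2")):
    """Return (codec, text) for the first candidate codec that decodes `raw`.

    Falls back to (None, lossy ASCII text) if every candidate fails.
    """
    for codec in candidates:
        try:
            return codec, raw.decode(codec)
        except UnicodeDecodeError:
            continue
    # No candidate worked: decode lossily rather than raising.
    return None, raw.decode("ascii", errors="replace")


# Bytes as they would appear in an ISO-8859-2 file: not valid UTF-8.
latin2_bytes = "Á köldum klaka (Cold Fever) (1994)".encode("iso8859_2")
# Bytes that are valid UTF-8.
utf8_bytes = "\u2022 .gitignore".encode("utf-8")

print(guess_decode(latin2_bytes))
print(guess_decode(utf8_bytes))
```

Strict UTF-8 goes first because it rejects most non-UTF-8 input, which makes it a reasonable first filter; the order of the remaining candidates is a guess the caller has to supply.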

wesm commented 12 years ago

However, this broke the Python 3 tests. Leaving the issue open.

takluyver commented 12 years ago

I'll look into it with Python 3.

takluyver commented 12 years ago

I get one failure, when reading the newly added CSV file. pandas.core.common._get_handle (https://github.com/pydata/pandas/blob/master/pandas/core/common.py#L651) returns a text-mode file handle in Python 3, and the default behaviour is to throw errors if the file can't be decoded with the (platform-dependent) default encoding. Specifying errors="replace" gets the same behaviour as in Python 2 here (unknown characters replaced with �), but I've not looked at where else _get_handle is used.
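A small sketch of the difference, using only the built-in `open` rather than `_get_handle` itself. The encoding is pinned to `"ascii"` here as an assumption, so the result does not depend on the platform default, and the sample row is made up from the titles above.

```python
import os
import tempfile

# One row as it would be stored in an ISO-8859-2 encoded CSV file.
raw = "1622,Cérémonie, La (1995)\n".encode("iso8859_2")

with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Default error handling is strict: reading raises on the first bad byte.
    try:
        with open(path, encoding="ascii") as fh:
            fh.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # errors="replace" substitutes U+FFFD for each undecodable byte instead.
    with open(path, encoding="ascii", errors="replace") as fh:
        replaced = fh.read()
finally:
    os.remove(path)

print(strict_failed)
print(replaced)
```

The replaced text keeps every decodable character and loses only the two accented letters, so it is readable but not round-trippable.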