mikeliu763 / pyector

Automatically exported from code.google.com/p/pyector

Unicode character in words not in tokens #8

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
When calling Entry.getTokens() on "Comment ça va?", one gets only
["Comment","a","va","?"]. The trouble is "a" instead of "ça".

The test EntryTest.testTokensUnicode() in EntryTest.py fails.

getTokens() uses regular expressions, and the one in question is reWORDS:

reWORDS    = re.compile(r'\b\w+\b', re.LOCALE)

which works well in the Python shell IDLE (on Windows).

Maybe this is a matter of the encoding of sys.stdin?

I don't know how to set it to UTF-8!
Even with re.UNICODE on every regular expression, it does not work. :/
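
For illustration, here is a minimal sketch of the symptom. It assumes Python 3, where text and bytes are separate types (the project itself targets Python 2, where a plain str is a byte string), and it assumes the entry arrives Latin-1 encoded, which is only a guess about the console. The names RE_WORDS_TEXT and RE_WORDS_BYTES are made up for this example.

```python
import re

# Same pattern as reWORDS, compiled once for text and once for raw bytes.
RE_WORDS_TEXT = re.compile(r'\b\w+\b', re.UNICODE)
RE_WORDS_BYTES = re.compile(br'\b\w+\b')

text = u"Comment \u00e7a va?"   # "Comment ça va?" as decoded text
raw = text.encode("latin-1")    # the same entry as undecoded bytes (assumed Latin-1)

text_tokens = RE_WORDS_TEXT.findall(text)
byte_tokens = RE_WORDS_BYTES.findall(raw)

print(text_tokens)
print(byte_tokens)
```

On the decoded text the word keeps its "ç". On the raw bytes only ASCII letters count as word characters, so the byte for "ç" is treated as a separator and a lone "a" remains, much like the result reported above.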

Original issue reported on code.google.com by francois.parmentier@gmail.com on 10 Nov 2008 at 10:53

GoogleCodeExporter commented 8 years ago

Original comment by francois.parmentier@gmail.com on 10 Nov 2008 at 10:54

GoogleCodeExporter commented 8 years ago
This page (in French, sorry) sheds light on encoding:
http://pythonfacile.free.fr/python/unicode.html

Original comment by francois.parmentier@gmail.com on 11 Nov 2008 at 12:04

GoogleCodeExporter commented 8 years ago
Perhaps we have to use Python's -S option to prevent the import of site and
sitecustomize, which remove sys.setdefaultencoding.

Maybe get the default encoding from the locale:
encoding = locale.getdefaultlocale()[1]

and then set it as the default encoding:
sys.setdefaultencoding(encoding)
which would be better for unicode.

OR systematically use: entry = unicode(entry, encoding)
where encoding is given by the default locale?
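
The second option (decode every entry explicitly) could look roughly like the sketch below. It is written for Python 3, so it swaps in bytes.decode() for unicode(entry, encoding) and locale.getpreferredencoding() for locale.getdefaultlocale()[1]. The helper name to_text is hypothetical and not part of pyector.

```python
import locale


def to_text(entry, encoding=None):
    """Return entry as text, decoding byte strings with the given encoding."""
    if encoding is None:
        # Assumption: fall back to the locale's preferred encoding, then UTF-8.
        encoding = locale.getpreferredencoding(False) or "utf-8"
    if isinstance(entry, bytes):
        return entry.decode(encoding)
    return entry  # already text, nothing to decode


decoded = to_text(b"Comment \xe7a va?", encoding="latin-1")
```

Decoding once at the input boundary means the tokenizer only ever sees text, so there is no need to change the interpreter-wide default encoding.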

Original comment by francois.parmentier@gmail.com on 11 Nov 2008 at 12:24

GoogleCodeExporter commented 8 years ago
Maybe that could help: http://www.amk.ca/python/howto/unicode

After decoding the entry to unicode (from the default encoding), the test runs.

BUT @shownodes no longer works:
----8<----
User>Comment ça va?
User>@shownodes
   Comment (   token): 1 (1,0,0)
Traceback (most recent call last):
  File "C:\Users\François\workspace2\pyector\src\Ector.py", line 481, in <module>
    status = main()
  File "C:\Users\François\workspace2\pyector\src\Ector.py", line 440, in main
    ector.cn.showNodes()
  File "C:\Users\François\workspace2\pyector\src\ConceptNetwork.py", line 95, in showNodes
    self.getNode(symbol,type).show()
  File "C:\Users\François\workspace2\pyector\src\Ector.py", line 129, in show
    self.beg
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 8: ordinal not in range(128)
----8<----
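
The error itself can be reproduced in isolation with the sketch below (Python 3 syntax). It reuses the format string that appears in show() elsewhere in this thread, but the symbol, the type and the counters are made-up values, since the real arguments are not shown here.

```python
# Format string as in show(); the values are hypothetical.
line = u"%10s (%8s): %d (%d,%d,%d)" % (u"\u00e7a", u"token", 1, 1, 0, 0)

try:
    line.encode("ascii")   # what an ASCII-only output stream does implicitly
    error = None
except UnicodeEncodeError as exc:
    error = exc

if error is not None:
    # Index of the first character that ASCII cannot represent.
    print(error.start)
```

With these values "%10s" pads "ça" with eight spaces, so the offending character sits at index 8. That would be consistent with the position in the traceback, although the arguments used here are guesses.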

Original comment by francois.parmentier@gmail.com on 11 Nov 2008 at 9:28

GoogleCodeExporter commented 8 years ago
Adding:
----8<----
import sys, locale

ENCODING    = locale.getdefaultlocale()[1]
DEFAULT_ENCODING    = sys.getdefaultencoding()
----8<----
at the beginning of the file, and using:
----8<----
    def show(self):
        """Display the node"""
        print "%10s (%8s): %d (%d,%d,%d)" % (self.getSymbol().encode(ENCODING),
----8<----
wherever print is used fixes the issue (in r117).
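
The same idea, encoding explicitly just before output, is sketched below as a standalone Python 3 example. It is not the code from r117: the helper write_line is hypothetical, and locale.getpreferredencoding() stands in for locale.getdefaultlocale()[1].

```python
import io
import locale

# Assumption: use the locale's preferred encoding, with UTF-8 as a fallback.
ENCODING = locale.getpreferredencoding(False) or "utf-8"


def write_line(text, stream, encoding=ENCODING):
    """Encode text explicitly and write it to a binary stream."""
    # errors="replace" substitutes "?" for characters the encoding lacks,
    # instead of raising UnicodeEncodeError.
    stream.write(text.encode(encoding, errors="replace") + b"\n")


buf = io.BytesIO()
write_line(u"\u00e7a", buf, encoding="latin-1")
```

Keeping the encode step in one helper avoids repeating .encode(ENCODING) at every print, and the "replace" error handler is a design choice that trades exact output for never crashing on a console that cannot show a character.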

Original comment by francois.parmentier@gmail.com on 11 Nov 2008 at 12:32