infolab-csail / defexpand

Generalizes definitions using DBpedia ontology and WordNet
2 stars 0 forks source link

proper handling of unicode #6

Closed michaelsilver closed 9 years ago

michaelsilver commented 9 years ago

Originally, classes_above_infobox() would return a unicode object for the infobox's wiki_class, and then the rest of the "classes_above" as regular strings. Ex:

>>> from defexpand import infoclass
>>> ontology = infoclass.get_info_ontology()
>>> ontology.classes_above_infobox('officeholder')
[u'OfficeHolder', 'Person', 'Agent', 'owl:Thing']

I though this was annoying and could interfere with subsequent analysis, so I made the change to unidecode wiki_class in commit 6bbcdedf3c54cf330faa592412daf63bfadd702e (which, I apologize, I did not make a PR for).

Now, I'm being careful and polite, so I made a PR. (Did not make an issue because this is sort of a patch for a would-be-PR for a would-be-issue.) Sometimes, wiki_class is not a unicode object, such as when a non-existant infobox is used (e.g. 'officeholder' returns unicode, but 'junk' returns string). So, I get this warning:

/scratch/msilver/.virtualenvs/wikimap/local/lib/python2.7/site-packages/defexpand-0.1.1-py2.7.egg/defexpand/infoclass.py:96: RuntimeWarning: Argument <type 'str'> is not an unicode object. Passing an encoded string will likely have unexpected results.
  wiki_class = unidecode(wiki_class)

Thus, I propose to check whether wiki_class is unicode before decoding it, getting rid of this warning and any "unexpected results" that accompany it.

alvaromorales commented 9 years ago

Since only what's loaded from infoclasses.json is unicode, a better solution would be to transform the contents of that file into a str. Replace infoclass.py line 7 line with:

# load infoclasses and convert from unicode to str
infoclass_pairs = json.load(open(config.DATA_DIRECTORY + 'infoclasses.json', 'r'))
infoclass_pairs = [[s.encode('utf-8') if s else None for s in l]
                   for l in infoclass_pairs]

(I used encode instead of unidecode because it's faster and I don't think there are any actual unicode characters in the JSON file).

Thus, you won't have to check the type of the class downstream or call unidecode.

In general, you should avoid isinstance (see http://www.quora.com/When-is-it-acceptable-to-use-isinstance-in-Python). But unicode vs byte strings are messy, so there may be exceptions.

michaelsilver commented 9 years ago

Ok, took your suggestion, tested it and it works. Ready for merge?

alvaromorales commented 9 years ago

Looks good, go ahead and merge