Closed michaelsilver closed 9 years ago
Since only what's loaded from infoclasses.json
is unicode, a better solution would be to transform the contents of that file into a str
. Replace infoclass.py line 7 line with:
# load infoclasses and convert from unicode to str
infoclass_pairs = json.load(open(config.DATA_DIRECTORY + 'infoclasses.json', 'r'))
infoclass_pairs = [[s.encode('utf-8') if s else None for s in l]
for l in infoclass_pairs]
(I used encode
instead of unidecode
because it's faster and I don't think there are any actual unicode characters in the JSON file).
Thus, you won't have to check the type of the class downstream or call unidecode
.
In general, you should avoid isinstance
(see http://www.quora.com/When-is-it-acceptable-to-use-isinstance-in-Python). But unicode vs byte strings are messy, so there may be exceptions.
Ok, took your suggestion, tested it and it works. Ready for merge?
Looks good, go ahead and merge
Originally,
classes_above_infobox()
would return a unicode object for the infobox's wiki_class, and then the rest of the "classes_above" as regular strings. Ex:I though this was annoying and could interfere with subsequent analysis, so I made the change to
unidecode
wiki_class in commit 6bbcdedf3c54cf330faa592412daf63bfadd702e (which, I apologize, I did not make a PR for).Now, I'm being careful and polite, so I made a PR. (Did not make an issue because this is sort of a patch for a would-be-PR for a would-be-issue.) Sometimes, wiki_class is not a unicode object, such as when a non-existant infobox is used (e.g. 'officeholder' returns unicode, but 'junk' returns string). So, I get this warning:
Thus, I propose to check whether
wiki_class
is unicode before decoding it, getting rid of this warning and any "unexpected results" that accompany it.