hunpos fails on non-latin characters on 51 lines

juditacs / semeval

MathLing Budapest Team's repo

MIT License

hunpos fails on non-latin characters on 51 lines #7

Closed (recski closed this issue 9 years ago)

recski commented 9 years ago

@juditacs @zseder FYI hunpos tag must not be called with input that cannot be encoded to latin1, and we have been passing unicode objects. I'm not sure this should be "fixed" in hunpos: that would require extra, possibly third-party, code for transliteration, and even then a transliteration would not always exist, let alone a unique one. So basically I agree with the nltk-hunpos developer that it should be the caller's responsibility to pass latin-1 encodable input to hunpos. In that spirit, I've added an "iconv -f UTF8 -t LATIN1//TRANSLIT" step to our preprocessing pipeline, which took care of all our problems. Comments welcome!
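To see which input lines trigger the problem described above, one option is to try encoding each line to latin-1 and collect the failures. This is only a minimal sketch: the function name `non_latin1_lines` and the sample data are made up for illustration and are not part of the repo's pipeline.

```python
def non_latin1_lines(lines):
    """Return (line_number, line) pairs for lines that cannot be encoded to latin-1."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            line.encode("latin-1")
        except UnicodeEncodeError:
            # At least one character has no latin-1 code point.
            bad.append((number, line))
    return bad


# Hypothetical sample input, not taken from the actual data set.
sample = [
    "plain ascii text",
    "caf\u00e9 is fine in latin-1",               # e-acute exists in latin-1
    "an em\u2014dash is not",                     # U+2014 has no latin-1 code point
    "neither is a curly apostrophe: it\u2019s",   # U+2019 has no latin-1 code point
]

for number, line in non_latin1_lines(sample):
    print(number, repr(line))
```

Running a check like this over the input before tagging would show how many lines are affected and which characters are responsible.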

juditacs commented 9 years ago

I disagree with transliterating everything to latin1 just because Hunpos is old. I think we should only encode the text to latin1 then decode it again upon calling HunposTagger.

recski commented 9 years ago

I don't get it. Do you mean "decoding after calling HunposTagger"? If so, I still don't really see the point: what is there to gain?

(Sent by email in reply to Judit Acs's comment of Thu, Nov 20, 2014 at 4:48 PM, https://github.com/juditacs/semeval/issues/7#issuecomment-63828374)

juditacs commented 9 years ago

I meant just before calling HunposTagger, we encode the strings in latin1 then decode them (the tagger expects unicode if I understand correctly) and feed HunposTagger with the latin1-encodable unicode input. We align the Hunpos output with the original (lossless) input.
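The round trip and alignment proposed above could look roughly like the sketch below. Several things here are assumptions: unencodable characters are replaced with "?" (the comment does not say how they should be handled), the helper names are invented, and `fake_tagger` is only a stand-in for a tagger whose `tag(tokens)` call returns one `(token, tag)` pair per input token, which is how NLTK's `HunposTagger.tag` is expected to behave.

```python
def to_latin1_safe(token):
    """Return a unicode string in which every character is latin-1 encodable."""
    # errors="replace" substitutes "?" for each character latin-1 lacks.
    return token.encode("latin-1", errors="replace").decode("latin-1")


def tag_keeping_original(tokens, tag_fn):
    """Tag latin-1-safe copies of the tokens, then pair tags with the originals."""
    safe_tokens = [to_latin1_safe(t) for t in tokens]
    tagged = tag_fn(safe_tokens)
    if len(tagged) != len(tokens):
        raise ValueError("tagger output cannot be aligned with the input tokens")
    # Alignment is positional: the i-th tag belongs to the i-th original token.
    return [(original, tag) for original, (_, tag) in zip(tokens, tagged)]


def fake_tagger(tokens):
    """Hypothetical stand-in: rejects non-latin-1 input, tags everything 'X'."""
    for t in tokens:
        t.encode("latin-1")  # raises UnicodeEncodeError on unencodable input
    return [(t, "X") for t in tokens]


tokens = ["It\u2019s", "a", "na\u00efve", "test", "\u2014", "really"]
result = tag_keeping_original(tokens, fake_tagger)
print(result)
```

One caveat with this approach: the tagger sees "?" in place of the replaced characters, so the tag for a token such as a lone em dash may differ from what a transliterated "--" would get.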

recski commented 9 years ago

HunposTagger expects input that it can encode to latin1, but I think I know what you mean. Sure, we could keep the unicode data. I just didn't care, since we are processing English input, and the only non-ascii characters in there (em-dashes, apostrophes, an occasional accent; altogether only about 50 lines contain any of these) can be transliterated to latin1 with iconv without any real loss of information.

(Sent by email in reply to Judit Acs's comment of Thu, Nov 20, 2014 at 8:20 PM, https://github.com/juditacs/semeval/issues/7#issuecomment-63863339)
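For the handful of punctuation characters mentioned in the thread (em dashes, curly apostrophes), a transliteration can also be sketched in Python with an explicit mapping. The table below is a hand-picked assumption and does not reproduce what `iconv //TRANSLIT` emits, which depends on the iconv implementation and locale. Accented letters that latin-1 already covers are left untouched.

```python
# Hypothetical mapping chosen for illustration; extend it as the data requires.
PUNCT_TO_LATIN1 = str.maketrans({
    "\u2014": "--",   # em dash
    "\u2013": "-",    # en dash
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / curly apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
})


def transliterate_punctuation(text):
    """Replace common non-latin-1 punctuation with latin-1 (here ASCII) equivalents."""
    return text.translate(PUNCT_TO_LATIN1)


line = "it\u2019s a caf\u00e9 \u2014 really"
clean = transliterate_punctuation(line)
print(clean)
# Raises UnicodeEncodeError if any unmapped non-latin-1 character remains.
clean.encode("latin-1")
```

Unlike the "?" replacement, this keeps the punctuation readable for the tagger, at the cost of maintaining the table by hand.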