Closed GoogleCodeExporter closed 9 years ago
You are right to be confused.
A bad case of c&p gone wrong.
Thanks for spotting that.
I will fix that so that POS n-grams work as expected again.
The issue with "character cannot be phonetized" should however be fixed as well.
Do you have an example that causes this exception?
Original comment by torsten....@gmail.com
on 24 May 2014 at 9:03
Thx. Most likely it was either #, @ or a smiley, given that it was reading
tweets. I'll check that out once my current pipeline finishes.
Original comment by l.flek...@gmail.com
on 24 May 2014 at 9:51
I tried with a bunch of special characters and they all worked fine.
So it would be good to have the bad string in order to help me reproduce the
problem.
Original comment by torsten....@gmail.com
on 24 May 2014 at 6:46
Okay, so it is this character: ʉ in this sequence: �ʉ�_ which happens to
be present in some of the hyperlinks. Probably I messed up some escaped
character sequence in the data, so the error is between the chair and the
laptop ;) For normal characters it should be fairly failsafe :-)
Caused by: java.lang.IllegalArgumentException: The character is not mapped: Ʉ
    at org.apache.commons.codec.language.Soundex.map(Soundex.java:226)
    at org.apache.commons.codec.language.Soundex.getMappingCode(Soundex.java:180)
    at org.apache.commons.codec.language.Soundex.soundex(Soundex.java:264)
    at org.apache.commons.codec.language.Soundex.encode(Soundex.java:162)
    at de.tudarmstadt.ukp.dkpro.tc.features.ngram.util.NGramUtils.getDocumentPhoneticNgrams(NGramUtils.java:167)
It happens in the MetaTask which
Original comment by l.flek...@gmail.com
on 24 May 2014 at 8:06
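For anyone trying to reproduce this: the trace above reports the upper-cased form (Ʉ) of the offending letter ʉ, which suggests that Soundex drops non-letters such as # and @, upper-cases what is left, and only has mappings for A-Z. That reading is an assumption based on the trace, not a statement about the DKPro TC code. Under that assumption, a minimal stand-alone sketch of a pre-check could look like this (the class name SoundexGuard and the method isSoundexSafe are hypothetical and are not part of commons-codec or DKPro TC):

```java
import java.util.Locale;

// Hypothetical helper, not part of commons-codec or DKPro TC.
final class SoundexGuard {

    // Assumption: the encoder keeps only letters, upper-cases them,
    // and has no mapping for anything outside A-Z.
    static boolean isSoundexSafe(String token) {
        // Keep only the letters, as the encoder is assumed to do.
        StringBuilder letters = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char ch = token.charAt(i);
            if (Character.isLetter(ch)) {
                letters.append(ch);
            }
        }
        // Upper-case the remaining letters and require plain A-Z.
        String upper = letters.toString().toUpperCase(Locale.ENGLISH);
        for (int i = 0; i < upper.length(); i++) {
            char ch = upper.charAt(i);
            if (ch < 'A' || ch > 'Z') {
                return false; // e.g. the letter U+0289 from the trace
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Sample tokens of the kind found in tweets; \u0289 is the letter from the trace.
        String[] tokens = {"#hashtag", "@user", ":-)", "caf\u00e9", "\u0289"};
        for (String token : tokens) {
            System.out.println(token + " -> " + isSoundexSafe(token));
        }
    }
}
```

If the assumption holds, tokens that fail this check could be skipped or cleaned before they reach the phonetic encoder, and it would also explain why #, @ and smileys alone do not trigger the exception.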
I tested with those characters and I got no mapping errors here.
So we will leave that issue closed until someone runs into the same problem
again :)
Original comment by torsten....@gmail.com
on 24 May 2014 at 8:10
Original comment by daxenber...@gmail.com
on 13 Jun 2014 at 3:18
Original issue reported on code.google.com by
l.flek...@gmail.com
on 23 May 2014 at 9:13