google-code-export / dkpro-tc

Automatically exported from code.google.com/p/dkpro-tc

Phonetic ngrams in POS ngram DFE can't be turned off? Crash on some characters #133

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run the Lucene POS ngram DFE on Twitter data
2. Get a Soundex exception saying "character cannot be phonetized"
3. The phonetization seems to be set in the POS meta collector and cannot be 
turned off from the FE parameters in Groovy

What is the expected output? What do you see instead?

These two things (POS ngrams and phonetic ngrams) should definitely be 
separable. I am confused about this revision: is it intended to work this way? 
If so, why?

  public class LucenePOSNGramMetaCollector
      extends LuceneBasedMetaCollector
  {
-    @ConfigurationParameter(name = LucenePOSNGramFeatureExtractorBase.PARAM_POS_NGRAM_MIN_N, mandatory = true, defaultValue = "1")
+    @ConfigurationParameter(name = LucenePhoneticNGramFeatureExtractorBase.PARAM_PHONETIC_NGRAM_MIN_N, mandatory = true, defaultValue = "1")
      private int posNgramMinN;

Original issue reported on code.google.com by l.flek...@gmail.com on 23 May 2014 at 9:13

GoogleCodeExporter commented 9 years ago
You are right to be confused.
This is a bad case of copy and paste gone wrong.
Thanks for spotting that.
I will fix it so that the POS ngram extractor works as expected again.

The issue with "character cannot be phonetized" should, however, be fixed as well.
Do you have an example that causes this exception?

Original comment by torsten....@gmail.com on 24 May 2014 at 9:03

GoogleCodeExporter commented 9 years ago
Thx. Most likely it was either #, @ or a smiley, given that it was reading 
tweets. I'll check that out once my current pipeline finishes.

Original comment by l.flek...@gmail.com on 24 May 2014 at 9:51

GoogleCodeExporter commented 9 years ago
I tried with a bunch of special characters and they all worked fine.

So it would be good to have the bad string in order to help me reproduce the 
problem.

Original comment by torsten....@gmail.com on 24 May 2014 at 6:46

GoogleCodeExporter commented 9 years ago
Okay, so it is this character: ʉ in this sequence: �ʉ�_ which happens to 
be present in some of the hyperlinks. Probably I messed up some escaped 
character sequence in the data, so the error is between the chair and the 
laptop ;) For normal characters it should be fairly failsafe :-) 

Caused by: java.lang.IllegalArgumentException: The character is not mapped: Ʉ
    at org.apache.commons.codec.language.Soundex.map(Soundex.java:226)
    at org.apache.commons.codec.language.Soundex.getMappingCode(Soundex.java:180)
    at org.apache.commons.codec.language.Soundex.soundex(Soundex.java:264)
    at org.apache.commons.codec.language.Soundex.encode(Soundex.java:162)
    at de.tudarmstadt.ukp.dkpro.tc.features.ngram.util.NGramUtils.getDocumentPhoneticNgrams(NGramUtils.java:167)
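
A possible explanation for why #, @ and smileys pass while ʉ fails, offered as a sketch and not as the fix applied in DKPro TC: as far as I know, commons-codec's Soundex first drops every non-letter and upper-cases the rest, then looks each remaining character up in a 26-entry A-Z table. Symbols are stripped before the lookup, but ʉ is a letter, upper-cases to Ʉ, and has no slot in that table. The class and method names below (SoundexInputCheck, clean, isSoundexSafe) are hypothetical and only mimic that assumed behaviour with plain JDK calls, so that a token could be checked before it is handed to the encoder.

```java
import java.util.Locale;

public class SoundexInputCheck {

    /**
     * Mimics the cleaning step that Soundex is assumed to apply:
     * drop every non-letter, then upper-case with an English locale.
     */
    public static String clean(String token) {
        StringBuilder letters = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetter(c)) {
                letters.append(c);
            }
        }
        return letters.toString().toUpperCase(Locale.ENGLISH);
    }

    /**
     * Returns true if every letter left after cleaning lies in A-Z,
     * i.e. would have an entry in a 26-letter mapping table.
     */
    public static boolean isSoundexSafe(String token) {
        String cleaned = clean(token);
        for (int i = 0; i < cleaned.length(); i++) {
            char c = cleaned.charAt(i);
            if (c < 'A' || c > 'Z') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // \u0289 is the character reported in this thread (ʉ).
        String[] samples = { "#hashtag", "@user", ":-)", "caf\u0289" };
        for (String s : samples) {
            System.out.println(s + " -> " + (isSoundexSafe(s) ? "encode" : "skip"));
        }
    }
}
```

A caller could skip tokens for which isSoundexSafe returns false, or wrap the encode call in a try/catch for IllegalArgumentException instead; which of the two suits the meta collector better is a design choice this thread does not settle.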

It happens in the MetaTask which 

Original comment by l.flek...@gmail.com on 24 May 2014 at 8:06

GoogleCodeExporter commented 9 years ago
I tested with those characters and I got no mapping errors here.

So we will leave that issue closed until someone runs into the same problem 
again :)

Original comment by torsten....@gmail.com on 24 May 2014 at 8:10

GoogleCodeExporter commented 9 years ago

Original comment by daxenber...@gmail.com on 13 Jun 2014 at 3:18