nytud / hunlp-GATE

Lang_Hungarian - a GATE plugin containing Hungarian NLP tools as GATE processing resources
GNU General Public License v3.0

MagyarLanc tokenizer + emMorph+emLem messes up GATE #10

Open DavidNemeskey opened 7 years ago

DavidNemeskey commented 7 years ago

When the MagyarLanc tokenizer is used together with emMorph+emLem, some inputs corrupt the output. The anas of the word following the problematic token is empty, and its analyses are "shifted" to the word after that. The shift affects the rest of the text, but it skips space tokens, which remain correct. The shift accumulates: if there is another problematic token further down the line, the analyses of the words after it will be shifted by two positions, and so on. What's more, the shifted analyses stay in the memory of the GATE server and affect any subsequent text passed to the server.

Because GitHub and attachments don't go well together, I list the information needed to reproduce this problem below. The third step in the pipeline is not required, but I added it so that the lemma is exposed as well. As I see it, the problem is caused by the .<newline>. part, which the tokenizer turns into a single token .. (by the way, the downloadable version of ML3 returns two separate tokens in this case). However, emLem (I guess) returns the lemma with the newline in it. Whether this is related to the problem, I don't know.

Input:

Intelligens rendszeroperáció biztosítja a folyamatos, zavarmentes működést, távdiagnosztikával a gyors rendszerellenőrzést és konfigurációmódosítást is lehetővé téve.
. Szervizünk vállalja az azonnali, 24 órás hibaelhárítás megkezdést, a szolgáltatások igényelt formája szerint.

Configuration:

# ML Tokenizer + HFST
# -----

# HU 1. "emToken" Sentence Splitter and Tokenizer (QunToken, native) [Linux]
# hu.nytud.gate.tokenizers.QunTokenCommandLine
com.precognox.kconnect.gate.magyarlanc.HungarianTokenizerSentenceSplitter

# HU 2. "emMorph+emLem" Morphological Analyzer and Lemmatizer (HFST, hfst, native+java)
hu.nytud.gate.morph.HFSTMorphAndLemma

# HU 3. "emTag" POS Tagger and Lemmatizer (PurePOS in magyarlanc3.0, hfst)
hu.nytud.gate.postaggers.Magyarlanc3POSTaggerLemmatizer

Relevant parts of the output:

<Annotation Id="34" Type="Token" StartNode="165" EndNode="168">
<Feature>
  <Name className="java.lang.String">anas</Name>
  <Value className="java.util.ArrayList"></Value>
</Feature>
<Feature>
  <Name className="java.lang.String">string</Name>
  <Value className="java.lang.String">..</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">hfstana</Name>
  <Value className="java.lang.String">[/Num|Digit][_Ord/Adj][Nom][]</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">feature</Name>
  <Value className="java.lang.String">SubPOS=o|Num=s|Cas=n|Form=d|NumP=none|PerP=none|NumPd=none</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">pos</Name>
  <Value className="java.lang.String">Num</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">length</Name>
  <Value className="java.lang.Long">3</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">lemma</Name>
  <Value className="java.lang.String">.
.</Value>
</Feature>
</Annotation>
<Annotation Id="35" Type="Sentence" StartNode="169" EndNode="278">
</Annotation>
<Annotation Id="36" Type="Token" StartNode="169" EndNode="179">
<Feature>
  <Name className="java.lang.String">anas</Name>
  <Value className="java.util.ArrayList"></Value>
</Feature>
<Feature>
  <Name className="java.lang.String">string</Name>
  <Value className="java.lang.String">Szervizünk</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">hfstana</Name>
  <Value className="java.lang.String">[/V][Prs.NDef.1Pl]</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">feature</Name>
  <Value className="java.lang.String">SubPOS=m|Mood=i|Tense=s|Per=none|Num=none|Def=none</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">pos</Name>
  <Value className="java.lang.String">V</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">length</Name>
  <Value className="java.lang.Long">10</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">lemma</Name>
  <Value className="java.lang.String">szerviz</Value>
</Feature>
</Annotation>
<Annotation Id="37" Type="SpaceToken" StartNode="168" EndNode="169">
<Feature>
  <Name className="java.lang.String">length</Name>
  <Value className="java.lang.Long">1</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">string</Name>
  <Value className="java.lang.String"></Value>
</Feature>
</Annotation>
<Annotation Id="38" Type="Token" StartNode="180" EndNode="188">
<Feature>
  <Name className="java.lang.String">anas</Name>
  <Value className="java.util.ArrayList" itemClassName="java.lang.String">{ana=szer[/N]=szer+víz[/N]=viz+ünk[Poss.1Pl]=ünk+[Nom], feats=[/N][Poss.1Pl][Nom], lemma=szervíz};{ana=szerv[/N]=szerv+i[_Adjz:i/Adj]=i+z[_NVbz_Tr:z/V]=z+ünk[Prs.NDef.1Pl]=ünk, feats=[/V][Prs.NDef.1Pl], lemma=szerviz};{ana=szerv[/N]=szerv+i[_Adjz:i/Adj]=i+zik[_NVbz_Ntr:zik/V]=z+ünk[Prs.NDef.1Pl]=ünk, feats=[/V][Prs.NDef.1Pl], lemma=szervizik};{ana=szervi[/Adj]=szervi+z[_NVbz_Tr:z/V]=z+ünk[Prs.NDef.1Pl]=ünk, feats=[/V][Prs.NDef.1Pl], lemma=szerviz};{ana=szervi[/Adj]=szervi+zik[_NVbz_Ntr:zik/V]=z+ünk[Prs.NDef.1Pl]=ünk, feats=[/V][Prs.NDef.1Pl], lemma=szervizik};{ana=szerviz[/N]=szerviz+ünk[Poss.1Pl]=ünk+[Nom], feats=[/N][Poss.1Pl][Nom], lemma=szerviz}</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">string</Name>
  <Value className="java.lang.String">vállalja</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">hfstana</Name>
  <Value className="java.lang.String">[/N][Poss.1Pl][Nom]</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">feature</Name>
  <Value className="java.lang.String">SubPOS=c|Num=s|Cas=n|NumP=none|PerP=none|NumPd=none</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">pos</Name>
  <Value className="java.lang.String">N</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">length</Name>
  <Value className="java.lang.Long">8</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">lemma</Name>
  <Value className="java.lang.String">szerviz</Value>
</Feature>
</Annotation>
<Annotation Id="39" Type="SpaceToken" StartNode="179" EndNode="180">
<Feature>
  <Name className="java.lang.String">length</Name>
  <Value className="java.lang.Long">1</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">string</Name>
  <Value className="java.lang.String"></Value>
</Feature>
</Annotation>
<Annotation Id="40" Type="Token" StartNode="189" EndNode="191">
<Feature>
  <Name className="java.lang.String">anas</Name>
  <Value className="java.util.ArrayList" itemClassName="java.lang.String">{ana=váll[/N]=váll+alj[/N]=alj+a[Poss.3Sg]=a+[Nom], feats=[/N][Poss.3Sg][Nom], lemma=vállalj};{ana=váll[/N]=váll+alja[/N]=alj+a[Poss.3Sg]=a+[Nom], feats=[/N][Poss.3Sg][Nom], lemma=vállalja};{ana=vállal[/V]=vállal+ja[Prs.Def.3Sg]=ja, feats=[/V][Prs.Def.3Sg], lemma=vállal};{ana=vállal[/V]=vállal+ja[Sbjv.Def.3Sg]=ja, feats=[/V][Sbjv.Def.3Sg], lemma=vállal}</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">string</Name>
  <Value className="java.lang.String">az</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">hfstana</Name>
  <Value className="java.lang.String">[/N][Poss.3Sg][Nom]</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">feature</Name>
  <Value className="java.lang.String">SubPOS=c|Num=s|Cas=n|NumP=s|PerP=3|NumPd=none</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">pos</Name>
  <Value className="java.lang.String">N</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">length</Name>
  <Value className="java.lang.Long">2</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">lemma</Name>
  <Value className="java.lang.String">vállalja</Value>
</Feature>
</Annotation>

sassbalint commented 7 years ago

The config file you provided does not use emToken. Please make clear which module causes the problem (emMorph+emLem? emTag?) and update the issue accordingly.

DavidNemeskey commented 7 years ago

You are right, I meant emLem (or whatever hu.nytud.gate.morph.HFSTMorphAndLemma is called). As I said before, the third step in the config file (emTag) is not necessary to reproduce the problem, but perhaps it gives more information. So it's either the first or the second step; I guessed the second (emLem), because when only the first step is run, the output seems OK (aside from #9).

DavidNemeskey commented 7 years ago

OK, as I suspected, the problem is the .<newline>. part. HungarianTokenizerSentenceSplitter returns it as a .. token. However, HFSTMorphAndLemma takes the string from the original text, not from the tokenizer, and so it passes the newline to HFST, which perceives it as two words.
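To make the mechanism above concrete, here is a toy model of the shift in plain Java. It is only a sketch under assumptions: `ShiftDemo`, `send` and `analyze` are hypothetical names, the analyzer is modelled as a queue that produces one answer per input line, and none of this is the actual HFST wrapper code. The words are ASCII-only tokens picked from the example input to keep the sample encoding-independent.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model (hypothetical, not the real wrapper) of a line-oriented analyzer read once per token. */
public class ShiftDemo {
    // Answers the analyzer has produced but nobody has read yet.
    // The queue belongs to the long-lived object, so it survives between documents.
    private final Deque<String> pending = new ArrayDeque<>();

    /** Sends one token; the analyzer answers once per input LINE, not once per token. */
    private void send(String token) {
        for (String line : token.split("\n")) {
            pending.add("ana(" + line + ")");
        }
    }

    /** Sends each token and then reads exactly one answer for it. */
    public List<String> analyze(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            send(token);
            result.add(pending.poll());
        }
        return result;
    }

    public static void main(String[] args) {
        ShiftDemo demo = new ShiftDemo();
        // The first token contains a newline, so it yields two answers.
        System.out.println(demo.analyze(List.of(".\n.", "az", "azonnali")));
        // A later document still receives the stale answer left in the queue.
        System.out.println(demo.analyze(List.of("a")));
    }
}
```

In this model the token after the newline-containing one receives the second "." answer, every later token receives its predecessor's answer, and the unread answer leaks into the next call, which matches the symptoms described in the first comment.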

There could be two solutions to this issue:

  1. HungarianTokenizerSentenceSplitter should not join token( fragment)s separated by newline(s)
  2. HFSTMorphAndLemma should consume the token returned by the tokenizer, and not look at the original text

I vote for the second option, because

  1. it is only natural that modules build on each other's outputs
  2. there are valid cases where joining fragments over newlines makes sense, such as syllabification
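The difference between the two ways of obtaining the word form can be sketched as follows. This is a minimal illustration, not the plugin's code: `Token` is a hypothetical stand-in for a GATE Token annotation (offsets plus a feature map), and `formFromText` / `formFromFeature` are made-up helper names. In a real processing resource the second variant would presumably read the `string` feature through the annotation's `getFeatures()` map.

```java
import java.util.HashMap;
import java.util.Map;

public class TokenSurfaceDemo {
    /** Hypothetical stand-in for a GATE Token annotation: offsets plus a feature map. */
    static final class Token {
        final int start;
        final int end;
        final Map<String, Object> features = new HashMap<>();

        Token(int start, int end, String string) {
            this.start = start;
            this.end = end;
            features.put("string", string); // the form chosen by the tokenizer
        }
    }

    /** Behaviour described in the thread: cut the form out of the original text. */
    static String formFromText(String text, Token t) {
        return text.substring(t.start, t.end);
    }

    /** Option 2: use the form the tokenizer stored in the "string" feature. */
    static String formFromFeature(Token t) {
        return (String) t.features.get("string");
    }

    public static void main(String[] args) {
        String text = "is.\n. Sz";           // ".<newline>." occupies offsets 2..5
        Token t = new Token(2, 5, "..");     // the tokenizer joined it into ".."

        // The text-based form still contains the newline, the feature-based one does not.
        System.out.println(formFromText(text, t).contains("\n"));    // true
        System.out.println(formFromFeature(t).contains("\n"));       // false
    }
}
```

With the feature-based variant the analyzer never sees the newline, so it is asked about exactly one word per token, whatever the tokenizer decided to join.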

BTW the source of the HFST wrappers should be made available as well.

@sassbalint Opinions? Could you also assign the people responsible for these components to this issue? Thx.

DavidNemeskey commented 7 years ago

Created a pull request (#12) in which HFSTMorphAndLemma takes the token from the tokenizer (the string feature). However, at least the next module (PurePos) also reads the original text, so this is still not a final solution. It would be really good to reach an agreement about this.

DavidNemeskey commented 7 years ago

Ping.