nytud / hunlp-GATE

Lang_Hungarian - a GATE plugin containing Hungarian NLP tools as GATE processing resources
GNU General Public License v3.0

együttjárhatnak: incorrect POS tags #8

Open DavidNemeskey opened 7 years ago

DavidNemeskey commented 7 years ago

The analysis of együttjárhatnak (QT,HFSTLemm,ML3-PosLem-hfstcode) is [V][_Mod][Prs.NDef.3Pl], which is incorrect: the tags [V] and [_Mod] should be [/V] and [_Mod/V], respectively.

HFST does not recognize the word (probably because it should be written separately), so it might be some fallback module that produces this analysis?

Similar invalid analyses are

sassbalint commented 7 years ago

Balázs, can you check this issue, please? :)

dlazesz commented 7 years ago

Did you check the training corpus? Do we have a bugtracker for it at all? (We discovered numerous bugs in it.) For example: Edit (a female first name) has the lemma Edi and some Accusative tag...

I checked the corpus:

./cwszt.conll-2009_ready.disamb.new:együttjárhatnak együttjárhat    V   SubPOS=m|Mood=i|Tense=p|Per=3|Num=p|Def=n   együttjár[V][_Mod][Prs.NDef.3Pl]
./gazdtar.conll-2009_ready.disamb.new:utasíthatja.[Gt.  utasíthatja.[Gt.    X   _   utasít[V][_Mod][Prs.Def.3Sg][Punct]

@DavidNemeskey: Could you check whether the gold standard analysis of each word in the training corpus is in the set of analyses that emMorph gives for that word? This would be the first step towards fixing this kind of issue.
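
A minimal sketch of the coverage check requested above, assuming the corpus has already been read into (word, gold analysis) pairs. The `analyze` callable is hypothetical: it stands in for however emMorph is actually invoked, which this thread does not show. The toy analyzer data at the bottom is made up for illustration and is not real emMorph output.

```python
def find_uncovered(gold_pairs, analyze):
    """Return (word, gold) pairs whose gold analysis the analyzer does not offer.

    gold_pairs: iterable of (word, gold_analysis) tuples read from the corpus.
    analyze:    hypothetical callable mapping a word to an iterable of
                analysis strings (a stand-in for the real emMorph call).
    """
    uncovered = []
    cache = {}  # analyze each distinct word only once
    for word, gold in gold_pairs:
        if word not in cache:
            cache[word] = set(analyze(word))
        if gold not in cache[word]:
            uncovered.append((word, gold))
    return uncovered


# Toy stand-in for the analyzer; the entry below is invented for the demo.
toy_analyses = {'alma': {'alma[/N][Nom]'}}
pairs = [
    ('alma', 'alma[/N][Nom]'),
    ('együttjárhatnak', 'együttjár[V][_Mod][Prs.NDef.3Pl]'),
]
uncovered = find_uncovered(pairs, lambda w: toy_analyses.get(w, ()))
for word, gold in uncovered:
    print(word, gold, sep='\t')
```

Every pair this reports is a candidate corpus bug (or an analyzer gap), so the output could serve as the initial list for a corpus bugtracker.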

@vinczev: If I get a newer version of the corpus, I'll retrain the model and this issue will be solved instantly. (I do not want to change the corpus on my own, as it would then diverge from the versions used by others...) There should be a central repository with a bugtracker for the corpus too!

DavidNemeskey commented 7 years ago

@dlazesz Sorry, but I think this should be done by the owner of the corpus and the tagger, not a third party. :)

That said, the above three are all I have discovered; though I did not specifically look for these differences, I added a mapping for the erroneous tags, and this is the list I ended up with:

{
  '[N]': '[/N]',
  '[V]': '[/V]',
  '[Num]': '[/Num]',
  '[_Mod]': '[_Mod/V]'
}
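
For reference, a small sketch of how such a mapping could be applied to an analysis string as a post-processing step. The function name `fix_tags` and the regular expression are assumptions made for illustration, not code from the plugin; only the four mapping entries come from the list above.

```python
import re

# Mapping from the list above: old-style tag -> new-style tag.
TAG_FIXES = {
    '[N]': '[/N]',
    '[V]': '[/V]',
    '[Num]': '[/Num]',
    '[_Mod]': '[_Mod/V]',
}

# Matches one bracketed tag such as [V] or [Prs.NDef.3Pl].
TAG_RE = re.compile(r'\[[^\[\]]*\]')


def fix_tags(analysis):
    """Replace each known old-style tag; leave every other tag untouched."""
    return TAG_RE.sub(lambda m: TAG_FIXES.get(m.group(0), m.group(0)), analysis)


print(fix_tags('együttjár[V][_Mod][Prs.NDef.3Pl]'))
```

Because tags are matched whole, already correct tags such as [/V] pass through unchanged, so running the function twice gives the same result as running it once.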

dlazesz commented 7 years ago

I am not the owner of the corpus. To avoid later errors in the chain I'll wait for a new version of the corpus. This issue has nothing to do with the tagger.

@DavidNemeskey: Please be so kind as to help the corpus owners by finding bugs in the corpus instead of blaming others who have nothing to do with the issue.

DavidNemeskey commented 7 years ago

I am not blaming anybody; I just don't know where this error stems from. I have already listed all the errors I found.

@vinczev I second the notion of having a bug-tracker for the corpus. The errors I sent a few weeks earlier (disagreement between the old and new-style tags, [Acc] missing, etc.) via email should also be fixed before a new model can be trained.