Open DavidNemeskey opened 7 years ago
Balázs, can you check this issue, please? :)
Did you check the training corpus? Do we have a bug tracker for it at all? (We have discovered numerous bugs in it.) For example: Edit (a female given name) has the lemma Edi and some accusative tag...
I checked the corpus:
./cwszt.conll-2009_ready.disamb.new:együttjárhatnak együttjárhat V SubPOS=m|Mood=i|Tense=p|Per=3|Num=p|Def=n együttjár[V][_Mod][Prs.NDef.3Pl]
./gazdtar.conll-2009_ready.disamb.new:utasíthatja.[Gt. utasíthatja.[Gt. X _ utasít[V][_Mod][Prs.Def.3Sg][Punct]
@DavidNemeskey: Could you check whether the gold-standard analysis of each word in the training corpus is in the set of analyses that emMorph gives for it? This would be the first step towards fixing this kind of issue.
@vinczev: If I get a newer version of the corpus, I'll retrain the model and this issue will be solved instantly. (I do not want to change the corpus on my own, as it would then diverge from the versions used by others...) There should be a central repository with a bug tracker for the corpus too!
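The coverage check requested above could be sketched roughly as follows. This is only an illustration: `find_uncovered`, `toy_analyze` and the toy rows are hypothetical names and data, and the real emMorph call would have to be plugged in where the toy analyzer stands. It assumes the corpus has already been reduced to (word form, gold analysis) pairs.

```python
from typing import Callable, Iterable, Iterator, Set, Tuple


def find_uncovered(
    rows: Iterable[Tuple[str, str]],
    analyze: Callable[[str], Set[str]],
) -> Iterator[Tuple[str, str]]:
    """Yield (form, gold) pairs whose gold analysis the analyzer does not offer."""
    for form, gold in rows:
        if gold not in analyze(form):
            yield form, gold


def toy_analyze(form: str) -> Set[str]:
    """Hypothetical stand-in for the morphological analyzer (toy data only)."""
    table = {'alma': {'alma[/N][Nom]'}}
    # Unknown words get an empty analysis set.
    return table.get(form, set())


# Toy rows: the second gold analysis is the old-style one quoted in this thread.
rows = [
    ('alma', 'alma[/N][Nom]'),
    ('együttjárhatnak', 'együttjár[V][_Mod][Prs.NDef.3Pl]'),
]

for form, gold in find_uncovered(rows, toy_analyze):
    print(form, gold, sep='\t')
```

Passing the analyzer in as a callable keeps the check independent of how emMorph is actually invoked.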
@dlazesz Sorry, but I think this should be done by the owner of the corpus and the tagger, not a third party. :)
That said, the above three are all I have discovered; though I did not specifically look for these differences, I added a mapping for the erroneous tags, and this is the list I ended up with:
{
    '[N]': '[/N]',
    '[V]': '[/V]',
    '[Num]': '[/Num]',
    '[_Mod]': '[_Mod/V]'
}
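For illustration, a mapping like this could be applied to an analysis string as sketched below. The names `TAG_FIXES`, `TAG_RE` and `fix_tags` are made up for the example, and this is not necessarily how the mapping was applied in practice. Matching whole bracketed tags, rather than using plain substring replacement, ensures that an already correct tag such as `[/V]` is left alone.

```python
import re

# Mapping from this thread: old-style tag -> new-style tag.
TAG_FIXES = {
    '[N]': '[/N]',
    '[V]': '[/V]',
    '[Num]': '[/Num]',
    '[_Mod]': '[_Mod/V]',
}

# Matches one complete bracketed tag, e.g. "[V]" or "[Prs.NDef.3Pl]".
TAG_RE = re.compile(r'\[[^\[\]]*\]')


def fix_tags(analysis: str) -> str:
    """Replace each old-style tag by its new-style form; keep all other tags."""
    return TAG_RE.sub(lambda m: TAG_FIXES.get(m.group(0), m.group(0)), analysis)


print(fix_tags('együttjár[V][_Mod][Prs.NDef.3Pl]'))
# prints: együttjár[/V][_Mod/V][Prs.NDef.3Pl]
```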
I am not the owner of the corpus. To avoid later errors in the chain I'll wait for a new version of the corpus. This issue has nothing to do with the tagger.
@DavidNemeskey: Please be so kind as to help the corpus owners by finding bugs in the corpus instead of blaming others who have nothing to do with the issue.
I am not blaming anybody, I just don't know where this error stems from. I have already listed all errors I found.
@vinczev I second the notion of having a bug tracker for the corpus. The errors I sent a few weeks earlier via email (disagreement between the old and new-style tags, [Acc] missing, etc.) should also be fixed before a new model can be trained.
The analysis of együttjárhatnak (QT,HFSTLemm,ML3-PosLem-hfstcode) is [V][_Mod][Prs.NDef.3Pl], which is incorrect: the tags [V] and [_Mod] should be [/V] and [_Mod/V], respectively. HFST does not recognize the word (probably because it should be written separately), so it might be some fallback module that produces this analysis?
Similar invalid analyses are:
[N][All]
[Num][Nom] (interestingly enough, HFST returns an analysis for tíz-, so why doesn't it appear in GATE? This word was at the beginning of the sentence, hence the capitalization, but that is usually not a problem)