nytud / hunlp-GATE

Lang_Hungarian - a GATE plugin containing Hungarian NLP tools as GATE processing resources
GNU General Public License v3.0

Fully handle emMorph output #28

Closed DavidNemeskey closed 5 years ago

DavidNemeskey commented 5 years ago

For some mistakenly concatenated words, emMorph has a special output indicating that the input should actually be two tokens:

> úticél
úticél  út[/N]i[_Adjz:i/Adj]** *[/Space]cél[/N][Nom]    0.000000

The GATE module should handle this as well, by splitting the token in question into two.
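A minimal sketch of the requested split, in Python rather than in the plugin's own code. It assumes that the split point is always marked by a run of asterisks and spaces followed by the `[/Space]` tag, as in the sample above. The names `SPACE_MARKER`, `split_analysis` and `surface_form` are hypothetical and not part of emMorph or the GATE plugin.

```python
import re

# Assumption: the boundary is a run of '*' and whitespace followed by [/Space],
# as seen in the sample analysis for "úticél".
SPACE_MARKER = re.compile(r"[*\s]+\[/Space\]")

# Matches any bracketed tag such as [/N], [_Adjz:i/Adj] or [Nom].
TAG = re.compile(r"\[[^\]]*\]")


def split_analysis(analysis: str) -> list[str]:
    """Split one analysis string into per-token analyses at the space marker."""
    return [part for part in SPACE_MARKER.split(analysis) if part]


def surface_form(analysis: str) -> str:
    """Recover the surface string of an analysis by dropping all bracketed tags."""
    return TAG.sub("", analysis)


if __name__ == "__main__":
    sample = "út[/N]i[_Adjz:i/Adj]** *[/Space]cél[/N][Nom]"
    for part in split_analysis(sample):
        print(surface_form(part), part, sep="\t")
```

An analysis without the marker comes back as a one-element list, so regular tokens would pass through unchanged.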

dlazesz commented 5 years ago

Sorry, but this version of e-magyar is no longer maintained. Please refer to https://github.com/dlt-rilmta/e-magyar-tsv for the newer version.

Regarding your issue: we do not plan to implement your request in the newer version either. However, you can easily create your own module there to do this kind of text normalisation, which does not belong to any existing module. In any case, we would be happy to include your module in the new version of e-magyar.

sassbalint commented 5 years ago

@DavidNemeskey Some further thoughts on why we do not plan to implement such functionality (at least currently). The problem in question is one particular spellchecking problem. The first question is: do we need spellchecking (and corrected spelling) in a corpus processing pipeline at all? The answer is basically 'no', because it is generally worth sticking to the original corpus text. But if the answer is 'yes', then, I think, a fully functional spellchecker is needed as a standalone module (maybe after the tokenizer and before the morphological analyzer). It does not seem to be a good idea to implement one particular spellchecking function hidden somewhere inside a module of a corpus processing pipeline. Our 'solution' might be to simply delete "* [/Space]". :)

tamvar commented 5 years ago

Bálint! A very clever and well-argued point. Regards, T


DavidNemeskey commented 5 years ago

@sassbalint I would be happy with either solution: separating the token into two or deleting the ** *[/Space] part. The main point is that the lemma should be meaningful, which it is not right now.
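For comparison, the deletion variant could look roughly like the Python sketch below. It assumes the marker always has the shape seen in the sample output (asterisks and spaces followed by `[/Space]`). `drop_space_marker` is a hypothetical helper, not an existing emMorph or e-magyar function.

```python
import re

# Assumption: the marker is a run of '*' and whitespace followed by [/Space].
SPACE_MARKER = re.compile(r"[*\s]+\[/Space\]")


def drop_space_marker(analysis: str) -> str:
    """Remove the space marker so the analysis reads as a single token again."""
    return SPACE_MARKER.sub("", analysis)


if __name__ == "__main__":
    print(drop_space_marker("út[/N]i[_Adjz:i/Adj]** *[/Space]cél[/N][Nom]"))
```

This keeps the original tokenization and only cleans up the analysis string. Whether the lemma derived from it is then meaningful would depend on the lemmatizer downstream.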