DavidNemeskey closed this issue 5 years ago
Sorry, but this version of e-magyar is no longer maintained. Please refer to https://github.com/dlt-rilmta/e-magyar-tsv for the newer version.
Regarding your issue: we still do not plan to implement your request in the newer version, but it lets you easily create your own module to perform the kind of text normalisation you mention, which does not belong to any existing module. In any case, we would be happy to include your module in the new version of e-magyar.
@DavidNemeskey Some further thoughts on why we do not plan to implement such functionality (at least currently). The problem in question is one particular spellchecking problem. The first question is: do we need spellchecking (and corrected spelling) in a corpus processing pipeline at all? The answer is basically 'no', because it is generally worth sticking to the original corpus text. But if the answer is 'yes', then I think a fully functional spellchecker is needed as a standalone module (maybe after the tokenizer and before the morphological analyzer). It does not seem to be a good idea to implement one particular piece of spellchecking functionality hidden somewhere inside a module of a corpus processing pipeline. Our 'solution' might be to simply delete "* [/Space]". :)
Bálint! A very clever and well-argued point. Regards, T
@sassbalint I would be happy with either solution: splitting the token into two, or deleting the "* [/Space]" part. The main point is that the lemma should be meaningful, which it is not right now.
For some mistakenly concatenated words, emMorph has a special output indicating that the input should actually be two tokens:
The GATE module should handle this as well, by splitting the token in question into two.
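To make the two options discussed in this thread concrete, here is a minimal Python sketch. It is not emMorph or GATE code: it assumes, purely for illustration, that the lemma arrives as a plain string with the literal marker "* [/Space]" sitting between the two parts. The function names and the sample lemma are hypothetical, and the real emMorph output format may differ.

```python
# Marker string as quoted in this thread. Its exact position inside real
# emMorph output is an assumption made for this sketch.
SPACE_MARKER = "* [/Space]"


def strip_space_marker(lemma: str) -> str:
    """Option 1: delete the marker and keep the rest of the lemma as one token."""
    return lemma.replace(SPACE_MARKER, "")


def split_on_space_marker(lemma: str) -> list[str]:
    """Option 2: split the lemma into separate tokens at the marker."""
    parts = (part.strip() for part in lemma.split(SPACE_MARKER))
    # Drop empty pieces in case the marker sits at the start or end.
    return [part for part in parts if part]


if __name__ == "__main__":
    sample = "left* [/Space]right"  # hypothetical lemma, not real emMorph output
    print(strip_space_marker(sample))     # leftright
    print(split_on_space_marker(sample))  # ['left', 'right']
```

Either function leaves a lemma without the marker unchanged (the split variant returns it as a one-element list), so it could be applied to every token without special-casing.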