**Open** · notani opened this issue 5 years ago
Should we keep the plus sign? In my Indonesian morphology normalizer script I didn't. Also, are you working on this?
> Should we keep the plus sign?

No, `+` is just for illustration purposes. I will do English and Japanese normalization. Were you also working on this?
Yes, the script for Indonesian is at https://github.com/justhalf/bpe_analysis/blob/master/morphind/process_txt.py It uses MorphInd as the morphology analyzer.
> No, `+` is just for illustration purposes.

I asked because in the Indonesian one I explicitly remove `+`. I think we should remove the plus sign, yeah.
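For reference, a minimal sketch of what removing the separator could look like, assuming the analyzer emits forms like `un+test+able` with `+` between morphemes (the exact output format is an assumption, not taken from either script):

```python
def split_morphemes(segmented: str) -> list[str]:
    """Split an analyzer's output like 'un+test+able' into morphemes."""
    return segmented.split("+")


def strip_plus(segmented: str) -> str:
    """Drop the '+' separators entirely, restoring a plain string."""
    return segmented.replace("+", "")


print(split_morphemes("un+test+able"))  # ['un', 'test', 'able']
print(strip_plus("un+test+able"))       # untestable
```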
Did you start English and Japanese normalization, too?
> Did you start English and Japanese normalization, too?
No, I haven't started. I didn't know which morphology analyzer to use. But if we have them, we can simply replace the subprocess call with the corresponding call.
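A minimal sketch of what such a swap could look like, assuming each analyzer is a command-line tool that reads text on stdin and writes its analysis to stdout (the command passed in is a placeholder, not the actual MorphInd or Morpha invocation):

```python
import subprocess


def analyze(text: str, analyzer_cmd: list[str]) -> str:
    """Pipe text through an external command-line morphological analyzer.

    analyzer_cmd is a placeholder; swapping analyzers (e.g. MorphInd vs.
    an English one) would only mean changing this argument.
    """
    result = subprocess.run(
        analyzer_cmd,
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Example with a stand-in command that just echoes its input:
print(analyze("untestable", ["cat"]))
```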
In this paper it says the lexicon (19MB) is large:
Does anyone know a good English morphology analyzer? I was surprised to find almost none; the only one I found was Morfessor, which is automatic.
Maybe this one? http://wiki.apertium.org/wiki/Lttoolbox
> Maybe this one? http://wiki.apertium.org/wiki/Lttoolbox
Does this output surface forms of morphemes?
This statistical morphological segmenter can generate normalized surface forms like `un+test+able+ly`: https://github.com/ryancotterell/treeseg
We can find similar studies by searching for "morphological segmentation" rather than "morphological analysis".
Based on my cursory look, it seems so.
> This statistical morphological segmenter can generate normalized surface forms like `un+test+able+ly`: https://github.com/ryancotterell/treeseg
That's a good one, since it is modern. I was looking for something with more manual analysis, since that would be less automatic, e.g., an FST. But I couldn't find an FST for English.
How about this? https://github.com/knowitall/morpha
That looks good. Do you have one for Japanese as well? (I guess we don't need this for Chinese?)
Fortunately, Japanese segmentation by UDPipe is already morpheme segmentation and has normalized forms. I think we don't need normalization for Chinese.
Can you do English normalization?
I am trying. Morpha apparently only handles plural nouns and verb inflections, not derivations, so `happiness` stays as is.
The goal is normalizing allomorphs (?) so that BPE can find identical morphemes across words.
I expect we'll get results more similar to the UDPipe segmentation if we normalize Japanese morphemes.
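As a toy illustration of the allomorph point (the segmentations below are made up for illustration, not the output of any analyzer discussed here): the surface forms "happiness" and "happily" contain the allomorph "happi", not "happy", so BPE sees two different stem strings; after normalization all three words share one morpheme string that BPE can learn as a single unit.

```python
def morphemes(segmented: str) -> set[str]:
    # 'happy+ness' -> {'happy', 'ness'}
    return set(segmented.split("+"))


# Hypothetical normalized segmentations (illustration only):
normalized = ["happy+ness", "happy+ly", "happy"]

# The morpheme shared by every word after normalization:
shared = set.intersection(*(morphemes(w) for w in normalized))
print(shared)  # {'happy'}
```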