justhalf / bpe_analysis

Analysis of BPE on four languages: English, Indonesian, Chinese, Japanese

Running morphological analyzers in English and Japanese #2

Open notani opened 5 years ago

notani commented 5 years ago

Normalizing allomorphs (?) so that BPE can find identical morphemes across words.

cats -> cat+s
boxes -> box+s

usefulness -> useful+ness
happiness -> happy+ness
食べる taberu = eat
食べた tabeta = eat+past -> 食べる+た taberu+ta
食べなかった tabenakatta = did not eat -> 食べる+ない+た taberu+nai+ta

I expect we will get results closer to the UDPipe segmentation if we normalize Japanese morphemes.
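To make the intended input/output concrete, here is a minimal sketch of the normalization step using a hand-written toy lexicon that contains only the examples above. The names `TOY_LEXICON` and `normalize` are hypothetical. A real pipeline would get the analyses from a morphological analyzer instead of a dictionary, and the output here is a flat list of normalized morphemes with no plus signs.

```python
from typing import Dict, List

# Toy lexicon (an assumption for illustration): surface form -> normalized morphemes.
TOY_LEXICON: Dict[str, List[str]] = {
    "cats": ["cat", "s"],
    "boxes": ["box", "s"],
    "usefulness": ["useful", "ness"],
    "happiness": ["happy", "ness"],
    "食べた": ["食べる", "た"],
    "食べなかった": ["食べる", "ない", "た"],
}


def normalize(tokens: List[str], lexicon: Dict[str, List[str]] = TOY_LEXICON) -> List[str]:
    """Replace each token with its normalized morphemes; unknown tokens pass through."""
    morphemes: List[str] = []
    for token in tokens:
        morphemes.extend(lexicon.get(token, [token]))
    return morphemes
```

With this, `cats` and `boxes` both end in the same `s` morpheme, so BPE trained on the normalized text sees one shared unit rather than two allomorphs.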

justhalf commented 5 years ago

Should we keep the plus sign? In my Indonesian morphology normalizer script I didn't. Also, are you working on this?

notani commented 5 years ago

> Should we keep the plus sign?

No, + is just for illustration purposes.

I will do English and Japanese normalization. Were you also working on this?

justhalf commented 5 years ago

Yes, the script for Indonesian is at https://github.com/justhalf/bpe_analysis/blob/master/morphind/process_txt.py. It uses MorphInd as the morphological analyzer.

> No, + is just for illustration purposes.

I asked because in the Indonesian one I explicitly remove +. I think we should remove the plus sign, yeah.

notani commented 5 years ago

Did you start English and Japanese normalization, too?

justhalf commented 5 years ago

> Did you start English and Japanese normalization, too?

No, I haven't started. I didn't know which morphological analyzer to use. But once we have analyzers for them, we can simply replace the subprocess call in the script with the corresponding call.
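A rough sketch of what "replace the subprocess call" could look like, assuming each analyzer is a command-line tool that reads text on stdin and writes its analysis to stdout. This is not the code in `process_txt.py`. The names `make_analyzer`, `strip_plus`, `process`, and `ECHO_COMMAND` are hypothetical, and replacing `+` with a space (rather than deleting it) is also an assumption.

```python
import subprocess
import sys
from typing import Callable, Sequence


def make_analyzer(command: Sequence[str]) -> Callable[[str], str]:
    """Wrap an external analyzer that reads stdin and writes stdout."""
    def analyze(text: str) -> str:
        result = subprocess.run(
            list(command),
            input=text,
            capture_output=True,
            text=True,
            check=True,  # raise if the analyzer exits with an error
        )
        return result.stdout
    return analyze


def strip_plus(analysis: str) -> str:
    """Replace '+' morpheme boundaries with spaces and collapse whitespace."""
    return " ".join(analysis.replace("+", " ").split())


def process(text: str, analyze: Callable[[str], str]) -> str:
    """Run the analyzer, then drop the plus signs."""
    return strip_plus(analyze(text))


# Stand-in for a real analyzer: a Python one-liner that echoes stdin unchanged.
ECHO_COMMAND = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
```

Switching languages would then only mean passing a different command to `make_analyzer`; the rest of the processing stays the same.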

justhalf commented 5 years ago

This paper says the lexicon (19 MB) is large: [image]

Does anyone know a good English morphological analyzer? I was surprised to find none, only Morfessor, which is automatic.

justhalf commented 5 years ago

Maybe this one? http://wiki.apertium.org/wiki/Lttoolbox

notani commented 5 years ago

> Maybe this one? http://wiki.apertium.org/wiki/Lttoolbox

Does this output surface forms of morphemes?

This statistical morphological segmenter can generate normalized surface forms like un+test+able+ly: https://github.com/ryancotterell/treeseg

We can find similar studies by searching for "morphological segmentation" rather than "morphological analysis".

justhalf commented 5 years ago

Based on my cursory look, it seems so.

> This statistical morphological segmenter can generate normalized surface forms like un+test+able+ly: https://github.com/ryancotterell/treeseg

That's a good one, since it is modern. I was looking for something that relies more on manual analysis and is less automatic, e.g., an FST, but I couldn't find an FST for English.

notani commented 5 years ago

How about this? https://github.com/knowitall/morpha

justhalf commented 5 years ago

That looks good. Do you have one for Japanese as well? (I guess we don't need this for Chinese?)

notani commented 5 years ago

Fortunately, Japanese segmentation by UDPipe is already morpheme segmentation and has normalized forms. I think we don't need normalization for Chinese.

Can you do English normalization?

justhalf commented 5 years ago

I am trying. Morpha apparently handles only inflection (plural nouns and verb forms), not derivation, so happiness stays as is.
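To illustrate the gap, here is a hypothetical suffix rule for the one derivational case from the examples, `-ness`. It is not part of Morpha and is only a sketch: the function name `split_ness` is made up, and the rule over-applies to words that merely end in these letters (for example, `witness`).

```python
from typing import List


def split_ness(word: str) -> List[str]:
    """Split a trailing '-ness', restoring a stem-final 'y' that was spelled 'i'."""
    if word.endswith("iness") and len(word) > len("iness"):
        # happiness -> happ + y, ness
        return [word[: -len("iness")] + "y", "ness"]
    if word.endswith("ness") and len(word) > len("ness"):
        # usefulness -> useful, ness
        return [word[: -len("ness")], "ness"]
    return [word]
```

A real solution would need a lexicon or a trained segmenter to decide when the split is valid; the rule only shows what the normalized output for derivations would look like.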