tshatrov / ichiran

Linguistic tools for texts in Japanese language
MIT License
309 stars 34 forks

Incorporate jmnedict database #3

Open tshatrov opened 9 years ago

tshatrov commented 9 years ago

Lately the few place names etc. that exist in jmdict are being moved to jmnedict. If this continues, ichi.moe won't be able to recognize names like Tokyo, which is unacceptable. We need to incorporate jmnedict names without messing up the segmenting algorithm. Kanji names should be top priority; katakana names are not important and can be ignored for now. Names should score lower than regular words so as not to pollute the results.

buster-blue commented 4 years ago

Any updates on this? I don't know much about databases, but I feel like this wouldn't be too hard to do and it would make the parser much more useful, since it wouldn't just break whenever it came across proper nouns anymore. I'm just curious because the issue is still open, but it's from 5 years ago. If you've just been too busy, that's fine, or maybe it's harder to do than I thought.

tshatrov commented 4 years ago

I decided not to do this because it would likely degrade segmenting a lot. Proper nouns can't be consistently romanized anyway. I'll be adding things that can be romanized, such as place names, separately. For example, I already added all municipalities that currently exist in Japan. I'll be looking for other databases that I can incorporate without breaking too much stuff. But regarding jmnedict integration, pull requests are by all means welcome.