Open tshatrov opened 9 years ago
Any updates on this? I don't know much about databases, but I feel like this wouldn't be too hard to do and it would make the parser much more useful, since it wouldn't just break whenever it came across proper nouns anymore. I'm just curious because the issue is still open, but it's from 5 years ago. If you've just been too busy, that's fine, or maybe it's harder to do than I thought.
I decided not to do this because it would likely degrade segmenting a lot. Proper nouns can't be consistently romanized anyway. I'll be adding things that can be romanized, such as place names, separately. For example, I have already added all municipalities that currently exist in Japan. I'll be looking for other databases that I can incorporate without breaking too much stuff. That said, pull requests for jmnedict integration are by all means welcome.
Lately, the few place names and similar entries that exist in jmdict are being moved to jmnedict. If this continues, ichi.moe won't be able to recognize words like Tokyo, which is unacceptable. We need to incorporate jmnedict names without messing up the segmenting algorithm. Kanji names should be the top priority; katakana names are not important and can be ignored for now. Name entries should score lower than regular words so as not to pollute the results.
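To make the scoring idea concrete, here is a rough sketch in Python. It is not ichiran's actual code. The `Candidate` class, the `NAME_SCORE_FACTOR` value, and the helper names are all hypothetical, chosen only to illustrate "kanji names only, scored below regular words".

```python
import re
from dataclasses import dataclass

# Hypothetical penalty: a name entry keeps only half of its base score.
NAME_SCORE_FACTOR = 0.5

# Katakana block (includes the long-vowel mark) and the main CJK ideograph block.
KATAKANA_ONLY = re.compile(r"^[\u30A0-\u30FF]+$")
HAS_KANJI = re.compile(r"[\u4E00-\u9FFF]")


@dataclass
class Candidate:
    text: str
    base_score: float
    is_name: bool = False  # True if the entry came from the names dictionary


def keep_name(text: str) -> bool:
    """Accept a name only if it contains kanji; katakana-only names are skipped."""
    return bool(HAS_KANJI.search(text)) and not KATAKANA_ONLY.match(text)


def score(c: Candidate) -> float:
    """Regular words keep their score; accepted names are penalized; others get 0."""
    if not c.is_name:
        return c.base_score
    if not keep_name(c.text):
        return 0.0
    return c.base_score * NAME_SCORE_FACTOR


def best(candidates):
    """Pick the highest-scoring candidate for one segment."""
    return max(candidates, key=score)
```

With this shape, a name like 東京 can still win over a shorter regular word, but it loses to a regular dictionary entry with the same base score, and katakana-only names never match at all. The actual factor would have to be tuned against real segmentation results.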