mocobeta / janome

Japanese morphological analysis engine written in pure Python
https://mocobeta.github.io/janome/en/
Apache License 2.0

Show progress when compiling user dictionary #86

Closed: uezo closed this issue 4 years ago

uezo commented 4 years ago

Compiling Neologd as a user dictionary takes a very long time. A progress indicator, especially while create_minimum_transducer is running, would help users decide whether to continue or abort the compilation.
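To illustrate the idea, here is a minimal sketch of a progress indicator wrapped around a long-running loop over dictionary entries. It is not janome's code: `with_progress`, its parameters, and the message format are hypothetical names chosen for this example, and nothing is assumed about how create_minimum_transducer consumes its input.

```python
import sys


def with_progress(items, total, step=1000, out=sys.stderr):
    """Yield items unchanged, reporting progress every `step` items.

    Hypothetical helper for illustration only; not part of janome.
    """
    for count, item in enumerate(items, start=1):
        if count % step == 0 or count == total:
            # "\r" rewrites the same terminal line instead of scrolling.
            out.write(f"\rprocessed {count}/{total} entries")
            out.flush()
        yield item
    out.write("\n")


if __name__ == "__main__":
    # Stand-in for a list of dictionary entries.
    entries = [f"entry-{i}" for i in range(5000)]
    for _ in with_progress(entries, total=len(entries)):
        pass  # the expensive per-entry work would go here
```

Writing to stderr keeps the progress output separate from anything the tool prints to stdout, and the `step` parameter limits how often the terminal is updated.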

I will send a pull request to address this issue. However, I am not confident that my approach is the best one from an architectural point of view.

mocobeta commented 4 years ago

As for neologd, I recommend using it as the "system dictionary" instead of adding all of its entries to a user dictionary. Neologd is too large for that: the entire user dictionary is loaded into the process's memory, so adding too many entries wastes a large amount of memory. The system dictionary is accessed via mmap, which is well suited to large dictionaries.
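The difference between the two access patterns can be sketched with Python's standard `mmap` module. This is a generic illustration under stated assumptions, not janome's dictionary loader: the function names and the demo file are made up for the example. Eager loading copies the whole file into the process, while a memory map lets the operating system page in only the parts that are actually read.

```python
import mmap
import os
import tempfile


def read_slice_eager(path, offset, length):
    """Load the entire file into memory, then slice it."""
    with open(path, "rb") as f:
        data = f.read()  # whole file is copied into the process
    return data[offset:offset + length]


def read_slice_mmap(path, offset, length):
    """Map the file and slice it; only the touched pages are read."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


def demo():
    """Write a small temporary file and read the same slice both ways."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"0123456789" * 1000)
        return read_slice_eager(path, 5, 10), read_slice_mmap(path, 5, 10)
    finally:
        os.remove(path)


if __name__ == "__main__":
    eager, mapped = demo()
    print(eager == mapped)
```

Both functions return the same bytes; the difference lies in how much of the file has to reside in the process's memory to get them.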

There is a procedure for building a neologd-based janome; see https://github.com/mocobeta/janome/wiki/(very-experimental)-NEologd-%E8%BE%9E%E6%9B%B8%E3%82%92%E5%86%85%E5%8C%85%E3%81%97%E3%81%9F-janome-%E3%82%92%E3%83%93%E3%83%AB%E3%83%89%E3%81%99%E3%82%8B%E6%96%B9%E6%B3%95. I also recently uploaded a custom janome package built with the latest neologd dictionary; the Google Drive link is available from the wiki.

mocobeta commented 4 years ago

I will look at your PR this weekend, thanks.

uezo commented 4 years ago

Thank you for your reply and for explaining how to use Neologd as a system dictionary. I understand now that if compilation takes so long that I need a progress indicator, I should not be using a user dictionary. If there is nothing else to discuss, it is okay to close this issue.

Lastly, I'm looking forward to your session at PyCon JP this weekend 😃