mocobeta / janome

Japanese morphological analysis engine written in pure Python
https://mocobeta.github.io/janome/en/
Apache License 2.0

Add progress indicator to UserDictionary #87

Closed uezo closed 4 years ago

uezo commented 4 years ago

Usage

Create an instance of SimpleProgressIndicator and pass it as the progress_handler argument, as shown below:

from janome.dic import UserDictionary, SimpleProgressIndicator
from janome import sysdic

# create progress indicator
progress_indicator = SimpleProgressIndicator(update_frequency=0.001)

# create user dictionary
user_dict = UserDictionary(
    user_dict="/path/to/userdict.csv",
    enc="utf8",
    type="ipadic",
    connections=sysdic.connections,
    progress_handler=progress_indicator  # pass progress indicator here
)

# save dictionary
user_dict.save("/path/to/save")

Progress is then printed while the dictionary is built:

Reading user dictionary from CSV: 100.0% | 152869/152869
Running create_minimum_transducer: 19.0% | 29000/152869
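
The output above follows from three callbacks a handler receives: on_start, on_progress and on_complete (the same methods the tqdm example in this description implements). As a rough, dependency-free sketch of that lifecycle: PrintingIndicator and run_steps are hypothetical names used only for illustration, and this is not how janome's SimpleProgressIndicator is actually implemented.

```python
import io


class PrintingIndicator:
    """Hypothetical handler: writes 'desc: pct% | value/total' lines."""

    def __init__(self, update_frequency=0.01, out=None):
        self.update_frequency = update_frequency
        # Write to an in-memory buffer by default so the output can be inspected.
        self.out = out if out is not None else io.StringIO()
        self.total = 0
        self.value = 0
        self.desc = "Processing"
        self.finished = False

    def on_start(self, total, value=0, desc=None):
        self.total, self.value = total, value
        self.desc = desc or "Processing"
        self.finished = False

    def on_progress(self, value=1):
        self.value += value
        # Report every `interval` steps, and always on the last step.
        interval = max(1, round(1 / self.update_frequency))
        if self.value % interval == 0 or self.value == self.total:
            pct = 100.0 * self.value / self.total
            self.out.write(f"{self.desc}: {pct:.1f}% | {self.value}/{self.total}\n")

    def on_complete(self):
        self.finished = True


def run_steps(handler, total, desc):
    """Hypothetical driver: calls the handler the way a long-running loop would."""
    handler.on_start(total=total, desc=desc)
    for _ in range(total):
        handler.on_progress()
    handler.on_complete()


indicator = PrintingIndicator(update_frequency=0.5)
run_steps(indicator, total=4, desc="Reading user dictionary from CSV")
report = indicator.out.getvalue().splitlines()
```

Using an integer interval rather than a floating-point check keeps the reporting condition exact; that is a design choice of this sketch, not something taken from the patch.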

Custom indicator

You can also use your own custom indicator. Here is an example using tqdm.

from janome.progress import ProgressHandler
from tqdm import tqdm

class TQDMProgressIndicator(ProgressHandler):
    def __init__(self, update_frequency=0.1, tqdm=None):
        self.update_frequency = update_frequency
        self.tqdm = tqdm
        self.total = None
        self.value = None
        self.desc = None

    def on_start(self, total, value=0, desc=None):
        self.total = total
        self.value = value
        self.desc = desc or "Processing"
        self.tqdm = None

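    def on_progress(self, value=1):
        self.value += value

        # create the bar lazily, on the first progress event
        if self.tqdm is None:
            self.tqdm = tqdm(total=self.total, desc=self.desc)

        # redraw only every (1 / update_frequency) steps;
        # float() avoids a TypeError when the product is an int
        if float(self.value * self.update_frequency).is_integer():
            self.tqdm.update(int(1 / self.update_frequency))

    def on_complete(self):
        # the bar does not exist if on_progress was never called
        if self.tqdm is not None:
            # snap the bar to the final count before closing it
            self.tqdm.n = self.value
            self.tqdm.update(0)
            self.tqdm.close()
        self.total = self.value = self.desc = self.tqdm = None

mocobeta commented 4 years ago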

Looks cool!

mocobeta commented 4 years ago

In general your patch looks good to me, thanks.

Also, I think it would be great if you could add some tests for the UserDictionary class in tests/test_dic.py. There are a few tests for it already; you could add more there using a mock ProgressHandler. This is optional and can be done later.

mocobeta commented 4 years ago

It seems the tests fail on Windows. I'll take a look. The CI pipelines on Linux are fine: https://travis-ci.org/github/mocobeta/janome/builds/722745085

mocobeta commented 4 years ago

@uezo, about the tests for UserDictionary: please add separate test methods for the indicators instead of modifying the existing test cases. Each test case should be kept as small as possible (with only one concern) for maintainability.
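
To illustrate the shape being asked for, here is a rough sketch with one small test method per concern, using unittest.mock.Mock as the mock handler. load_rows is a hypothetical stand-in for code that reports progress; it is not UserDictionary, and the real tests in tests/test_dic.py may look different.

```python
import unittest
from unittest.mock import Mock


def load_rows(rows, progress_handler=None):
    """Hypothetical stand-in for a loader that reports progress per row."""
    if progress_handler is not None:
        progress_handler.on_start(total=len(rows), desc="Reading rows")
    loaded = []
    for row in rows:
        loaded.append(row)
        if progress_handler is not None:
            progress_handler.on_progress()
    if progress_handler is not None:
        progress_handler.on_complete()
    return loaded


class TestProgressReporting(unittest.TestCase):
    # Each method checks exactly one callback, so a failure points at one concern.

    def test_on_start_receives_total(self):
        handler = Mock()
        load_rows(["a", "b", "c"], progress_handler=handler)
        handler.on_start.assert_called_once_with(total=3, desc="Reading rows")

    def test_on_progress_called_once_per_row(self):
        handler = Mock()
        load_rows(["a", "b", "c"], progress_handler=handler)
        self.assertEqual(handler.on_progress.call_count, 3)

    def test_on_complete_called_once(self):
        handler = Mock()
        load_rows(["a"], progress_handler=handler)
        handler.on_complete.assert_called_once_with()

    def test_works_without_handler(self):
        # The existing behavior stays covered by its own test.
        self.assertEqual(load_rows(["a", "b"]), ["a", "b"])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestProgressReporting)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```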

mocobeta commented 4 years ago

On Windows especially, the encoding option is mandatory. I also used a context manager to ensure the opened user dict file is closed. https://github.com/mocobeta/janome/pull/87/commits/c0b7ff427bbff1e6bca28614b88c5d671c06011d
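
For readers who do not open the commit, the pattern is roughly as follows. This is a sketch, not the code from that commit: read_user_dict_rows is a hypothetical helper and the CSV row is made up. Without an explicit encoding, open() falls back to the locale's preferred encoding, which on Windows is often not UTF-8.

```python
import csv
import os
import tempfile


def read_user_dict_rows(path, enc="utf8"):
    """Hypothetical helper: read CSV rows with an explicit encoding."""
    # The with-block closes the file even if parsing raises.
    with open(path, encoding=enc, newline="") as f:
        return [row for row in csv.reader(f)]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "userdict.csv")
    # Write a made-up one-line dictionary file in UTF-8.
    with open(path, "w", encoding="utf8", newline="") as f:
        f.write("東京スカイツリー,1288,1288,4569,名詞\n")
    rows = read_user_dict_rows(path, enc="utf8")
```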

mocobeta commented 4 years ago

Hi @uezo, I made a couple of fixes on the branch (using my owner privileges); please take a look. I think this is now ready to merge upstream.

uezo commented 4 years ago

Thank you for the fixes! Everything looks fine! Sorry for forgetting to close the file...

mocobeta commented 4 years ago

Also I updated the documentation: https://github.com/mocobeta/janome/pull/87/commits/806d6240952d63ed8f4231a909d1eb9db50f3670

(ja) Screenshot from 2020-09-06 23-01-30

(en) Screenshot from 2020-09-06 23-03-11

mocobeta commented 4 years ago

Thank you @uezo for the nice patch!