How to use the analyzer within the python code?

google-research / turkish-morphology

A two-level morphological analyzer for Turkish.

Apache License 2.0


How to use the analyzer within the python code? #1

Closed by andmek 4 years ago

andmek commented 5 years ago

How can I use the analyzer from within Python code, with something like print(analyze('geldiğinde'))?

ozturel commented 5 years ago

Right now we do not have a top-level Python API that returns the analyses with a straightforward function call. This is a TODO, though, and it will be added.

Until then you can use the print_analyses script to get the analyzer output (see how it's done in turkish-morphology/analyzer/evaluator/evaluate.py).

melanuria commented 4 years ago

Using the print_analyses script from within Python seems to be quite slow (around one word-form per second). Is there a faster way of analyzing word-forms?

Note: I'm using a very old computer (Intel Core 2 Duo, 8 GB of RAM, running Lubuntu 19.10) and calling something like subprocess.check_output(".../bazel-bin/scripts/print_analyses --word=kestiriyorduk").

ozturel commented 4 years ago

Yes, running the script will be slow if you use it to analyze words in bulk. It is intended for casual, one-off analysis.

Calling print_analyses through subprocess.check_output is especially slow, because the script reads and loads the FAR that contains the morphological analyzer FST for every word, on top of the overhead of starting a new process each time. You could modify the script to accept more than one input word and output the analyses for all of them in bulk. That would just be a hack, though, and I am not sure it would be a convenient solution for your use case.
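To make the cost described above concrete, here is a minimal sketch of the per-word subprocess approach. The binary path is hypothetical (the thread elides the real prefix), the UTF-8 decoding is an assumption, and analyze_unique() is an illustrative helper that is not part of the repository. Only the --word flag comes from the thread.

```python
import subprocess
from typing import Dict, Iterable, List

# Hypothetical location of the built script; the thread elides the real prefix.
PRINT_ANALYSES = "bazel-bin/scripts/print_analyses"


def build_command(word: str, binary: str = PRINT_ANALYSES) -> List[str]:
    """Builds the argument list for one print_analyses invocation."""
    return [binary, f"--word={word}"]


def analyze_with_subprocess(word: str, binary: str = PRINT_ANALYSES) -> str:
    """Runs print_analyses once and returns its raw standard output.

    Every call starts a new process, which loads the FAR again.
    """
    output = subprocess.check_output(build_command(word, binary))
    # Assumption: the script writes UTF-8 text to standard output.
    return output.decode("utf-8")


def analyze_unique(words: Iterable[str],
                   binary: str = PRINT_ANALYSES) -> Dict[str, str]:
    """Illustrative helper: invokes the script once per distinct word."""
    results: Dict[str, str] = {}
    for word in words:
        if word not in results:
            results[word] = analyze_with_subprocess(word, binary)
    return results
```

Skipping repeated words reduces the number of process launches, but each distinct word still pays the full startup and FAR loading cost, which is why this approach stays slow for bulk analysis.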

In any case, please subscribe to this issue. We will soon push a native Python API to this repo that will have functions to run the analyzer from Python source.

ozturel commented 4 years ago

There is now a Python API (the surface_form() function in //lib:analyze.py), which you can use to run the analyzer over Turkish words from within Python code.

Please see //scripts/print_analyses.py for an example use case. The _evaluate() function in //scripts/evaluate_analyzer.py also shows a use case, with parallelization over multiple CPUs.

We are planning to expand the API and also to make this project available on PyPI. Therefore, I'm not closing this issue for now.

ozturel commented 4 years ago

The API is now also released on PyPI. You can find the installation notes in the README.
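As a rough illustration of the API mentioned in this thread, the sketch below analyzes a list of words in a single Python process. It assumes that the installed package exposes surface_form() as turkish_morphology.analyze.surface_form and that the function returns a sequence of analysis strings. Both points are assumptions to check against the README. load_analyzer() and analyze_words() are hypothetical helper names, not part of the library.

```python
from typing import Callable, Dict, Iterable, List, Sequence


def load_analyzer() -> Callable[[str], Sequence[str]]:
    """Returns the analyzer function from the installed package.

    Assumption: the module is importable as turkish_morphology.analyze.
    The import is done lazily so the rest of this sketch runs without it.
    """
    from turkish_morphology import analyze
    return analyze.surface_form


def analyze_words(
    words: Iterable[str],
    analyzer: Callable[[str], Sequence[str]],
) -> Dict[str, List[str]]:
    """Analyzes each distinct word once and maps it to its analyses."""
    results: Dict[str, List[str]] = {}
    for word in words:
        if word not in results:
            results[word] = list(analyzer(word))
    return results
```

With the package installed, analyze_words(["geldiğinde", "kestiriyorduk"], load_analyzer()) would run both words in the same process instead of launching print_analyses once per word.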