This is yet another rule-based morphological analysis tool. No built-in rules are provided; you will have to write some if you want to parse texts in your language. Uniparser-morph was developed primarily for under-resourced languages, which don't have enough data for training statistical parsers. Here's how it's different from other similar tools:
Although the language of uniparser-morph rules is certainly regular, the description is actually NOT entirely converted into an FST. Therefore, it's not nearly as fast as FST-based analyzers. The speed varies depending on the language structure and the hardware, but you can hardly expect to parse more than 20,000 words per second. For heavily polysynthetic languages, that figure can go as low as 200 words per second. So it's not really designed for industrial use.

The primary usage scenario I was thinking about is the following: a linguist describes the vocabulary and the morphology of an under-resourced language in uniparser-morph format (probably making use of existing digital dictionaries of that language). Of course, you can do other things with uniparser-morph, e.g. make it part of a more complex NLP pipeline; just make sure low speed is not an issue in your case.
uniparser-morph is distributed under the MIT license (see LICENSE).
Import the Analyzer class from the package. Here is a basic usage example:
```python
from uniparser_morph import Analyzer
a = Analyzer()

# Put your grammar files in the current folder or set paths
# as properties of the Analyzer class (see below)
a.load_grammar()

# The parser is initialized before first use, so expect some delay here
# (usually several seconds). You will get a list of Wordform objects.
analyses = a.analyze_words('Морфологиез')

# You can also pass lists (even nested lists) and specify
# the output format ('xml' or 'json'):
analyses = a.analyze_words([['А'], ['Мон', 'тонэ', 'яратӥсько', '.']], format='xml')
analyses = a.analyze_words(['Морфологиез', [['А'], ['Мон', 'тонэ', 'яратӥсько', '.']]], format='json')
```
If you need to parse a frequency list, use analyze_wordlist() instead.
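A frequency list here is simply a mapping from word types to their counts in a corpus. As a minimal sketch of how you might build one before handing it to the analyzer (the whitespace tokenization is my simplifying assumption and may be too crude for your language):

```python
from collections import Counter

def build_freq_list(text):
    # Count word types; naive lowercasing and whitespace splitting,
    # adjust to your language's tokenization needs.
    return Counter(text.lower().split())

freqs = build_freq_list('мон тонэ яратӥсько мон')
# freqs['мон'] == 2
```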
See the documentation for the full list of options.
If you want to create a uniparser-morph analyzer for your language, you will have to write a set of rules that describe the vocabulary and the morphology of your language in uniparser-morph format. For a description of the format, refer to the documentation.
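To give a rough idea of what the rules look like: dictionary entries are small key-value records. The fragment below is illustrative only; the field names and values reflect my reading of the format and may not match your grammar, so treat the documentation as the authoritative reference:

```
-lexeme
 lex: кар
 stem: кар.
 gramm: N
 paradigm: noun_paradigm
```

As I understand the format, the paradigm field links the lexeme to a set of inflection rules defined elsewhere.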
If you have disambiguation rules in the Constraint Grammar format, you can use them in the following way when calling analyze_words():

```python
import os

analyses = a.analyze_words(['Мон', 'морфологиез', 'яратӥсько', '.'],
                           cgFile=os.path.abspath('disambiguation.cg3'),
                           disambiguate=True)
```
In order for this to work, you have to install the cg3 executable separately. On Ubuntu/Debian, you can use apt-get:

```bash
sudo apt-get install cg3
```
On Windows, download the cg3 binary and add its location to the PATH environment variable. See the documentation for other options.
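Before enabling disambiguation, it can be useful to verify that the cg3 executable is actually reachable from Python. A small standard-library sketch (the helper name is mine, not part of uniparser-morph):

```python
import shutil

def cg3_available():
    # shutil.which returns the full path to the executable
    # if it is on PATH, and None otherwise.
    return shutil.which('cg3')

if cg3_available() is None:
    print('cg3 not found on PATH; install it before using disambiguate=True')
```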
Note that each time you call analyze_words() with disambiguate=True, the CG grammar is loaded and compiled from scratch, which makes the analysis even slower. If you are analyzing a large text, it therefore makes sense to pass the entire text in a single function call rather than process it sentence by sentence.
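For example, instead of one analyze_words() call per sentence, you can split the whole text into a nested list of sentences up front and pass it in a single call. A naive splitting sketch (the regex-based tokenization is my illustration, not something provided by uniparser-morph):

```python
import re

def tokenize_text(text):
    # Split into sentences on end punctuation, then into word and
    # punctuation tokens; crude, for illustration only.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [re.findall(r'\w+|[^\w\s]', s) for s in sentences]

tokens = tokenize_text('Мон тонэ яратӥсько. А мон?')
# tokens == [['Мон', 'тонэ', 'яратӥсько', '.'], ['А', 'мон', '?']]
# analyses = a.analyze_words(tokens)  # one call for the whole text
```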