
uniparser-morph

This is yet another rule-based morphological analysis tool. No built-in rules are provided; you will have to write some if you want to parse texts in your language. Uniparser-morph was developed primarily for under-resourced languages, which don't have enough data for training statistical parsers. Here's how it's different from other similar tools:

The primary usage scenario I was thinking about is the following:

Of course, you can do other things with uniparser-morph, e.g. make it part of a more complex NLP pipeline; just make sure its low speed is not an issue in your case.

uniparser-morph is distributed under the MIT license (see LICENSE).

Usage

Import the Analyzer class from the package. Here is a basic usage example:

from uniparser_morph import Analyzer
a = Analyzer()

# Put your grammar files in the current folder or set the paths as properties of the Analyzer class (see the documentation)
a.load_grammar()

analyses = a.analyze_words('Морфологиез')
# The analyzer is initialized on the first call, so expect some delay here (usually several seconds)
# You will get a list of Wordform objects

# You can also pass lists (even nested lists) and specify output format ('xml' or 'json'):
analyses = a.analyze_words([['А'], ['Мон', 'тонэ', 'яратӥсько', '.']], format='xml')
analyses = a.analyze_words(['Морфологиез', [['А'], ['Мон', 'тонэ', 'яратӥсько', '.']]], format='json')
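
If you want to work with the results directly in Python rather than serialize them, you can read the attributes of each returned Wordform. Here is a minimal sketch; the attribute names (wf, lemma, gramm, gloss) are my assumption about the Wordform interface, so check the documentation if they differ:

# Print the surface form, lemma, grammatical tags and gloss of each analysis
# (attribute names assumed, see above)
for ana in a.analyze_words('Морфологиез'):
    print(ana.wf, ana.lemma, ana.gramm, ana.gloss)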

If you need to parse a frequency list, use analyze_wordlist() instead.
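
A hedged sketch of how such a call might look; the parameter names (freqListFile, parsedFile, unparsedFile) are assumptions on my part, so verify them against the documentation before use:

# Read a frequency list and write analyzed and unanalyzed words to separate
# files (parameter names are assumed, see above)
a.analyze_wordlist(freqListFile='wordlist.csv',
                   parsedFile='analyzed.txt',
                   unparsedFile='unanalyzed.txt')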

See the documentation for the full list of options.

Format

If you want to create a uniparser-morph analyzer for your language, you will have to write a set of rules that describe the vocabulary and morphology of your language in the uniparser-morph format. For a description of the format, refer to the documentation.

Disambiguation with CG

If you have disambiguation rules in the Constraint Grammar format, you can use them in the following way when calling analyze_words():

import os

analyses = a.analyze_words(['Мон', 'морфологиез', 'яратӥсько', '.'],
                           cgFile=os.path.abspath('disambiguation.cg3'),
                           disambiguate=True)

In order for this to work, you have to install the cg3 executable separately. On Ubuntu/Debian, you can use apt-get:

sudo apt-get install cg3

On Windows, download the cg3 binary and add the folder containing it to the PATH environment variable. See the documentation for other options.
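
If you are not sure whether the executable is visible to Python, a quick standard-library check (not part of uniparser-morph itself) can save some debugging:

import shutil

# shutil.which() returns the full path of the executable if it is on PATH,
# or None otherwise
if shutil.which('cg3') is None:
    print('cg3 was not found on PATH; disambiguation will fail')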

Note that each time you call analyze_words() with disambiguate=True, the CG grammar is loaded and compiled from scratch, which makes the analysis even slower. If you are analyzing a large text, it is therefore better for performance to pass the entire text in a single function call rather than analyze it sentence by sentence.
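
For example, instead of looping over sentences, you can collect the tokenized sentences into one nested list and disambiguate everything in a single call (a sketch reusing the tokens from the examples above as placeholders):

sentences = [
    ['Мон', 'морфологиез', 'яратӥсько', '.'],
    ['А'],
    ['Мон', 'тонэ', 'яратӥсько', '.'],
]
# One call loads and compiles the CG grammar only once for the whole text
analyses = a.analyze_words(sentences,
                           cgFile=os.path.abspath('disambiguation.cg3'),
                           disambiguate=True)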