reynoldsnlp / udar

UDAR Does Accented Russian: A finite-state morphological analyzer of Russian that handles stressed wordforms.
GNU General Public License v3.0

UDAR(enie)



A Python wrapper for the Russian finite-state transducer originally described in chapter 2 of my dissertation.

If you use this work in your research please cite the following:


Reynolds, Robert J. "Russian natural language processing for computer-assisted language learning: capturing the benefits of deep morphological analysis in real-life applications" PhD Diss., UiT–The Arctic University of Norway, 2016. https://hdl.handle.net/10037/9685

Feature requests, issues, and pull requests are welcome!

Dependencies

For all features to be available, you should have hfst and vislcg3 installed as command-line utilities. Specifically, hfst is needed for FST-based tokenization, and vislcg3 is needed for grammatical disambiguation. The version used to successfully test the code is included in each commit in this file. The recommended method for installing these dependencies is as follows:

Debian / Ubuntu

$ wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
$ sudo apt-get install cg3 hfst python3-hfst

MacOS (Python 3.6/3.7 only)

On MacOS, one of udar's dependencies, the python package hfst, is not currently available for Python 3.8+. Hopefully, this will be remedied soon.

$ curl https://apertium.projectjj.com/osx/install-nightly.sh | sudo bash
$ python3 -m pip install hfst
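After installation, one quick way to confirm that the command-line tools are visible to Python is to look them up on the PATH. This is only a sketch: the executable names `hfst-tokenize` and `vislcg3` are assumptions about what the packages above install, so adjust them if your system names the binaries differently.

```python
import shutil

# Assumed executable names; they may differ on your system.
REQUIRED_TOOLS = ("hfst-tokenize", "vislcg3")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_tools().items():
    print(f"{name}: {path or 'NOT FOUND'}")
```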

Installation

This package can be installed from PyPI using the usual...

$ python3 -m pip install --user udar

...or directly from this repository using...

$ python3 -m pip install --user git+https://github.com/reynoldsnlp/udar

Introduction

NB! Documentation is currently limited to docstrings. I recommend that you use help() frequently to see how to use classes and methods. For example, to see what options are available for building a Document, try help(Document).

The most common use-case is to use the Document constructor to automatically tokenize and analyze a text. If you print() a Document object, the result is an XFST/HFST stream:

import udar
doc1 = udar.Document('Мы удивились простоте системы.')
print(doc1)
# Мы    мы+Pron+Pers+Pl1+Nom    0.000000
#
# удивились удивиться+V+Perf+IV+Pst+MFN+Pl  5.078125
#
# простоте  простота+N+Fem+Inan+Sg+Dat  4.210938
# простоте  простота+N+Fem+Inan+Sg+Loc  4.210938
#
# системы   система+N+Fem+Inan+Pl+Acc   5.429688
# системы   система+N+Fem+Inan+Pl+Nom   5.429688
# системы   система+N+Fem+Inan+Sg+Gen   5.429688
#
# . .+CLB   0.000000
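If you need to post-process the printed stream outside of udar, it can be parsed with a few lines of plain Python. The sketch below assumes that each line has three whitespace-separated fields (wordform, reading, weight) and that a blank line separates tokens, as in the output above; `parse_hfst_stream` is a hypothetical helper, not part of udar, and it would need adjusting for tokens that contain spaces.

```python
def parse_hfst_stream(stream):
    """Group a printed analysis stream into (token, [(reading, weight), ...]) pairs."""
    cohorts = []
    current = None
    for line in stream.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the current token's readings
            current = None
            continue
        surface, reading, weight = line.split()
        if current is None:
            current = (surface, [])
            cohorts.append(current)
        current[1].append((reading, float(weight)))
    return cohorts


# Part of the output shown above, with tab-separated fields.
sample = (
    "Мы\tмы+Pron+Pers+Pl1+Nom\t0.000000\n"
    "\n"
    "простоте\tпростота+N+Fem+Inan+Sg+Dat\t4.210938\n"
    "простоте\tпростота+N+Fem+Inan+Sg+Loc\t4.210938\n"
    "\n"
    "системы\tсистема+N+Fem+Inan+Pl+Acc\t5.429688\n"
    "системы\tсистема+N+Fem+Inan+Pl+Nom\t5.429688\n"
    "системы\tсистема+N+Fem+Inan+Sg+Gen\t5.429688\n"
)

cohorts = parse_hfst_stream(sample)
ambiguous = [token for token, readings in cohorts if len(readings) > 1]
print(ambiguous)
```

Tokens with more than one reading are the ones that disambiguation would try to narrow down.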

Passing the argument disambiguate=True, or calling doc1.disambiguate() after the fact, runs a Constraint Grammar to remove as many ambiguous readings as possible. This grammar is far from complete, so some ambiguous readings will remain.

Data objects

Document object

Property Type Description
text str Original text of this document
sentences List[Sentence] List of sentences in this document
num_tokens int Number of tokens in this document
features tuple udar.features.FeatureExtractor stores extracted features here

Document objects have convenient methods for adding stress or converting to phonetic transcription.

Method Return type Description
stressed str The original text of the document with stress marks
phonetic str The original text converted to phonetic transcription
transliterate str The original text converted to Romanized Cyrillic (default=Scholarly)
disambiguate None Disambiguate readings using the Constraint Grammar
cg3_str str Analysis stream in the VISL-CG3 format
from_cg3 Document Create Document from VISL-CG3 format stream
hfst_str str Analysis stream in the XFST/HFST format
from_hfst Document Create Document from XFST/HFST format stream
to_dict list Convert to a complex list object
to_html str Convert to HTML with markup in data- attributes
to_json str Convert to a JSON string

Examples

stressed_doc1 = doc1.stressed()
print(stressed_doc1)
# Мы́ удиви́лись простоте́ систе́мы.

ambig_doc = udar.Document('Твои слова ничего не значат.', disambiguate=True)
print(sorted(ambig_doc[1].stresses()))  # Note that слова is still ambiguous
# ['сло́ва', 'слова́']

print(ambig_doc.stressed(selection='safe'))  # 'safe' skips сло́ва and слова́
# Твои́ слова ничего́ не зна́чат.
print(ambig_doc.stressed(selection='all'))  # 'all' combines сло́ва and слова́
# Твои́ сло́ва́ ничего́ не зна́чат.
print(ambig_doc.stressed(selection='rand') in {'Твои́ сло́ва ничего́ не зна́чат.', 'Твои́ слова́ ничего́ не зна́чат.'})  # 'rand' randomly chooses between сло́ва and слова́
# True

phonetic_doc1 = doc1.phonetic()
print(phonetic_doc1)
# мы́ уд'ив'и́л'ис' пръстʌт'э́ с'ис'т'э́мы.
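The behavior of the three selection values shown above for ambiguous words can be imitated in plain Python. This is a sketch of the observable behavior only, not udar's implementation; `split_stress` and `choose_stress` are hypothetical helpers, and the stress mark is assumed to be the combining acute accent (U+0301).

```python
import random

ACUTE = "\u0301"  # combining acute accent, used here as the stress mark


def split_stress(word):
    """Return (word without stress marks, set of indices of stressed letters)."""
    letters, stressed = [], set()
    for ch in word:
        if ch == ACUTE:
            stressed.add(len(letters) - 1)
        else:
            letters.append(ch)
    return "".join(letters), stressed


def choose_stress(variants, selection="safe", rng=random):
    """Pick one output form for a set of stressed variants of the same word."""
    variants = sorted(set(variants))
    if not variants:
        raise ValueError("at least one variant is required")
    if len(variants) == 1:  # unambiguous: every selection agrees
        return variants[0]
    parsed = [split_stress(v) for v in variants]
    base = parsed[0][0]
    if any(b != base for b, _ in parsed):
        raise ValueError("variants must differ only in stress placement")
    if selection == "safe":  # leave ambiguous words unstressed
        return base
    if selection == "all":  # mark every position stressed in any variant
        marks = set().union(*(s for _, s in parsed))
        return "".join(ch + ACUTE if i in marks else ch for i, ch in enumerate(base))
    if selection == "rand":  # pick one variant at random
        return rng.choice(variants)
    raise ValueError(f"unknown selection: {selection!r}")


slova = {"сло" + ACUTE + "ва", "слова" + ACUTE}
print(choose_stress(slova, "safe"))
print(choose_stress(slova, "all"))
```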

Sentence object

Property Type Description
doc Document "Back pointer" to the parent document of this sentence
text str Original text of this sentence
tokens List[Token] The list of tokens in this sentence
id str (optional) Sentence id, if assigned at creation

Method Return type Description
stressed str The original text of the sentence with stress marks
phonetic str The original text converted to phonetic transcription
transliterate str The original text converted to Romanized Cyrillic (default=Scholarly)
disambiguate None Disambiguate readings using the Constraint Grammar
cg3_str str Analysis stream in the VISL-CG3 format
from_cg3 Sentence Create Sentence from VISL-CG3 format stream
hfst_str str Analysis stream in the XFST/HFST format
from_hfst Sentence Create Sentence from XFST/HFST format stream
to_dict list Convert to a complex list object
to_html str Convert to HTML with markup in data- attributes
to_json str Convert to a JSON string

Token object

Property Type Description
id str The index of this token in the sentence, 1-based
text str The original text of this token
misc str Miscellaneous annotations with regard to this token
lemmas Set[str] All possible lemmas, based on remaining readings
readings List[Reading] List of readings not removed by the Constraint Grammar
removed_readings List[Reading] List of readings removed by the Constraint Grammar
head int The id of the syntactic head of this token in the sentence, 1-based (0 is reserved for an artificial symbol that represents the root of the syntactic tree).
deprel str The dependency relation between this word and its syntactic head. Example: ‘nmod’.

Method Return type Description
stresses Set[str] All possible stressed wordforms, based on remaining readings
stressed str The original text of the token with stress marks
phonetic str The original text converted to phonetic transcription
most_likely_reading Reading "Most likely" reading (may be partially random selection)
most_likely_lemmas List[str] List of lemma(s) from the "most likely" reading
transliterate str The original text converted to Romanized Cyrillic (default=Scholarly)
force_disambiguate None Fully disambiguate readings using methods other than the Constraint Grammar
cg3_str str Analysis stream in the VISL-CG3 format
hfst_str str Analysis stream in the XFST/HFST format
to_dict dict Convert to a dict object
to_html str Convert to HTML with markup in data- attributes
to_json str Convert to a JSON string

Reading object

Property Type Description
subreadings List[Subreading] Usually only one subreading, but multiple subreadings are possible for complex Tokens.
lemmas List[str] Lemmas from all subreadings
grouped_tags List[Tag] The part-of-speech, morphosyntactic, semantic and other tags from all subreadings
weight str Weight indicating the likelihood of the reading, without respect to context
cg_rule str Reference to the rule in the constraint grammar that removed/selected/etc. this reading. If no action has been taken on this reading, then ''.
is_most_likely bool Indicates whether this reading has been selected as the most likely reading of its Token. Note that some selection methods may be at least partially random.

Method Return type Description
cg3_str str Analysis stream in the VISL-CG3 format
hfst_str str Analysis stream in the XFST/HFST format
generate str Generate the wordform from this reading
replace_tag None Replace a tag in this reading
does_not_conflict bool Determine whether reading from external tagset (e.g. Universal Dependencies) conflicts with this reading
to_dict list Convert to a list object
to_json str Convert to a JSON string

Subreading object

Property Type Description
lemma str The lemma of the subreading
tags List[Tag] The part-of-speech, morphosyntactic, semantic and other tags
tagset Set[Tag] Same as tags, but for faster membership testing (in Reading)

Method Return type Description
cg3_str str Analysis stream in the VISL-CG3 format
hfst_str str Analysis stream in the XFST/HFST format
replace_tag None Replace a tag in this subreading
to_dict dict Convert to a dict object
to_json str Convert to a JSON string

Tag object

Property Type Description
name str The name of this tag
ms_feat str Morphosyntactic feature that this tag is associated with (e.g. Dat has ms_feat CASE)
detail str Description of the tag's purpose or meaning
is_L2_error bool Whether this tag indicates a second-language learner error

Method Return type Description
info str Alias for Tag.detail

Convenience functions

A number of functions are included, both for convenience and as concrete examples of how to use the API.

noun_distractors()

This function generates all six case forms of a given noun, in the same number as the input: a singular noun yields singular forms, and a plural noun yields plural forms. The result is a set, so forms that coincide are collapsed (in the examples below, the nominative and accusative are identical, leaving five distinct forms). Such a set can be used in a multiple-choice exercise, hence the name distractors.

sg_paradigm = udar.noun_distractors('словом')
print(sg_paradigm == {'сло́ву', 'сло́ве', 'сло́вом', 'сло́ва', 'сло́во'})
# True

pl_paradigm = udar.noun_distractors('словах')
print(pl_paradigm == {'слова́м', 'слова́', 'слова́х', 'слова́ми', 'сло́в'})
# True

If unstressed forms are desired, simply pass the argument stressed=False.
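As a rough illustration of the multiple-choice use case, the sketch below turns a set of forms into a shuffled list of options. It does not call udar: the paradigm is hard-coded from the singular example above with stress marks removed, and `make_multiple_choice` and the prompt sentence are hypothetical.

```python
import random


def make_multiple_choice(prompt, answer, distractors, rng=None):
    """Return (prompt, options), with the answer shuffled in among the distractors."""
    rng = rng or random.Random()
    # Sort before shuffling so that a seeded rng gives reproducible output.
    options = sorted(set(distractors) | {answer})
    rng.shuffle(options)
    return prompt, options


# Forms copied from the singular example above, without stress marks.
paradigm = {"слову", "слове", "словом", "слова", "слово"}

prompt, options = make_multiple_choice(
    "Он ответил одним ____.", "словом", paradigm, rng=random.Random(0)
)
print(prompt)
for number, option in enumerate(options, start=1):
    print(f"{number}. {option}")
```

Passing a seeded `random.Random` keeps the option order stable across runs, which is convenient when generating exercises that must be regraded later.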

diagnose_L2()

This function takes a text string and returns a dictionary that maps each type of L2 error found in the text to the set of wordforms exhibiting that error.

diag = udar.diagnose_L2('Етот малчик говорит по-русски.')
print(diag == {'Err/L2_e2je': {'Етот'}, 'Err/L2_NoSS': {'малчик'}})
# True
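The returned dictionary is keyed by error tag. For learner feedback it can be handy to invert it so that each word maps to its error tags. The sketch below works on the dictionary from the example above (hard-coded here, so it runs without udar); `errors_by_word` is a hypothetical helper.

```python
from collections import defaultdict


def errors_by_word(diag):
    """Invert {error_tag: {words}} into {word: [error_tags]}, with tags sorted."""
    inverted = defaultdict(list)
    for tag, words in diag.items():
        for word in words:
            inverted[word].append(tag)
    return {word: sorted(tags) for word, tags in inverted.items()}


# The result shown in the example above.
diag = {"Err/L2_e2je": {"Етот"}, "Err/L2_NoSS": {"малчик"}}

for word, tags in sorted(errors_by_word(diag).items()):
    print(word, tags)
```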

tag_info()

This function will look up the meaning of any tag used by the analyzer.

print(udar.tag_info('Err/L2_ii'))
# L2 error: Failure to change ending ие to ии in +Sg+Loc or +Sg+Dat, e.g. к Марие, о кафетерие, о знание

Using the transducers manually

The transducers come in two varieties: the Analyzer class and the Generator class. For memory efficiency, I recommend using the get_analyzer and get_generator functions, which ensure that each flavor of the transducers remains a singleton in memory.

Analyzer

The Analyzer can be initialized with or without analyses for second-language learner errors using the keyword L2_errors.

analyzer = udar.get_analyzer()  # by default, L2_errors is False
L2_analyzer = udar.get_analyzer(L2_errors=True)

Analyzers are callable. They take a token str and return a sequence of reading/weight tuples.

raw_readings1 = analyzer('сло́ва')
print(raw_readings1)
# (('слово+N+Neu+Inan+Sg+Gen', 5.9755859375),)

raw_readings2 = analyzer('слова')
print(raw_readings2)
# (('слово+N+Neu+Inan+Pl+Acc', 5.9755859375), ('слово+N+Neu+Inan+Pl+Nom', 5.9755859375), ('слово+N+Neu+Inan+Sg+Gen', 5.9755859375))
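The raw tuples are easy to take apart by hand when you do not need full Reading objects. The sketch below hard-codes the output shown above so that it runs without udar, and it assumes simple readings in which the lemma contains no `+` and there is only one subreading; `split_reading` is a hypothetical helper.

```python
# Output of analyzer('слова') as shown above.
raw_readings = (
    ("слово+N+Neu+Inan+Pl+Acc", 5.9755859375),
    ("слово+N+Neu+Inan+Pl+Nom", 5.9755859375),
    ("слово+N+Neu+Inan+Sg+Gen", 5.9755859375),
)


def split_reading(raw):
    """Split 'lemma+Tag1+Tag2' into (lemma, [tags]); assumes no '+' in the lemma."""
    lemma, *tags = raw.split("+")
    return lemma, tags


lemmas = {split_reading(raw)[0] for raw, _ in raw_readings}
plural = [raw for raw, _ in raw_readings if "Pl" in split_reading(raw)[1]]
print(lemmas)
print(plural)
```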

Generator

The Generator can be initialized in three varieties: unstressed, stressed, and phonetic.

generator = udar.get_generator()  # unstressed by default
stressed_generator = udar.get_generator(stressed=True)
phonetic_generator = udar.get_generator(phonetic=True)

Generators are callable. They take a Reading or raw reading str and return a surface form.

print(stressed_generator('слово+N+Neu+Inan+Pl+Nom'))
# слова́

Working with Tokens and Readings

You can easily check if a morphosyntactic tag is in a Token, Reading, or Subreading using in:

token2 = udar.Token('слова', analyze=True)
print(token2)
# слова [слово_N_Neu_Inan_Pl_Acc  слово_N_Neu_Inan_Pl_Nom  слово_N_Neu_Inan_Sg_Gen]

print('Gen' in token2)  # do any of the readings include Genitive case?
# True

print('слово' in token2)  # does not work for lemmas; use `in Token.lemmas`
# False

print('слово' in token2.lemmas)
# True

You can make a filtered list of a Token's readings using the following idiom:

pl_readings = [reading for reading in token2 if 'Pl' in reading]
print(pl_readings)
# [Reading(слово+N+Neu+Inan+Pl+Acc, 5.975586, ), Reading(слово+N+Neu+Inan+Pl+Nom, 5.975586, )]
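For readers curious how this kind of `in` test and iteration can be supported, the sketch below reproduces the behavior shown above with two tiny stand-in classes built on Python's `__contains__` and `__iter__` hooks. `MiniToken` and `MiniReading` are hypothetical and much simpler than udar's Token and Reading; they say nothing about how udar implements these classes internally.

```python
class MiniReading:
    """Stand-in for a reading: a lemma plus a list of tag names."""

    def __init__(self, raw):
        self.raw = raw
        self.lemma, *self.tags = raw.split("+")

    def __contains__(self, tag):
        return tag in self.tags

    def __repr__(self):
        return f"MiniReading({self.raw})"


class MiniToken:
    """Stand-in for a token: a wordform plus its candidate readings."""

    def __init__(self, text, raw_readings):
        self.text = text
        self.readings = [MiniReading(raw) for raw in raw_readings]

    @property
    def lemmas(self):
        return {reading.lemma for reading in self.readings}

    def __iter__(self):
        return iter(self.readings)

    def __contains__(self, tag):
        # True if any reading carries the tag; lemmas are deliberately not checked.
        return any(tag in reading for reading in self.readings)


token = MiniToken("слова", [
    "слово+N+Neu+Inan+Pl+Acc",
    "слово+N+Neu+Inan+Pl+Nom",
    "слово+N+Neu+Inan+Sg+Gen",
])
print("Gen" in token, "слово" in token, "слово" in token.lemmas)
print([reading for reading in token if "Pl" in reading])
```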

Related projects

Finite-state tools

Russian morphological analysis