reynoldsnlp / udar

UDAR Does Accented Russian: A finite-state morphological analyzer of Russian that handles stressed wordforms.
GNU General Public License v3.0

an example of ambiguity resolving in README #41

Closed katyamineeva closed 4 years ago

katyamineeva commented 4 years ago

Hi!

I think an example of ambiguity resolution might be helpful. For instance:

import udar

doc1 = udar.Document('Мне недостаточно просто твоего честного слова.')
doc2 = udar.Document('Красивые слова!')
doc3 = udar.Document('Твои слова ничего не значат.')

samples = [doc1, doc2, doc3]

for doc in samples:
  doc.disambiguate()
  print(doc.stressed())

prints out

Мне́ недоста́точно про́сто твоего́ честного сло́ва.
Краси́вые слова́!
Твои́ слова ничего́ не зна́чат.

So, in the first and second sentences the ambiguity was resolved correctly, but it remains in the third one. It's also not clear that some words may remain unstressed after calling the disambiguate method (and no warning message is printed). At first, I tried your code with sentences where the disambiguate method doesn't change anything, and I thought this was a mistake or that the code was incomplete.

And thank you for your work!

reynoldsnlp commented 4 years ago

@katyamineeva thanks for the examples. This is something that I just need to explain better in the README (or in some real documentation, when I get to that stage). The issue is that disambiguate() (or Document(input, disambiguate=True)) is not guaranteed to disambiguate all readings of a token. You can easily see which tokens are still ambiguous by doing print(doc), and you'll see that in the last Document, there are still 3 possible readings for слова. In other words, our constraint grammar still has a long way to go.

As for adding stress to ambiguous tokens, the default is doc.stressed(selection='safe'), which abstains from adding stress to tokens that have ambiguous stress. You can use doc.stressed(selection='rand') or doc.stressed(selection='all') to make sure the in-vocabulary words are marked with stress, even if they are still ambiguous. For out-of-vocabulary words, you can add guess=True, e.g. doc.stressed(selection='all', guess=True), to allow an "intelligent" algorithm to guess where to put stress. (The algorithm is not actually very good, but it's better than nothing.)

I just barely pushed a fix for the 'all' method to the master branch, so be sure to pull the latest if you want to use that.

>>> import udar
>>>
>>> doc1 = udar.Document('Мне недостаточно просто твоего честного слова.')
>>> doc2 = udar.Document('Красивые слова!')
>>> doc3 = udar.Document('Твои слова ничего не значат.')
>>>
>>> samples = [doc1, doc2, doc3]
>>>
>>> for doc in samples:
...   doc.disambiguate()
...   print(doc.stressed(selection='all'))
...
Мне́ недоста́точно про́сто твоего́ че́стно́го сло́ва.
Краси́вые слова́!
Твои́ сло́ва́ ничего́ не зна́чат.

Note that in doc3, you get 'сло́ва́'.
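
To make the difference between the selection modes concrete, here is a toy sketch of how 'safe', 'rand', and 'all' could behave for a single token whose remaining readings disagree about stress. This is not udar's actual implementation: the helpers stress_positions, apply_stress, and choose_stress are hypothetical names made up for illustration, and the behaviour of 'rand' (picking one reading at random) is an assumption based on its name.

```python
import random

ACUTE = "\u0301"  # combining acute accent, used here to mark stress


def stress_positions(form):
    """Split a stressed form into (bare form, set of stressed letter indices)."""
    bare = []
    positions = set()
    for ch in form:
        if ch == ACUTE:
            if bare:
                positions.add(len(bare) - 1)  # accent belongs to the previous letter
        else:
            bare.append(ch)
    return "".join(bare), positions


def apply_stress(bare, positions):
    """Re-insert an acute accent after every letter whose index is in positions."""
    return "".join(ch + ACUTE if i in positions else ch for i, ch in enumerate(bare))


def choose_stress(readings, selection="safe", rng=random):
    """Toy model of the selection modes; NOT udar's real code.

    readings: stressed variants of one token, e.g. its remaining analyses.
    """
    if not readings:
        raise ValueError("need at least one reading")
    parsed = [stress_positions(r) for r in readings]
    bare = parsed[0][0]
    if selection == "safe":
        # Mark stress only if every reading agrees; otherwise abstain.
        distinct = {frozenset(pos) for _, pos in parsed}
        return apply_stress(bare, parsed[0][1]) if len(distinct) == 1 else bare
    if selection == "rand":
        # Assumption: pick one of the candidate readings at random.
        return rng.choice(sorted(readings))
    if selection == "all":
        # Mark every position that is stressed in at least one reading.
        merged = set().union(*(pos for _, pos in parsed))
        return apply_stress(bare, merged)
    raise ValueError(f"unknown selection: {selection!r}")


# Two readings of the same token that differ only in stress placement.
readings = ["сло" + ACUTE + "ва", "слова" + ACUTE]
print(choose_stress(readings, "safe"))
print(choose_stress(readings, "all"))
```

Under this model, 'safe' leaves the token bare (as in the first output, where слова stays unstressed), while 'all' unions the stress marks of every remaining reading, which is why a form with two accents can appear.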

reynoldsnlp commented 4 years ago

The fact that you had to ask is a result of my not having documented these features well, so I'm going to leave this issue open as a reminder to improve the documentation. In the meantime, you can see the docstrings for very basic documentation of many functions, e.g. help(udar.Token.stressed).

katyamineeva commented 4 years ago

Thank you for the clarification!