marytts / marytts-lang-hsb

Upper Sorbian language component for MaryTTS
GNU Lesser General Public License v3.0

Add further tests for number expansion: ordinals #6

Open aStereoID opened 3 years ago

aStereoID commented 3 years ago

@JMK-CB @aStereoID Thanks very much for your help!

I think we need to follow up on this to get more flexibility, especially regarding real numbers (in addition to natural numbers), ordinals (in addition to cardinals), and special numbers such as years. We should do that in a new issue.

Originally posted by @psibre in https://github.com/psibre/marytts-lang-hsb/issues/2#issuecomment-792598040

aStereoID commented 3 years ago

Since ordinals are already included in formatRules.txt, this might be the easiest place to start:

@JMK-CB : could you please check them again? @psibre : I guess some additions have to be made in the Preprocess module? And another test set added?

What should that look like? The default would be the (nominative) masculine form, like

  1. = prěni
  2. = druhi

For the feminine and neuter forms, would you have to consider the last letter of the following noun?

  1. šula = prěnja šula
  2. słowo = druhe słowo

aStereoID commented 3 years ago

What about the other cases? So far we have only nominative..

psibre commented 3 years ago

Thanks for opening this and the info!

I'm writing integration tests that should make it easier to specify the desired behavior. Unfortunately, the hard-wired input/output module chain in MaryTTS is this (extracted from DEBUG logs):

JTokeniser converts RAWMARYXML into TOKENS
Preprocess converts TOKENS into WORDS
OpenNLPPosTagger converts WORDS into PARTSOFSPEECH
JPhonemiser converts PARTSOFSPEECH into PHONEMES

There doesn't seem to be an intuitive way of handling number expansion at the TOKENS stage if it requires morphosyntactic analysis, unless we overload the Preprocess module with all kinds of magic... which would also require all manner of NLP resources which I doubt exist for Sorbian. And I fear it would lead to feature creep.
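To make the ordering problem concrete, here is a minimal sketch of the chain from the DEBUG log. The class `ModuleChainSketch` and its method `availableTo` are hypothetical names for illustration only, not MaryTTS API; only the module and data type names are taken from the log above.

```java
import java.util.ArrayList;
import java.util.List;

class ModuleChainSketch {
    // {module, input type, output type}, in the order shown in the DEBUG log.
    static final String[][] CHAIN = {
        {"JTokeniser", "RAWMARYXML", "TOKENS"},
        {"Preprocess", "TOKENS", "WORDS"},
        {"OpenNLPPosTagger", "WORDS", "PARTSOFSPEECH"},
        {"JPhonemiser", "PARTSOFSPEECH", "PHONEMES"},
    };

    // Data types that already exist when the named module starts running.
    static List<String> availableTo(String module) {
        List<String> seen = new ArrayList<>();
        seen.add(CHAIN[0][1]);
        for (String[] step : CHAIN) {
            if (step[0].equals(module)) {
                return seen;
            }
            seen.add(step[2]);
        }
        throw new IllegalArgumentException("unknown module: " + module);
    }

    public static void main(String[] args) {
        // prints [RAWMARYXML, TOKENS]
        System.out.println(availableTo("Preprocess"));
    }
}
```

PARTSOFSPEECH never shows up in the list for Preprocess, which is why gender or case agreement cannot simply be looked up at the stage where numbers are expanded.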

However, we can definitely move forward with simple things, and then reassess.

psibre commented 3 years ago

Incidentally, @JMK-CB how should real numbers like 1,14159 be spelled out? What's the word for "comma", or is it a period instead?

JMK-CB commented 3 years ago

I have composed a list of test sentences for ordinal numbers combined with the different cases. It is probably not realistic to solve all of those specific cases, but we could try to tackle some of them (some don't occur very often anyway).

I am also compiling a similar list for special cases with cardinal numbers.

Attachment: Testsätze für Ordinalia.zip (test sentences for ordinals)

JMK-CB commented 3 years ago

Incidentally, @JMK-CB how should real numbers like 1,14159 be spelled out? What's the word for "comma", or is it a period instead?

comma is "koma" in both Sorbian languages.

As far as I can tell, real numbers should not be a problem, because Sorbian simply reads the digits one by one, without any modification for case or similar. So your example should always result in "jedyn koma jedyn štyri jedyn pjeć dźewjeć".
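A minimal sketch of that digit-by-digit reading, for illustration only. The class and method names are hypothetical, and the integer part is delegated to a caller-supplied cardinal expansion. Only the words jedyn, štyri, pjeć, dźewjeć and koma are attested in this thread; the remaining digit words are assumptions that a native speaker should verify.

```java
import java.util.function.LongFunction;

class DecimalSpelloutSketch {
    // Non-ASCII letters are written as Unicode escapes so that the file
    // compiles under any source encoding.
    static final String[] DIGITS = {
        "nula",                   // 0 (assumption, not attested in the thread)
        "jedyn",                  // 1
        "dwaj",                   // 2 (assumption)
        "t\u0159i",               // 3 (assumption)
        "\u0161tyri",             // 4
        "pje\u0107",              // 5
        "\u0161\u011bs\u0107",    // 6 (assumption)
        "sydom",                  // 7 (assumption)
        "wosom",                  // 8 (assumption)
        "d\u017aewje\u0107"       // 9
    };

    // Reads a string of ASCII digits one digit at a time.
    static String spellDigits(String digits) {
        StringBuilder out = new StringBuilder();
        for (char c : digits.toCharArray()) {
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException("not a digit: " + c);
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(DIGITS[c - '0']);
        }
        return out.toString();
    }

    // Expands a token such as 1,14159. The integer part goes through the
    // caller-supplied cardinal expansion (a hypothetical hook).
    static String spellDecimal(String token, LongFunction<String> cardinal) {
        int comma = token.indexOf(',');
        if (comma <= 0 || comma == token.length() - 1) {
            throw new IllegalArgumentException("expected <digits>,<digits>: " + token);
        }
        long integerPart = Long.parseLong(token.substring(0, comma));
        return cardinal.apply(integerPart) + " koma "
                + spellDigits(token.substring(comma + 1));
    }
}
```

Passing the cardinal expansion in as a function keeps the sketch independent of how whole numbers are spelled out elsewhere.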

However, Astrid has directed my attention to fractions. Those won't be quite as easy to handle, but I will have a look and compile a list of testable cases.

JMK-CB commented 3 years ago

@JMK-CB : could you please check them again?

Those are correct.

aStereoID commented 3 years ago

Incidentally, @JMK-CB how should real numbers like 1,14159 be spelled out? What's the word for "comma", or is it a period instead?

comma is "koma" in both Sorbian languages.

As far as I can tell, real numbers should not be a problem, because Sorbian simply reads the digits one by one, without any modification for case or similar. So your example should always result in "jedyn koma jedyn štyri jedyn pjeć dźewjeć".

@psibre: So decimals could be one of the "simple things"?

aStereoID commented 3 years ago

Now I'm unsure about the cases; Jan's list scares me ;-) And it doesn't only concern ordinals. Maybe it's better to set a default variant (@JMK-CB: usually nominative masculine?) and otherwise recommend spelling the number out? Consider that "Prěnja žona w swětnišću.." ("The first woman in space..") is more obvious than "1. žona w swětnišću.." ("The 1st woman in space..").

Shall we discuss this tomorrow in Zoom?

psibre commented 3 years ago

Thanks for the details and feedback!

Regarding the real numbers, that's something I expect to easily solve later today.

The list of sentences with ordinal numbers is a great resource, and we can use it to investigate how to support those linguistic cases (pun intended).

JMK-CB commented 3 years ago

I agree with Astrid: the default should be nominative masculine, because that's probably what people would expect as a technical default. Other forms might be perceived as errors.

However, I think it would be great to at least be able to recognize the grammatical gender the numbers refer to. I think a wrong gender could be more confusing (because of other nouns in the context) than expecting a case ending but getting nominative.

aStereoID commented 3 years ago

I hope we can get some momentum back into this topic :-)

At the moment, number expansion is only done for cardinals (nominative masculine). To include the pronunciation of years (#5) and at least the nominative default version of the ordinals, perhaps we could proceed similarly to Lb? They also use the RuleBasedNumberFormat with:

String formatRules
final String cardinalRule
final String ordinalRule
final String ordinalFemaleRule
final String ordinalNeutrumRule
final String yearRule

where

cardinalRule = "%spellout-numbering"
ordinalRule = "%spellout-ordinal-maskulinum"
ordinalFemaleRule = "%spellout-ordinal-femininum"
ordinalNeutrumRule = "%spellout-ordinal-neutrum"
yearRule = "%spellout-numbering-year"

Possible regexes might be (I tried Java notation; inside a Java string literal each backslash would need to be doubled):

// year
pattern = "(?<=\D)(1[1-9]\d\d)(?=\D)"
// ordinal female (nominative): in most cases the following word ends with -a
pattern = "(\d+\.)(?=\s\b\w+?a\b)"
// ordinal neutrum (nominative): in most cases the following word ends with -o or -e
pattern = "(\d+\.)(?=\s\b\w+?[oe]\b)"

All other cases of "\d+\." would be the else variant and should be expanded according to the ordinalRule (= %spellout-ordinal-maskulinum).
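A minimal sketch of how these if/else branches could be arranged, under some assumptions. The class `OrdinalRuleSketch` and the method `ruleFor` are hypothetical names, not part of MaryTTS. The rule set names are the ones quoted above. By default Java's `\w` matches only ASCII word characters, so the sketch compiles the patterns with `Pattern.UNICODE_CHARACTER_CLASS` to cover Sorbian letters. The neuter character class is written `[oe]`, since `|` inside brackets is a literal character. The year pattern uses negative lookarounds, which is a deliberate change so that a year at the very start or end of the text also matches.

```java
import java.util.regex.Pattern;

class OrdinalRuleSketch {
    // Rule set names as quoted in this thread.
    static final String CARDINAL_RULE = "%spellout-numbering";
    static final String ORDINAL_RULE = "%spellout-ordinal-maskulinum";
    static final String ORDINAL_FEMALE_RULE = "%spellout-ordinal-femininum";
    static final String ORDINAL_NEUTRUM_RULE = "%spellout-ordinal-neutrum";
    static final String YEAR_RULE = "%spellout-numbering-year";

    // With this flag, \w and \b also cover non-ASCII letters.
    private static final int U = Pattern.UNICODE_CHARACTER_CLASS;

    static final Pattern FEMININE =
            Pattern.compile("(\\d+\\.)(?=\\s\\b\\w+?a\\b)", U);
    static final Pattern NEUTER =
            Pattern.compile("(\\d+\\.)(?=\\s\\b\\w+?[oe]\\b)", U);
    static final Pattern ORDINAL = Pattern.compile("\\d+\\.", U);
    // Assumption: negative lookarounds replace the original (?<=\D) and (?=\D).
    static final Pattern YEAR =
            Pattern.compile("(?<!\\d)1[1-9]\\d\\d(?!\\d)", U);

    // Picks a rule set name for text that starts at a number token.
    // Ordinals are checked first, so "1989." counts as an ordinal here.
    static String ruleFor(String text) {
        if (FEMININE.matcher(text).lookingAt()) {
            return ORDINAL_FEMALE_RULE;
        }
        if (NEUTER.matcher(text).lookingAt()) {
            return ORDINAL_NEUTRUM_RULE;
        }
        if (ORDINAL.matcher(text).lookingAt()) {
            return ORDINAL_RULE; // the "else" variant for all other ordinals
        }
        if (YEAR.matcher(text).lookingAt()) {
            return YEAR_RULE;
        }
        return CARDINAL_RULE;
    }

    public static void main(String[] args) {
        // Unicode escape for the first letter keeps the source ASCII-only.
        // prints %spellout-ordinal-femininum
        System.out.println(ruleFor("1. \u0161ula"));
    }
}
```

The returned name could then be passed as the rule set argument of ICU4J's `RuleBasedNumberFormat.format(long, String)`, assuming rule sets with exactly these names are defined in formatRules.txt.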

@JMK-CB : Plural nominative seems to behave like neuter? Perhaps we only need an additional ordinalPluralRule for the case where the following word ends with -i?

I'm not sure how and where exactly to implement these ifs and elses in the Preprocess file, so @psibre, maybe you can help?