sebischair / SimpleNLG-DE

German version of SimpleNLG 4
https://wwwmatthes.in.tum.de/

Singular noun variants are ignored during indexing of XMLLexicon #4

Open vzam opened 8 months ago

vzam commented 8 months ago

When the XMLLexicon is created, I would expect the variants I have configured to be indexed, so that words can be found by createWord through their variants. This is not always the case.

I have found that dative_sin, genitive_sin and akkusative_sin are not indexed correctly for some words, and that indexing works in most cases only by coincidence, because the heuristics defined in MorphologyRules are good.

Here is an example:

<word>
  <baseForm>Band</baseForm>
  <genus>n</genus>
  <genitive_sin>Bandes</genitive_sin>
</word>

Please note that I am using Bandes instead of Bands, which is a more elevated form of the same word. During the creation of the XMLLexicon, SimpleNLG-DE goes through all variants of each word and puts them into the index. However, instead of using the genitive_sin configured for DiscourseFunction.GENITIVE + NumberAgreement.SINGULAR, it applies the heuristics, which generate Bands. So the index contains Bands instead of Bandes.
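To make the symptom concrete, here is a minimal, self-contained sketch of the two indexing behaviours. It is only a model: the class VariantIndexSketch, the Entry type, and the "append s" heuristic are hypothetical stand-ins and not part of SimpleNLG-DE. Only the tag name genitive_sin is taken from the lexicon entry above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the indexing problem; not SimpleNLG-DE code.
class VariantIndexSketch {

    // A lexicon entry: base form plus explicitly configured variants,
    // keyed by lexicon tag (for example "genitive_sin").
    static class Entry {
        final String baseForm;
        final Map<String, String> variants;

        Entry(String baseForm, Map<String, String> variants) {
            this.baseForm = baseForm;
            this.variants = variants;
        }
    }

    // Stand-in heuristic (assumption): genitive singular = base form + "s".
    static String heuristicGenitive(String baseForm) {
        return baseForm + "s";
    }

    // Behaviour described in the report: the configured variant is ignored
    // and only the heuristic form ends up in the index.
    static Map<String, Entry> indexWithHeuristicOnly(Entry entry) {
        Map<String, Entry> index = new HashMap<>();
        index.put(entry.baseForm, entry);
        index.put(heuristicGenitive(entry.baseForm), entry);
        return index;
    }

    // Expected behaviour: the configured variant wins, the heuristic is a fallback.
    static Map<String, Entry> indexPreferringConfigured(Entry entry) {
        Map<String, Entry> index = new HashMap<>();
        index.put(entry.baseForm, entry);
        String genitive = entry.variants.getOrDefault(
                "genitive_sin", heuristicGenitive(entry.baseForm));
        index.put(genitive, entry);
        return index;
    }

    public static void main(String[] args) {
        Entry band = new Entry("Band", Map.of("genitive_sin", "Bandes"));

        // Lookup by the configured variant fails with heuristic-only indexing.
        System.out.println(indexWithHeuristicOnly(band).containsKey("Bandes"));    // false
        System.out.println(indexPreferringConfigured(band).containsKey("Bandes")); // true
    }
}
```

In this model, a lookup by Bandes only succeeds when the configured variant is indexed, which matches what I would expect from createWord.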

I think there are several issues contributing to that:

The getVariants method creates an InflectedWordElement https://github.com/sebischair/SimpleNLG-DE/blob/5c831cb9722406c749bc00bdd867e4d694e4bb4a/src/main/java/simplenlgde/lexicon/XMLLexicon.java#L251

which does not receive the genus property from the base word. https://github.com/sebischair/SimpleNLG-DE/blob/5c831cb9722406c749bc00bdd867e4d694e4bb4a/src/main/java/simplenlgde/framework/InflectedWordElement.java#L65C5-L73C6

That inflected word is then used in https://github.com/sebischair/SimpleNLG-DE/blob/5c831cb9722406c749bc00bdd867e4d694e4bb4a/src/main/java/simplenlgde/morphology/MorphologyRules.java#L70 where the genus will always be null.

Later, in https://github.com/sebischair/SimpleNLG-DE/blob/5c831cb9722406c749bc00bdd867e4d694e4bb4a/src/main/java/simplenlgde/morphology/MorphologyRules.java#L184 the genus is one of several conditions that decide whether the heuristics are used. Since it is always null, the genitive_sin is already ignored at that point.

Even if the genus were set properly, the genitive_sin specified in the lexicon would still be ignored, because here https://github.com/sebischair/SimpleNLG-DE/blob/5c831cb9722406c749bc00bdd867e4d694e4bb4a/src/main/java/simplenlgde/morphology/MorphologyRules.java#L208 the element is the InflectedWordElement, which does not carry any of the lexicon variants that were set on the base word.

I have not analysed the impact of this on other components of the library. However, I think that if the word is created from the base form Band and then set to DiscourseFunction.GENITIVE and NumberAgreement.SINGULAR, it should be correctly realised as Bandes, because in that case the variant is taken from the base word.

My suggestions are to get the genus from the baseWord instead of the InflectedWordElement in doNounMorphology, and to pass the baseWord instead of the element to doNounMorphologySingular. I have tested both changes in my own environment and all tests pass, but I have not analysed whether anything else that uses doNounMorphology could break.
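The idea behind both suggestions can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the actual patch: the Element class, the feature keys used as plain strings, and the "append s" fallback are hypothetical and only mimic the roles of the base word, the InflectedWordElement, and the heuristics.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the suggested fix; not SimpleNLG-DE code.
class BaseWordLookupSketch {

    // Minimal stand-in for a word element with a feature map.
    static class Element {
        final String baseForm;
        final Element baseWord; // null when this element is itself the base word
        final Map<String, String> features = new HashMap<>();

        Element(String baseForm, Element baseWord) {
            this.baseForm = baseForm;
            this.baseWord = baseWord;
        }
    }

    // Uses the configured genitive only if the feature source knows both the
    // genus and the variant; otherwise falls back to a toy heuristic.
    static String genitiveSingular(String baseForm, Element featureSource) {
        String genus = featureSource.features.get("genus");
        String configured = featureSource.features.get("genitive_sin");
        if (genus != null && configured != null) {
            return configured;
        }
        return baseForm + "s"; // stand-in heuristic (assumption)
    }

    // Behaviour described in the report: features are read from the inflected element.
    static String beforeFix(Element inflected) {
        return genitiveSingular(inflected.baseForm, inflected);
    }

    // Suggested change: features are read from the base word instead.
    static String afterFix(Element inflected) {
        return genitiveSingular(inflected.baseForm, inflected.baseWord);
    }

    public static void main(String[] args) {
        Element band = new Element("Band", null);
        band.features.put("genus", "n");
        band.features.put("genitive_sin", "Bandes");

        // The inflected element wraps the base word but copies none of its features.
        Element inflected = new Element("Band", band);

        System.out.println(beforeFix(inflected)); // Bands
        System.out.println(afterFix(inflected));  // Bandes
    }
}
```

The sketch only covers the genitive singular. The same lookup would presumably apply to dative_sin and akkusative_sin, but I have not modelled that here.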

DaBr01 commented 8 months ago

Thanks for the report. I will have to look into this in more detail, but I can already tell that the problem seems to be a bit different.

"Bandes" is the standard form for genitive singular that is also used in the default lexicon. The following code produces the output "Ich sehe die Farbe des Bandes".

Lexicon lexicon = Lexicon.getDefaultLexicon();
NLGFactory nlgFactory = new NLGFactory(lexicon);
Realiser realiser = new Realiser(lexicon);

SPhraseSpec sentence = nlgFactory.createClause();
sentence.setSubject("ich");
sentence.setVerb("sehen");
NPPhraseSpec farbe = nlgFactory.createNounPhrase("die", "farbe");
NPPhraseSpec band = nlgFactory.createNounPhrase("das", "band");
band.setFeature(InternalFeature.CASE, DiscourseFunction.GENITIVE);
farbe.addComplement(band);
sentence.setObject(farbe);

String output = realiser.realiseSentence(sentence);
System.out.println(output);

(There is nevertheless something wrong with loading the information for the inflected forms.)

The reason why you get the form "Bands" instead of "Bandes" with your custom lexicon entry seems to be that the lexicon entry is incomplete. I will have to double-check, but it looks like SimpleNLG-DE falls back to the rules if a noun entry specifies a form for only one case, because the entry is then considered incomplete. (Whether that makes sense would be the next question.)