google-code-export / dkpro-core-asl

Automatically exported from code.google.com/p/dkpro-core-asl

Several problems with AsvToolboxSplitterAlgorithm's handleLastSplit() method #571

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What version of the product are you using? On what operating system?
The version currently browsable on Google Code. Linux.

Issue 1.
The boolean isInvStartsWith, defined on line 418, has no effect in any case and 
can be removed.
The 'isInvStartsWith' case cannot occur, because the method always receives the 
last segment of the compound as aSplit, so either aSplit.startsWith(rest) or 
aSplit.startsWith(restGrund) holds.

What steps will reproduce the problem?
1. comment out isInvStartsWith
2. test on a bunch of examples
3. no difference in behavior

What is the expected output? What do you see instead?

Output is fine, this is a sanity issue. If that variable has a role, I 
misunderstood the code completely.

Issue 2.
In line 416, isEqual should be:
boolean isEqual = /*aSplit.equals(restGrund) ||*/ aSplit.equals(rest);
i.e. it should not consider the 'equals restGrund' case.

With the current check, the last part of the compound is never lemmatized. If 
that is desired, this is a non-issue, but I find it counterintuitive (typically 
the last part of the noun is what gets inflected).
Sometimes the equality check also prevents reducing an inner part (see the 2nd 
example below).

What steps will reproduce the problem?
1. comment out the part above
2. test on a bunch of examples
3. the behavior differs: the last part of the compound now gets lemmatized. I 
think this is desirable, and the inflection can be dropped entirely (it is 
not a linking morpheme that should be annotated, but a standard inflection).

What is the expected output? What do you see instead?
    INPUT                            DESIRED (in my opinion)                 OBSERVED
1.  Bankdienstleistungen             Bank+dienst+leistung                    Bank+dienst+leistungen
2.  Fußbodenschleifmaschinenverleih  Fuß+boden+schleif+maschine+(n)+verleih  Fuß+boden+schleif+maschinen+verleih
3.  Halsschmerzen                    Hals+schmerz                            Hals+schmerzen
4.  Klimaschutzzielen                Klima+schutz+ziel                       Klima+schutz+zielen
5.  Kopfschmerzen                    Kopf+schmerz                            Kopf+schmerzen
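
To make "test on a bunch of examples" repeatable, the table above could be turned into a small regression check. This is only a sketch: the class name, the method names and the Function-based splitter interface are my own assumptions for illustration and are not part of AsvToolboxSplitterAlgorithm. The splitter is expected to return the parts joined with "+".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical regression helper; not part of dkpro-core.
class LastSplitRegressionSketch {

    // Inputs and desired splits, copied from the table in this issue.
    static Map<String, String> desiredSplits() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("Bankdienstleistungen", "Bank+dienst+leistung");
        // \u00df is the German sharp s
        m.put("Fu\u00dfbodenschleifmaschinenverleih",
              "Fu\u00df+boden+schleif+maschine+(n)+verleih");
        m.put("Halsschmerzen", "Hals+schmerz");
        m.put("Klimaschutzzielen", "Klima+schutz+ziel");
        m.put("Kopfschmerzen", "Kopf+schmerz");
        return m;
    }

    // Returns input -> actual output for every input whose split
    // differs from the desired one.
    static Map<String, String> mismatches(Function<String, String> splitter) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : desiredSplits().entrySet()) {
            String actual = splitter.apply(e.getKey());
            if (!e.getValue().equals(actual)) {
                out.put(e.getKey(), actual);
            }
        }
        return out;
    }
}
```

Wrapping the real splitter in such a Function and running it before and after the change to isEqual would list exactly the rows whose output changed.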

Issue 3.
If the change from Issue 2 is approved and applied, it surfaces a bug in line 
436 (which was not a problem before, because the last part was never lemmatized).

Namely, this line assumes that the reduced (lemma) form is never longer than 
the inflected form. This is not always true, see below.

What steps will reproduce the problem?
1. Implement the change suggested in Issue 2, i.e. remove equals(restGrund) 
check.
2. test with "Betriebsmodi"
3. substring throws a StringIndexOutOfBoundsException

What is the expected output? What do you see instead?
Betriebsmodi    Betrieb+(s)+modus
Instead: exception thrown.

Fix: add a check around line 436:
                    //there is something at the end; this is not true for irregular cases where
                    //the inflected form is shorter than the lemma: "modus" --> "modi" (plural)
                    if (rest.length() > restGrund.length()) {
                        retvec.add("(" + rest.substring(restGrund.length()) + ")");
                    }
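
The failure and the guard can be reproduced in isolation. The sketch below is not the real handleLastSplit(): the class and method names are hypothetical, and only the rest/restGrund/retvec names and the guarded substring call are taken from the fix above. It assumes rest is the inflected last segment and restGrund its reduced form.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch of the guarded suffix handling.
class LastSplitGuardSketch {

    // Adds the lemma and, only if the inflected form is longer,
    // the trailing inflection in parentheses.
    static List<String> lemmaWithSuffix(String rest, String restGrund) {
        List<String> retvec = new ArrayList<>();
        retvec.add(restGrund);
        if (rest.length() > restGrund.length()) {
            retvec.add("(" + rest.substring(restGrund.length()) + ")");
        }
        return retvec;
    }

    // Shows what the unguarded call does: substring(beginIndex) throws
    // when beginIndex is larger than the string length.
    static boolean unguardedThrows(String rest, String restGrund) {
        try {
            rest.substring(restGrund.length());
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(lemmaWithSuffix("maschinen", "maschine"));
        System.out.println(lemmaWithSuffix("modi", "modus"));
        System.out.println(unguardedThrows("modi", "modus"));
    }
}
```

With "modi" (4 characters) and "modus" (5 characters) the unguarded substring(5) is out of range, while the guarded version just adds the lemma without a suffix.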

Original issue reported on code.google.com by szarv...@amazon.de on 22 Dec 2014 at 11:03