Closed albbas closed 11 years ago
Date: 2012-10-29 11:41:29 +0100
From: Sjur Nørstebø Moshagen <
Date: 2012-10-29 11:44:26 +0100
From: Thomas Omma <
interesting indeed!
Date: 2012-10-29 11:52:46 +0100
From: Sjur Nørstebø Moshagen <
There was supposed to be a description of the bug when originally reported. Here it comes:
I have been debugging errors in the noun lemma generation testing for SMA, and then SMJ. As the bug in the test bench got fixed, some interesting things in the SMJ lexicon and twolc files popped up. The following lemmas are presently not recognized:
HFST - lemmas with spaces:
club music football teama world club
This is an HFST bug, and can be IGNORED for now. HFST does not accept input strings with spaces in them, at least not in the same way as Xerox' lookup utility does. It needs to be fixed by the HFST team.
BOTH:
dållågát energi fiervvágát gåjkkegát jávrregát jåhkågát kondom merragát nuorregát nuppegiel nuppegiela vuodnagát ánársámegiel ædnogát
These are probably regular LexC entry bugs. Inga should have a look at them.
XEROX:
exhibitionist existensialist exorsist exotist expressionist extremist katolicist neoklassicist revanchist suksess
These are TWOLC bugs, and probably the result of rule conflicts. The reason only XEROX barks at these is that HFST interprets the twolc rule conflict differently, giving both the correct AND the wrong behavior, whereas Xerox ONLY gives the wrong behavior.
The bug is that the last letter (-t or -s) is deleted by some rule, which of course destroys the generated lemma. This part of the bug should be fixed in cooperation between Inga, Thomas and me. I want to verify that the HFST interpretation of the rule(s) is correct before we change the actual rule(s), so that possible bugs in HFST can be fixed. There are known discrepancies between Xerox and HFST with respect to compilation and conflict resolution in two-level rules, where the HFST team claims that Xerox misbehaves for a certain type of conflict. I don't yet know whether this is such a case.
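To make the grouping above concrete, here is a minimal sketch of how a lemma generation test could sort failures into BOTH and XEROX-only. It is not the actual test bench: the function name and the sample data are hypothetical, and the truncated forms are invented only to illustrate a deleted final letter.

```python
from typing import Dict, Set


def failing_lemmas(generated: Dict[str, Set[str]]) -> Set[str]:
    """Return the lemmas that do not appear among their own generated forms."""
    return {lemma for lemma, forms in generated.items() if lemma not in forms}


# Hypothetical generator output per backend (lemma -> generated forms).
# Xerox is assumed to give only the wrong form; HFST both the right and the wrong one.
xerox = {
    "extremist": {"extremis"},
    "suksess": {"sukses"},
    "energi": set(),
}
hfst = {
    "extremist": {"extremist", "extremis"},
    "suksess": {"suksess", "sukses"},
    "energi": set(),
}

xerox_fail = failing_lemmas(xerox)
hfst_fail = failing_lemmas(hfst)

both = xerox_fail & hfst_fail        # candidates for LexC entry bugs
xerox_only = xerox_fail - hfst_fail  # candidates for twolc rule conflicts

print(sorted(both))        # ['energi']
print(sorted(xerox_only))  # ['extremist', 'suksess']
```

Under these assumptions a lemma passes as soon as the correct form is among the outputs, which is why HFST stays silent on the -ist words even though it also produces the wrong form.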
Date: 2012-10-29 12:09:59 +0100
From: Thomas Omma <
smj $ svn ci -m "changed st-final prefixes to st9, changed lemma for gat-prefixes to gatt, reflecting the output-part" src/morphology/stems/nouns.lexc
Sending src/morphology/stems/nouns.lexc
Transmitting file data .
Committed revision 64612.
About the -gátt- versus -gát- Ingá has to decide on normativity:
dållågáttmuorra dållågáttmuorra dållågátt+N+Cmp#muorra+N+Sg+Nom
Date: 2012-10-29 12:12:09 +0100
From: Thomas Omma <
but anyways, now we got:
smj $ usmjNorm
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
exhibitionist- exhibitionist- exhibitionist+N+RCmpnd
neoklassicist- neoklassicist- neoklassicist+N+RCmpnd
dållågátt- dållågátt- dållågátt+N+RCmpnd
etc
Date: 2012-10-29 12:15:27 +0100
From: Thomas Omma <
these are Sub:
smj $ usmj
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
energi- energi- energi+N+RCmpnd
kondom- kondom- kondåvmmå+N+SgNomCmp+RCmpnd
kondom- kondom+N+RCmpnd
ánársámegiel ánársámegiel anársámegiella+N+Attr
and these seem fine:
smj $ usmjNorm
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
nuppegiel- nuppegiel- nuppegiel+N+DefSgNomCmp+RCmpnd
nuppegiela- nuppegiela- nuppegiela+N+DefSgNomCmp+RCmpnd
Date: 2012-10-29 12:24:47 +0100
From: Thomas Omma <
smj $ svn ci -m "removed ánársámegiel prefix, it was redundant" src/morphology/stems
Sending src/morphology/stems/nouns.lexc
Transmitting file data .
Committed revision 64613.
Date: 2012-10-29 21:17:21 +0100
From: Sjur Nørstebø Moshagen <
(In reply to comment #2)
> HFST - lemmas with spaces:
> club music football teama world club
> This is an HFST bug, and can be IGNORED for now. HFST does not accept input strings with spaces in them, at least not in the same way as Xerox' lookup utility does. It needs to be fixed by the HFST team.
This is a known "feature" - it requires that the space is declared as an alphabet character in the twolc file. Solved in svn revision 64618.
After Thomas' changes, the only problem words left are:
energi kondom nuppegiel nuppegiela
(both Xerox and hfst). These are left for Inga.
And most importantly: Xerox and HFST now behave exactly the same, in ALL tests.
Date: 2012-10-30 11:35:40 +0100
From: Inga Lill Sigga Mikkelsen <
energi kondom nuppegiel nuppegiela
energi and kondom had the Use/Sub tag. Compound loanwords quote the first word, so there is no need to Sub-mark these words. Entries like "alkohol" are not Sub-marked, so energi and kondom shouldn't be either.
nuppegiel and nuppegiela are genitive forms, and the test only accepts nominative entries. I marked these entries with !, because we don't want the entry "nuppegiella".
This issue was created automatically with bugzilla2github
Bugzilla Bug 1485
Date: 2012-10-29T11:41:29+01:00 From: Sjur Nørstebø Moshagen <>
To: Inga Lill Sigga Mikkelsen <>
CC: lene.antonsen, sjur.n.moshagen, thomas.omma, trond.trosterud
Last updated: 2012-10-30T11:35:40+01:00