giellalt / lang-smj

Finite state and Constraint Grammar based analysers and proofing tools + language resources for Lule Sámi
https://giellalt.uit.no
GNU General Public License v3.0

Issues in the SMJ lexicon and twolc files #55

Closed albbas closed 11 years ago

albbas commented 11 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 1485

Date: 2012-10-29T11:41:29+01:00 From: Sjur Nørstebø Moshagen <> To: Inga Lill Sigga Mikkelsen <> CC: lene.antonsen, sjur.n.moshagen, thomas.omma, trond.trosterud

Last updated: 2012-10-30T11:35:40+01:00

albbas commented 11 years ago

Comment 7204

Date: 2012-10-29 11:41:29 +0100 From: Sjur Nørstebø Moshagen <>

albbas commented 11 years ago

Comment 7205

Date: 2012-10-29 11:44:26 +0100 From: Thomas Omma <>

interesting indeed!

albbas commented 11 years ago

Comment 7206

Date: 2012-10-29 11:52:46 +0100 From: Sjur Nørstebø Moshagen <>

There was supposed to be a description of the bug when originally reported. Here it comes:

I have been debugging errors in the noun lemma generation testing for SMA, and then SMJ. As the bug in the test bench got fixed, some interesting things in the SMJ lexicon and twolc files popped up. The following lemmas are presently not recognized:

HFST - lemmas with spaces:

club music football teama world club

This is an HFST bug, and can be IGNORED for now. HFST does not accept input strings with spaces in them, at least not in the same way as Xerox' lookup utility does. It needs to be fixed by the HFST team.

BOTH:

dållågát energi fiervvágát gåjkkegát jávrregát jåhkågát kondom merragát nuorregát nuppegiel nuppegiela vuodnagát ánársámegiel ædnogát

These are probably regular LexC entry bugs. Inga should have a look at them.

XEROX:

exhibitionist existensialist exorsist exotist expressionist extremist katolicist neoklassicist revanchist suksess

These are TWOLC bugs, probably the result of rule conflicts. The reason only XEROX barks at these is that HFST interprets the twolc rule conflict differently, giving both the correct AND the wrong behavior, whereas Xerox ONLY gives the wrong behavior.

The bug is that the last letter (-t or -s) is deleted by some rule, which of course destroys the generated lemma. This part of the bug should be fixed in cooperation between Inga, Thomas and me. I probably want to verify that the HFST interpretation of the rule(s) is correct before we change the actual rule(s), so that possible bugs in HFST can be fixed. There are known discrepancies between Xerox and HFST with respect to compilation and conflict resolution in two-level rules, where the HFST team claims that Xerox misbehaves for a certain type of conflict. I don't yet know whether this is such a case.
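The difference between the two toolchains can be made concrete with a toy model of the lemma generation test. This is only a minimal sketch in Python, not the actual test bench: the function name `lemma_test` and the example generator outputs are hypothetical, chosen to mirror the description above (HFST yielding both the correct and the truncated form, Xerox only the truncated one). It assumes the test passes whenever the lemma itself appears among the generated forms.

```python
def lemma_test(lemma, generated_forms):
    """Pass if the lemma is among the forms generated for it (assumed test criterion)."""
    return lemma in generated_forms

# Hypothetical generator outputs for the lemma "suksess":
# HFST is described as giving both the correct and the wrong form,
# Xerox as giving only the wrong form (final letter deleted).
hfst_output = {"suksess", "sukses"}
xerox_output = {"sukses"}

results = {
    "hfst": lemma_test("suksess", hfst_output),
    "xerox": lemma_test("suksess", xerox_output),
}
print(results)  # {'hfst': True, 'xerox': False}
```

Under this assumption the HFST run hides the rule conflict, because the correct form is still present, while the Xerox run exposes it.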

albbas commented 11 years ago

Comment 7207

Date: 2012-10-29 12:09:59 +0100 From: Thomas Omma <>

smj $ svn ci -m "changed st-final prefixes to st9, changed lemma for gat-prefixes to gatt, reflecting the output-part" src/morphology/stems/nouns.lexc
Sending src/morphology/stems/nouns.lexc
Transmitting file data .
Committed revision 64612.

As for -gátt- versus -gát-, Ingá has to decide on normativity:

dållågáttmuorra dållågáttmuorra dållågátt+N+Cmp#muorra+N+Sg+Nom

albbas commented 11 years ago

Comment 7208

Date: 2012-10-29 12:12:09 +0100 From: Thomas Omma <>

but anyways, now we got:

smj $ usmjNorm
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
exhibitionist- exhibitionist- exhibitionist+N+RCmpnd

neoklassicist- neoklassicist- neoklassicist+N+RCmpnd

dållågátt- dållågátt- dållågátt+N+RCmpnd

etc

albbas commented 11 years ago

Comment 7209

Date: 2012-10-29 12:15:27 +0100 From: Thomas Omma <>

these are Sub:

smj $ usmj 0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100% energi- energi- energi+N+RCmpnd

kondom- kondom- kondåvmmå+N+SgNomCmp+RCmpnd kondom- kondom+N+RCmpnd

ánársámegiel ánársámegiel anársámegiella+N+Attr

and these seem fine:

smj $ usmjNorm
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
nuppegiel- nuppegiel- nuppegiel+N+DefSgNomCmp+RCmpnd

nuppegiela- nuppegiela- nuppegiela+N+DefSgNomCmp+RCmpnd

albbas commented 11 years ago

Comment 7210

Date: 2012-10-29 12:24:47 +0100 From: Thomas Omma <>

smj $ svn ci -m "removed ánársámegiel prefix, it was redundant" src/morphology/stems
Sending src/morphology/stems/nouns.lexc
Transmitting file data .
Committed revision 64613.

albbas commented 11 years ago

Comment 7219

Date: 2012-10-29 21:17:21 +0100 From: Sjur Nørstebø Moshagen <>

(In reply to comment #2)

> HFST - lemmas with spaces:
>
> club music football teama world club
>
> This is an HFST bug, and can be IGNORED for now. HFST does not accept input strings with spaces in them, at least not in the same way as Xerox' lookup utility does. It needs to be fixed by the HFST team.

This is a known "feature": it requires that the space be declared as an alphabet character in the twolc file. Solved in svn revision 64618.

After Thomas' changes, the only problem words left are:

energi kondom nuppegiel nuppegiela

(both Xerox and hfst). These are left for Inga.

And most importantly: Xerox and HFST now behave exactly the same, in ALL tests.

albbas commented 11 years ago

Comment 7229

Date: 2012-10-30 11:35:40 +0100 From: Inga Lill Sigga Mikkelsen <>

energi kondom nuppegiel nuppegiela

energi and kondom had the Use/Sub tag. Compound loanwords quote the first word, so there is no need to sub-mark these words. Entries like "alkohol" are not sub-marked, so energi and kondom shouldn't be either.

nuppegiel and nuppegiela are genitive forms, and the test only accepts nominative entries. I marked these entries with !, because we don't want the entry "nuppegiella".
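Marking an entry with ! works because lexc treats everything after ! on a line as a comment, so the entry is no longer compiled and no longer shows up as a lemma to test. A minimal sketch in Python of how a lemma list could be pulled from a lexc file while honoring such comments follows. It is not the actual test bench: the lexicon name `NounRoot`, the continuation class `N_HYPOTHETICAL` and the function `extract_lemmas` are made up for illustration, and the parser is simplified (it ignores escaped characters such as `%!` and `% `).

```python
import re

# Matches a simple lexc entry such as "lemma:stem CONTLEX ;" or "lemma CONTLEX ;".
# Simplification: escaped characters (e.g. "%!" or "% ") are not handled.
ENTRY = re.compile(r"^\s*(?P<lemma>[^\s:!]+)(?::\S*)?\s+\S+\s*;")

def extract_lemmas(lexc_text):
    """Return the lemmas of all entries that are not commented out with '!'."""
    lemmas = []
    for line in lexc_text.splitlines():
        # Drop everything from the first "!" on: lexc treats it as a comment.
        code = line.split("!", 1)[0]
        match = ENTRY.match(code)
        if match:
            lemmas.append(match.group("lemma"))
    return lemmas

# Hypothetical lexicon fragment; names and continuation classes are invented.
sample = """\
LEXICON NounRoot
alkohol:alkohol N_HYPOTHETICAL ;
! nuppegiel:nuppegiel N_HYPOTHETICAL ;
energi:energi N_HYPOTHETICAL ; ! trailing comment
"""
print(extract_lemmas(sample))  # ['alkohol', 'energi']
```

The commented-out nuppegiel line is skipped, which is the effect the ! marking is meant to have on the lemma test.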