giellalt / lang-sme

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Northern Sami language
https://giellalt.uit.no
GNU General Public License v3.0

Disable generation of optional forms #108

Closed albbas closed 14 years ago

albbas commented 14 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 848

Date: 2010-06-12T21:15:09+02:00 From: Francis Tyers <> To: Thomas Omma <> CC: sjur.n.moshagen, trond.trosterud, @unhammer@fsfe.org

Last updated: 2010-06-13T15:40:26+02:00

albbas commented 14 years ago

Comment 3337

Date: 2010-06-12 21:15:09 +0200 From: Francis Tyers <>

It would be nice for the purposes of MT from Finnish→North Sámi to be able to reduce the number of optional forms created in generation to one.

$ echo "200 vuoden historia" | apertium -d . fin-sme-chunker
^200$ ^jahki$ ^historjá$^.$

$ echo "200 vuoden historia" | apertium -d . fin-sme
200/200e/200d/200b/200š/200c/200:/200:e/200:d/200:b/200:š/200:c #jahki historjá

==============================================================================

$ echo "200+Num+Sg+Nom" | dsme
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
200+Num+Sg+Nom 200
200+Num+Sg+Nom 200-
200+Num+Sg+Nom 200-b
200+Num+Sg+Nom 200-c
200+Num+Sg+Nom 200-d
200+Num+Sg+Nom 200-e
200+Num+Sg+Nom 200-š
200+Num+Sg+Nom 200b
200+Num+Sg+Nom 200c
200+Num+Sg+Nom 200d
200+Num+Sg+Nom 200e
200+Num+Sg+Nom 200š
200+Num+Sg+Nom 200'
200+Num+Sg+Nom 200'b
200+Num+Sg+Nom 200'c
200+Num+Sg+Nom 200'd
200+Num+Sg+Nom 200'e
200+Num+Sg+Nom 200'š
200+Num+Sg+Nom 200:
200+Num+Sg+Nom 200:b
200+Num+Sg+Nom 200:c
200+Num+Sg+Nom 200:d
200+Num+Sg+Nom 200:e
200+Num+Sg+Nom 200:š

==============================================================================

The ideal fix would be something we can just grep out of the lexc file. See example in:

http://apertium.svn.sourceforge.net/svnroot/apertium/incubator/apertium-sme-fin/dev/update-lexc.sh
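The "grep it out of the lexc file" idea can be sketched roughly as follows. This is not the content of update-lexc.sh, which is not reproduced in this thread; it is a minimal illustration that assumes each lexc entry sits on a single line and that the tag appears unescaped, as it does in the examples here. The lexicon name ARABICCASES and the function name are made up for the example.

```python
# Tags treated as unwanted in generation; an assumption based on this thread.
UNWANTED_TAGS = ("+Use/Sub",)


def drop_tagged_entries(lexc_text, tags=UNWANTED_TAGS):
    """Return lexc_text without the entry lines that mention any of the tags.

    Lines starting with '!' are lexc comments and are kept untouched.
    """
    kept = []
    for line in lexc_text.splitlines():
        is_comment = line.lstrip().startswith("!")
        if not is_comment and any(tag in line for tag in tags):
            continue  # same effect as `grep -v` on this line
        kept.append(line)
    return "\n".join(kept)


# Hypothetical lexc fragment; only the second and third entries carry +Use/Sub.
sample = """\
LEXICON ARABICCASES
+Sg+Nom: # ;
+Sg+Nom+Use/Sub+Use/Circ:f # ; ! s. 123f.
+Use/Sub+Use/Circ+Use/NG: ARABICCASECOLL ;
"""

filtered = drop_tagged_entries(sample)
print(filtered)
```

A line-based filter like this breaks down if an entry spans several lines or if a removed entry was the only path into a continuation lexicon, so it is a quick workaround rather than a substitute for filtering in the transducer itself.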

albbas commented 14 years ago

Comment 3339

Date: 2010-06-12 22:19:43 +0200 From: Sjur Nørstebø Moshagen <>

Thomas, can you have a look? Most of these forms look like +Use/Sub to me (all ending in -X and 'X), and I don't understand why a number compounded with a letter would be analysed as the nominative of that number. It all looks very buggy to me.

Also the single number ending in a hyphen should not have been analysed as a nominative, but as a compound form.

albbas commented 14 years ago

Comment 3340

Date: 2010-06-12 22:22:18 +0200 From: Sjur Nørstebø Moshagen <>

Assigning it, and raising the importance of it, as this bug probably affects several components using the sme transducer.

albbas commented 14 years ago

Comment 3342

Date: 2010-06-13 01:25:48 +0200 From: Francis Tyers <>

From bug #849:

~/gtsvn/gt$echo "200+Num+Sg+Nom" | dsme
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
200+Num+Sg+Nom 200

Fixed as of version 32426. The problem was an identical upper side for several lower sides. It should now be OK in sme (at least there is no counterexample). The issue is still open in smj.
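The diagnosis "identical upper side for several lower sides" can be checked mechanically on generator output. The sketch below is only an illustration in Python, not part of the actual toolchain; it takes (upper, lower) pairs and reports every upper side that maps to more than one lower side. The function name is hypothetical, and the sample pairs are a subset of the dsme output quoted at the top of this bug.

```python
from collections import defaultdict


def ambiguous_upper_sides(pairs):
    """Map each upper side to its distinct lower sides, keeping only those with several."""
    lowers = defaultdict(list)
    for upper, lower in pairs:
        if lower not in lowers[upper]:
            lowers[upper].append(lower)
    return {upper: forms for upper, forms in lowers.items() if len(forms) > 1}


# A few pairs copied from the dsme output at the top of this bug.
pairs = [
    ("200+Num+Sg+Nom", "200"),
    ("200+Num+Sg+Nom", "200-"),
    ("200+Num+Sg+Nom", "200b"),
    ("200+Num+Sg+Nom", "200:š"),
]

report = ambiguous_upper_sides(pairs)
for upper, forms in report.items():
    print(upper, "->", ", ".join(forms))
```

After the fix, the same check on the single remaining pair returns an empty report.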

============================================================================

Seems to work nicely now, thanks!

$ echo "200 vuoden historia " | apertium -d . fin-sme
200 jagi historjá

albbas commented 14 years ago

Comment 3345

Date: 2010-06-13 09:02:39 +0200 From: Sjur Nørstebø Moshagen <>

Just a final note:

Almost all, if not all, of the unwanted forms are marked +Use/Sub in the lexicon, identifying them as substandard forms. AFAICU you don't want substandard forms in generation, so one measure to take in any case is to remove these forms from the generating transducers. I would assume this to be the default behaviour.

If this is the case, then a line like the following is actually redundant:

+Use/Sub+Use/Circ+Use/NG: ARABICCASECOLL ; ! This is the 1984s case.

since removing all +Use/Sub strings from the generating transducer would remove it. That is, +Use/NG would then only be needed for cases that are within the official norm.

albbas commented 14 years ago

Comment 3346

Date: 2010-06-13 13:10:37 +0200 From: Francis Tyers <>

(In reply to comment #4)

> Just a final note:
>
> Almost all, if not all, of the unwanted forms are marked +Use/Sub in the lexicon, identifying them as substandard forms. AFAICU you don't want substandard forms in generation, so one measure to take in any case is to remove these forms from the generating transducers. I would assume this to be the default behaviour.
>
> If this is the case, then a line like the following is actually redundant:
>
> +Use/Sub+Use/Circ+Use/NG: ARABICCASECOLL ; ! This is the 1984s case.
>
> since removing all +Use/Sub strings from the generating transducer would remove it. That is, +Use/NG would then only be needed for cases that are within the official norm.

I added this rule (thanks Tommi and Unhammer!) and it seems to have resolved the problem.

!%+Dial/%-KJ
Uselesspaths = %+Use/NG %+Use/Sub %+Use/NG %+Dial/%-GG %+Dial/%-GS ;

"Try again"
Uselesspaths:0 /<= _ ;

albbas commented 14 years ago

Comment 3347

Date: 2010-06-13 14:17:13 +0200 From: Trond Trosterud <>

Sjur wrote: "I don't understand why a number compounded with a letter would be analysed as the nominative of that number."

That is easy to explain: We had entries like:

+Sg+Nom+Use/Sub+Use/Circ:f # ; ! s. 123f. ! !
+Sg+Nom+Use/Sub+Use/Circ:ff # ; ! s. 123ff. ! !

They are now changed to:

f+Sg+Nom+Use/Sub+Use/Circ:f # ; ! s. 123f. ! !
ff+Sg+Nom+Use/Sub+Use/Circ:ff # ; ! s. 123ff. ! !
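The effect of that change can be shown with a small Python sketch. It is a deliberately simplified model of how a lexc `upper:lower` entry attaches to a preceding stem: it ignores continuation classes, escaped colons and any tags contributed earlier in the path (such as +Num), and the helper names are made up for the illustration.

```python
def split_entry(entry):
    """Split a simplified lexc 'upper:lower' string at the first colon."""
    upper, _, lower = entry.partition(":")
    return upper, lower


def attach(stem, entry):
    """Concatenate a stem with an entry on both the upper and the lower side."""
    upper, lower = split_entry(entry)
    return stem + upper, stem + lower


# Old entry: the letter exists only on the lower (surface) side.
old = attach("123", "+Sg+Nom+Use/Sub+Use/Circ:f")
# New entry: the letter is present on the upper (analysis) side as well.
new = attach("123", "f+Sg+Nom+Use/Sub+Use/Circ:f")

print(old)  # upper side is the bare number, surface is 123f
print(new)  # upper side now carries the f too
```

With the old entry, the analysis of 123f is indistinguishable from that of 123 apart from the usage tags, so generating from the analysis of the bare number could also produce the lettered form.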

The problem never surfaced until we started really generating stuff, like we do with the MT now.

I'll keep the bug open until it has been fixed for the other languages.

albbas commented 14 years ago

Comment 3348

Date: 2010-06-13 15:40:26 +0200 From: Trond Trosterud <>

echo "200+Num+Sg+Nom" | dsmj
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
200+Num+Sg+Nom 200

~/gtsvn$see gt/sma/src/numeral-sma-lex.txt
~/gtsvn$echo "200+Num+Sg+Nom" | dsma
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
200+Num+Sg+Nom 200

~/gtsvn$echo "200+Num+Sg+Nom" | dfao
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
200+Num+Sg+Nom 200+Num+Sg+Nom +?

~/gtsvn$echo "200+Num+Sg+Nom" | dsmn
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
200+Num+Sg+Nom 200+Num+Sg+Nom +?

So, sme, smj, sma are ok here. Let the rest come when needed.
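For checking more languages later, the result lines above can be scanned for the `+?` marker, which here accompanies inputs that produced no form. The following Python sketch is an illustration only: the language codes fao and smn are inferred from the command names dfao and dsmn, and the function name is hypothetical.

```python
def failed_inputs(lookup_lines):
    """Return the inputs whose result line ends in the '+?' marker."""
    failed = []
    for line in lookup_lines:
        fields = line.split()
        if fields and fields[-1] == "+?":
            failed.append(fields[0])
    return failed


# Result lines copied from the runs above, keyed by (assumed) language code.
results = {
    "smj": ["200+Num+Sg+Nom 200"],
    "sma": ["200+Num+Sg+Nom 200"],
    "fao": ["200+Num+Sg+Nom 200+Num+Sg+Nom +?"],
    "smn": ["200+Num+Sg+Nom 200+Num+Sg+Nom +?"],
}

not_generating = [lang for lang, lines in results.items() if failed_inputs(lines)]
print(not_generating)
```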