giellalt / lang-sme

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Northern Sami language
https://giellalt.uit.no
GNU General Public License v3.0

tokenization of multi-word when there are several options of multi-words (Bugzilla Bug 2668) #450

Open · albbas opened this issue 4 years ago

albbas commented 4 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 2668

Date: 2020-08-04T16:55:36+02:00
From: Linda Wiechetek <>
To: Tommi A Pirinen <>
CC: lene.antonsen, sjur.n.moshagen, thomas.omma, trond.trosterud, unhammer+apertium

Last updated: 2020-08-31T09:53:59+02:00

albbas commented 4 years ago

Comment 13958

Date: 2020-08-04 16:55:36 +0200
From: Linda Wiechetek <>

We have a problem with the tokenization of multiwords here. In the following sentence, we want "nu ahte" to be analyzed as a multiword. Instead, "maid nu" is analyzed as a multiword (since it is the first possible combination?? Not sure about the reason). CG can't do anything here, as we do not get any alternatives to disambiguate between. We need to resolve this.

Min mielas orro maid nu ahte sáhtášii leat vel buoret oktavuohta daid mielderuhtadeddjiiguin geat juo leat fárus muhtun eaŋkilprošeavttain.

"" "mii nu" MWE Pron Indef Sg Acc : "" "ahte" CC "ahte" CC Err/Orth "ahte" CS "ahte" CS Err/Orth Err/Spellrelax "áhtat" V TV Ind Prs Du1 Err/Spellrelax "áhtat" V TV Ind Prt Pl3 Err/Spellrelax "áhtti" N Sem/Dummytag Sg Gen Allegro Err/Spellrelax : "<sáhtášii>" "sáhttit" V IV Cond Prs Err/Orth Sg3 SUBSTITUTE:2353 SUBSTITUTE:3979 SUBSTITUTE:4059 "sáhttit" V IV Cond Prs Sg3 SUBSTITUTE:2353 SUBSTITUTE:3979 SUBSTITUTE:4059

I thought there was another bug about tokenization, but I can't find it in the Bugzilla hierarchy. I find our categories a bit random; it is hard to know where tokenization issues should go, e.g. (hfst, lexico, sme??).

albbas commented 4 years ago

Comment 13959

Date: 2020-08-10 07:03:12 +0200
From: Sjur Nørstebø Moshagen <>

The basic tokenisation problem here is one of ambiguous and overlapping MWEs, 'maid nu' vs 'nu ahte', in combination with the logic being used.

The tokeniser always returns the longest match and, when explicitly told to, also returns a sequence of shorter matches. The problem here is that there is no longest match covering the sequence 'maid nu ahte' – it simply is not a lexical unit in any way.

There is no generic solution to this that won't also cause an explosion in ambiguity, and thus endless disambiguation problems. The easiest workaround is to lexicalise the string 'maid nu ahte', tag it as +Err/Lex (with a comment that it is needed due to tokenisation), and add explicit optional split marks between each word, to force the tokeniser to retokenise the string and thus forward the information needed for proper MWE disambiguation later on.
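For concreteness, here is a minimal, hypothetical lexc sketch of what such a lexicalised entry could look like. The lexicon name, POS tag and continuation class are placeholders, and the optional split marks are only described in the comments, since the exact symbols used for them in lang-sme are not shown in this thread:

! Hypothetical sketch only, not an actual lang-sme entry.
! The idea: give the tokeniser a longest match spanning all three words,
! so that "maid nu ahte" is captured as one unit and can be retokenised later.
! +Err/Lex marks the entry as existing purely for tokenisation purposes.
! The POS tag and continuation are placeholders; the real entry would carry
! the analyses of the component words, with optional split marks between them
! following the lang-sme conventions for retokenisable multiwords, and all
! tags would have to be declared under Multichar_Symbols.
LEXICON Pcle
maid% nu% ahte+Pcle+Err/Lex:maid% nu% ahte # ;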

This lexicalisation exercise has to be repeated for each case we find.

albbas commented 4 years ago

Comment 13963

Date: 2020-08-31 09:53:59 +0200
From: Linda Wiechetek <>

If this is going to be our strategy, the lexicon can easily explode too; maybe not for this case, but there will be other cases. Also, the error tags tend to get in the way of disambiguation, so we add a potential error source here too.