Open albbas opened 4 years ago
Date: 2020-08-04 16:55:36 +0200
From: Linda Wiechetek
We have a problem with the tokenization of multiwords here. In the following sentence, we want "nu ahte" to be analyzed as a multiword. Instead, "maid nu" is analyzed as a multiword (perhaps because it is the first possible combination? I am not sure of the reason). CG cannot do anything here, since we do not get any alternative readings to disambiguate between. We need to resolve this.
Min mielas orro maid nu ahte sáhtášii leat vel buoret oktavuohta daid mielderuhtadeddjiiguin geat juo leat fárus muhtun eaŋkilprošeavttain.
I thought there was another bug about tokenization, but I can't find it in the Bugzilla hierarchy. I find our categories a bit random; it is hard to know where tokenization should go, for example (hfst, lexico, sme??).
Date: 2020-08-10 07:03:12 +0200
From: Sjur Nørstebø Moshagen
The basic tokenisation problem here is one of ambiguous and overlapping MWEs ('maid nu' vs 'nu ahte'), in combination with the logic being used.
The tokeniser always returns the longest match, and when explicitly told to, also returns a sequence of shorter matches. The problem here is that there is no longest match that covers the sequence 'maid nu ahte' – it just is not a lexical unit in any way.
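To make the overlap concrete, here is a minimal sketch of left-to-right longest-match tokenisation. It is a toy model written for this thread, not the actual tokeniser logic; the function name `greedy_tokenise` and the two-entry MWE set are assumptions for illustration only.

```python
def greedy_tokenise(words, mwes):
    """Toy left-to-right longest-match tokenisation over a list of words.

    `mwes` is a set of space-separated multiword expressions (hypothetical
    stand-in for the real lexicon).
    """
    max_len = max((len(m.split()) for m in mwes), default=1)
    tokens, i = [], 0
    while i < len(words):
        # Try the longest candidate span first, fall back to a single word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in mwes:
                tokens.append(candidate)
                i += n
                break
    return tokens


mwes = {"maid nu", "nu ahte"}  # the two overlapping MWEs from this bug
words = "orro maid nu ahte sáhtášii".split()
print(greedy_tokenise(words, mwes))
# ['orro', 'maid nu', 'ahte', 'sáhtášii']
```

Once 'maid nu' has been consumed, 'nu' is no longer available to start 'nu ahte', so the wanted reading never reaches the disambiguator.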
There is no generic solution to this that won't also cause an explosion in ambiguity and thus endless disambiguation problems. The easiest workaround is to lexicalise the string 'maid nu ahte', tag it as +Err/Lex (with a comment that it is needed due to tokenisation), and add explicit optional split marks between the words. This forces the tokeniser to retokenise the string and thus forward the information needed for proper MWE disambiguation later on.
This lexicalisation exercise has to be repeated for each case we find.
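As a rough illustration of what the workaround is meant to achieve: once the whole span 'maid nu ahte' is matched and may be split at every word boundary, each segmentation into known units can be passed on. The sketch below only models that effect. The helper `segmentations` is hypothetical, it leaves out the +Err/Lex reading of the whole span, and it does not reproduce how the real tokeniser handles split marks.

```python
def segmentations(words, mwes):
    """Return every way to split `words` into single words and known MWEs."""
    if not words:
        return [[]]
    results = []
    for n in range(1, len(words) + 1):
        head = " ".join(words[:n])
        # A single word is always a valid unit; longer spans must be known MWEs.
        if n == 1 or head in mwes:
            for rest in segmentations(words[n:], mwes):
                results.append([head] + rest)
    return results


mwes = {"maid nu", "nu ahte"}  # hypothetical stand-in for the lexicon
for seg in segmentations("maid nu ahte".split(), mwes):
    print(seg)
# ['maid', 'nu', 'ahte']
# ['maid', 'nu ahte']
# ['maid nu', 'ahte']
```

With all three segmentations available, a later disambiguation step has the 'nu ahte' reading to choose, which the plain longest-match output never offers.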
Date: 2020-08-31 09:53:59 +0200
From: Linda Wiechetek
If this is going to be our strategy, the lexicon can easily explode too; maybe not for this case, but there will be other cases. Also, the error tags tend to get in the way of disambiguation, so we add a potential error source here too.
This issue was created automatically with bugzilla2github
Bugzilla Bug 2668
Date: 2020-08-04T16:55:36+02:00 From: Linda Wiechetek
To: Tommi A Pirinen
CC: lene.antonsen, sjur.n.moshagen, thomas.omma, trond.trosterud, unhammer+apertium
Last updated: 2020-08-31T09:53:59+02:00