giellalt / lang-smj

Finite state and Constraint Grammar based analysers and proofing tools + language resources for Lule Sámi
https://giellalt.uit.no
GNU General Public License v3.0

TTS: 200 as text is not generated in the accusative #35

Closed: snomos closed 1 year ago

snomos commented 1 year ago

In this sentence:

Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.

200 is disambiguated to the accusative:

echo 'Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.' | \
  ./tools/tts/modes/trace-smj-normaliser8-cg.mode
[...]
"<200>"
        "200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> SELECT:1387:Arab SELECT:2936 MAP:1357:>nNum @>N #9->10 SETPARENT:866:SetModToN
;       "200" Num Arab Err/Orth Ess "200>"MIDTAPE <W:0.0> SELECT:1387:Arab SELECT:2936
;       "200" Num Arab Err/Orth Sg Com "200>"MIDTAPE <W:0.0> SELECT:1387:Arab SELECT:2936
;       "200" Num Arab Sg Ela Attr "200"MIDTAPE <W:0.0> SELECT:1387:Arab IFF:3194
;       "200" Num Arab Sg Gen "200>"MIDTAPE <W:0.0> SELECT:1387:Arab SELECT:2936
;       "200" Num Arab Sg Ill Attr "200"MIDTAPE <W:0.0> SELECT:1387:Arab IFF:3194
;       "200" Num Arab Sg Ine Attr "200"MIDTAPE <W:0.0> SELECT:1387:Arab IFF:3194
;       "200" Num Arab Sg Nom "200>"MIDTAPE <W:0.0> SELECT:1387:Arab REMOVE:3396
;       "200" Num Sem/ID "200"MIDTAPE <W:0.0> SELECT:1387:Arab

But the generated word form is not in the accusative; in the phon element it is in the nominative:

"<200>"
    "guoktatjuodát" A Ord Sg Nom "guoktatjuodát"phon
        "200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> @>N #9->10
    "guoktatjuohte" Num Sg Nom "guoktatjuohte"phon
        "200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> @>N #9->10

The accusative is:

echo guoktatjuodát+A+Ord+Sg+Acc | hfst-lookup -q tools/tts/generator-gt-norm.hfstol
guoktatjuodát+A+Ord+Sg+Acc  guoktjuodádav   0.000000
guoktatjuodát+A+Ord+Sg+Acc  guoktatjuodádav 0.000000
guoktatjuodát+A+Ord+Sg+Acc  guoktetjuodádav 0.000000

As described here, the idea is that as the last step of the normalisation we generate the correct form based on the tags of the original word. Do we do that, or are there other problems?

flammie commented 1 year ago

As the regeneration step stands now, it tries to generate guoktatjuodát+Num+Arab+Sg+Acc. Is it missing some non-trivial logic for getting Adj+Ord from Num+Arab?

snomos commented 1 year ago

Ah - I had missed that it was not the same part of speech. That is of course a different story. Before we go further, @ilm024: what is the correct word form in this context? What should we generate? Is it one of these forms?

echo guoktatjuohte+Num+Sg+Acc | hfst-lookup -q tools/tts/generator-gt-norm.hfstol                                            
guoktatjuohte+Num+Sg+Acc    guoktjuodev 0.000000
guoktatjuohte+Num+Sg+Acc    guoktatjuodev   0.000000
guoktatjuohte+Num+Sg+Acc    guoktetjuodev   0.000000
guoktatjuohte+Num+Sg+Acc    guovtetjuodev   0.000000

?

snomos commented 1 year ago

Looking at this example once more, it is actually straightforward given the algorithm we have taken as our basis. The algorithm is (copied from the document I linked to further up):

  1. generate new lemma using normaliser FST
  2. Take the original analysis, and remove every prefixed tag (prefixed tags are those of the form Abcd/xxx, where Abcd/ is the tag prefix) + the target tag (ABBR in this case): Area/NO N Sem/Hum ABBR Gram/TAbbr Sg Acc → N Sg Acc
  3. Use the new lemma and the new analysis string to generate the corresponding surface form: dåktår N Sg Acc → dåktårav
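
The tag stripping in step 2 can be sketched in a few lines of Python. This is only an illustration, not the libdivvun code: the function names `strip_tags` and `generator_input` are made up here, and the sketch assumes that a prefixed tag is any tag containing a `/` and that the generator takes `lemma+Tag+Tag` strings, as in the hfst-lookup calls earlier in the thread.

```python
def strip_tags(analysis, target_tags):
    """Drop prefixed tags (assumed: any tag containing '/') and the target tags."""
    return [
        tag for tag in analysis.split()
        if "/" not in tag and tag not in target_tags
    ]


def generator_input(new_lemma, analysis, target_tags):
    """Build the lemma+Tag+Tag string that is handed to the generator FST."""
    return "+".join([new_lemma, *strip_tags(analysis, target_tags)])


# The abbreviation example from the algorithm description.
print(generator_input("dåktår", "Area/NO N Sem/Hum ABBR Gram/TAbbr Sg Acc", {"ABBR"}))
# dåktår+N+Sg+Acc
```

The resulting string is what would be piped to `hfst-lookup` with the generator FST to obtain the surface form.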

In the example with 200 it looks like this:

"<200>"
    "guoktatjuodát" A Ord Sg Nom "guoktatjuodát"phon
        "200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> @>N #9->10
    "guoktatjuohte" Num Sg Nom "guoktatjuohte"phon
        "200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> @>N #9->10

Of these two analyses the first is irrelevant, because the part of speech does not match - we have Num in and should have Num out, so we can disregard the A Ord analysis. Then we remove the target tag Arab + all prefixed tags from the original analysis (Num Arab Err/Orth Sg Acc), and we are left with Num Sg Acc. This is exactly what we need to generate the word form we want (if it is the one we want; @ilm024 has to answer that 🙂 ).
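
A minimal Python sketch of the part-of-speech check, under the assumption (made here for illustration only) that the POS is the first tag of an analysis. The helper name `same_pos` is hypothetical; the candidate data is taken from the output above.

```python
def same_pos(candidates, original_analysis):
    """Keep only normalised candidates whose POS matches the original reading.

    Assumption: the part of speech is the first tag in each analysis string.
    """
    wanted = original_analysis.split()[0]
    return [
        (lemma, tags) for lemma, tags in candidates
        if tags.split()[0] == wanted
    ]


candidates = [
    ("guoktatjuodát", "A Ord Sg Nom"),
    ("guoktatjuohte", "Num Sg Nom"),
]
print(same_pos(candidates, "Num Arab Err/Orth Sg Acc"))
# [('guoktatjuohte', 'Num Sg Nom')]
```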

flammie commented 1 year ago

Yes, so right now Num was used as the tag for normalisation, but if it were Arab, and we can always remove it, that could work fine.

snomos commented 1 year ago

Hm, we cannot use Num as the trigger for normalisation: Num is a tag that is also used for numerals written out as text, which therefore do not need normalisation. I will change it to Arab, so that we can remove Arab because Arab was the tag that triggered the normalisation. The logic must be that the tag that triggers normalisation is the one we want removed after normalisation.
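
The rule "the trigger tag is the tag that gets removed" can be sketched as follows. This is a hedged Python illustration, not the normaliser's implementation: `TRIGGER_TAGS`, `trigger_tag` and `tags_after_normalisation` are names invented for the example.

```python
# Hypothetical trigger set for illustration; Num is deliberately not in it.
TRIGGER_TAGS = ("ABBR", "Arab")


def trigger_tag(analysis, triggers=TRIGGER_TAGS):
    """Return the first trigger tag found in the analysis, or None."""
    for tag in analysis.split():
        if tag in triggers:
            return tag
    return None


def tags_after_normalisation(analysis, triggers=TRIGGER_TAGS):
    """Remove exactly the tag that triggered normalisation.

    Returns None when nothing triggers, e.g. a numeral already written as text.
    """
    trigger = trigger_tag(analysis, triggers)
    if trigger is None:
        return None
    return [tag for tag in analysis.split() if tag != trigger]


print(tags_after_normalisation("Num Arab Sg Gen"))  # ['Num', 'Sg', 'Gen']
print(tags_after_normalisation("Num Sg Gen"))       # None
```

With Num as the trigger, the second reading would be normalised needlessly and would lose its part of speech, which is the problem described above.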

snomos commented 1 year ago

I have now changed Num and Ord to Arab in pipespec.xml.in 🙂

lynnda-hill commented 1 year ago

Fixed the disambiguator so that Err/Orth is no longer selected. Can I close the bug?

The result is now as follows (i.e. Gen, since jage is also disambiguated to Gen):

"<200>"
        "200" Num Arab Sg Gen "200>"MIDTAPE <W:0.0> SELECT:1387:Arab
;       "200" Num Arab Err/Orth Ess "200>"MIDTAPE <W:0.0> SELECT:1387:Arab REMOVE:4021:errsub
;       "200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> SELECT:1387:Arab REMOVE:4021:errsub
;       "200" Num Arab Err/Orth Sg Com "200>"MIDTAPE <W:0.0> SELECT:1387:Arab REMOVE:4021:errsub
;       "200" Num Arab Sg Ela Attr "200"MIDTAPE <W:0.0> SELECT:1387:Arab IFF:3194
;       "200" Num Arab Sg Ill Attr "200"MIDTAPE <W:0.0> SELECT:1387:Arab IFF:3194
;       "200" Num Arab Sg Ine Attr "200"MIDTAPE <W:0.0> SELECT:1387:Arab IFF:3194
;       "200" Num Arab Sg Nom "200>"MIDTAPE <W:0.0> SELECT:1387:Arab REMOVE:3396
;       "200" Num Sem/ID "200"MIDTAPE <W:0.0> SELECT:1387:Arab
: 
"<jage>"
        "jahke" N <smj> Sem/Time Sg Gen "jahke>Q1"MIDTAPE <W:0.0> SELECT:2523 SUBSTITUTE:4028
;       "jahke" N Sem/Time Pl Nom "jahke>Q1"MIDTAPE <W:0.0> SELECT:2523
: 
"<maŋŋela>"
        "maŋŋel" N <smj> Sem/Dummytag Sg Gen "maŋŋela"MIDTAPE <W:0.0> SUBSTITUTE:4028
        "maŋŋela" Adv <smj> "maŋŋela>"MIDTAPE <W:0.0> SUBSTITUTE:4029
        "maŋŋela" Po <smj> "maŋŋela>"MIDTAPE <W:0.0> SUBSTITUTE:4033
        "maŋŋela" Pr <smj> "maŋŋela>"MIDTAPE <W:0.0> SUBSTITUTE:4034

snomos commented 1 year ago

The disambiguation is now in order, but there is still a problem with the normalisation. What goes into the normaliser is this, in the genitive, as Linda says (with the command echo 'Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.' | ./tools/tts/modes/trace-smj-normaliser8-cg.mode):

"<200>"
    "200" Num Arab Sg Gen "200>"MIDTAPE <W:0.0> SELECT:1388:Arab MAP:1357:>nNum @>N #9->10 SETPARENT:866:SetModToN
;   "200" Num Arab Err/Orth Ess "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;   "200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;   "200" Num Arab Err/Orth Sg Com "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;   "200" Num Arab Sg Ela Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;   "200" Num Arab Sg Ill Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;   "200" Num Arab Sg Ine Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;   "200" Num Arab Sg Nom "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:3397
;   "200" Num Sem/ID "200"MIDTAPE <W:0.0> SELECT:1388:Arab

But after the normalisation it is still nominative, and we have an extra A Ord analysis:

"<200>"
    "guoktatjuodát" A Ord Sg Nom "guoktatjuodát"phon "200"oldlemma
    "guoktatjuohte" Num Sg Nom "guoktatjuohte"phon "200"oldlemma
;   "200" Num Arab Err/Orth Ess "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;   "200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;   "200" Num Arab Err/Orth Sg Com "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;   "200" Num Arab Sg Ela Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;   "200" Num Arab Sg Ill Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;   "200" Num Arab Sg Ine Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;   "200" Num Arab Sg Nom "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:3397
;   "200" Num Sem/ID "200"MIDTAPE <W:0.0> SELECT:1388:Arab

flammie commented 1 year ago

It has become a bit complicated, so I printed all versions with full trace in today's version:

echo 'Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.' | ~/github/giellalt/lang-smj/tools/tts/modes/smj-normaliser8-cg.mode |  ~/github/divvun/libdivvun/src/divvun-normaliser -a  '/home/flammie/github/giellalt/lang-smj/tools/tts/analyser-gt-norm.hfstol' -g  '/home/flammie/github/giellalt/lang-smj/tools/tts/generator-gt-norm.hfstol' -n  '/home/flammie/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol' -t ABBR -t Arab -t Ord -t Symbol -v
Being verbose.
Surface analyser set to: /home/flammie/github/giellalt/lang-smj/tools/tts/analyser-gt-norm.hfstol
Normaliser set to: /home/flammie/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol
Generator set to: /home/flammie/github/giellalt/lang-smj/tools/tts/generator-gt-norm.hfstol
Deep analyser set to: 
Tags set to: ABBR Arab Ord Symbol 
Reading files: 
* /home/flammie/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol
* /home/flammie/github/giellalt/lang-smj/tools/tts/generator-gt-norm.hfstol
* /home/flammie/github/giellalt/lang-smj/tools/tts/analyser-gt-norm.hfstol
* 
expanding tags: 
New surface form: Dát
"<Dát>"
Using lemma: dát
No expansion tags in
    "dát" Pron Dem Sg Ela Attr <W:0.0> @>N #1->2
Probably not cg formatted stuff: 
: 
New surface form: máhtto
"<máhtto>"
Using lemma: máhtto
No expansion tags in
    "máhtto" N Sem/Prod-cogn Sg Nom <W:0.0> @SUBJ> #2->0
Probably not cg formatted stuff: 
: 
New surface form: de
"<de>"
Using lemma: de
No expansion tags in
    "de" Adv <W:0.0> @ADVL> #3->5
Probably not cg formatted stuff: 
: 
New surface form: mak
"<mak>"
Using lemma: mak
No expansion tags in
    "mak" Adv <W:0.0> @ADVL> #4->5
Probably not cg formatted stuff: 
: 
New surface form: jåvsåj
"<jåvsåj>"
Using lemma: jåksåt
No expansion tags in
    "jåksåt" <mv> V TV Ind Prt Sg3 <W:0.0> @FMV #5->0
Probably not cg formatted stuff: 
: 
New surface form: Finnmárko
"<Finnmárko>"
Using lemma: Finnmárkko
No expansion tags in
    "Finnmárkko" OLang/NOB N Prop Sem/Plc Sg Gen <W:0.0> @>N #6->7
Probably not cg formatted stuff: 
: 
New surface form: sámijda
"<sámijda>"
Using lemma: sábme
No expansion tags in
    "sábme" N Sem/Hum_Lang Pl Ill <W:0.0> @<ADVL #7->5
Probably not cg formatted stuff: 
: 
New surface form: suláj
"<suláj>"
Using lemma: sulla
No expansion tags in
    "sulla" N Sem/Dummytag Pl Com <W:0.0> @<ADVL #8->5
Probably not cg formatted stuff: 
: 
New surface form: 200
"<200>"
Expanding because of Arab
Using lemma: 200
1. looking up normaliser
2.a Using normalised form: guoktatjuodát
2.b regenerating lookup: guoktatjuodát+Num+Sg+Gen
3. Couldn't regenerate, reanalysing lemma: guoktatjuodát
    "guoktatjuodát" A Ord Sg Nom "guoktatjuodát"phon "200"oldlemma
2.a Using normalised form: guoktatjuohte
2.b regenerating lookup: guoktatjuohte+Num+Sg+Gen
3. reanalysing: guoktjuode
    "guoktatjuohte" Num Pl Nom "guoktjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Gen "guoktjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Ill Attr "guoktjuode"phon "200"oldlemma
3. reanalysing: guoktatjuode
    "guoktatjuohte" Num Pl Nom "guoktatjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Gen "guoktatjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Ill Attr "guoktatjuode"phon "200"oldlemma
3. reanalysing: guoktetjuode
    "guoktatjuohte" Num Pl Nom "guoktetjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Gen "guoktetjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Ill Attr "guoktetjuode"phon "200"oldlemma
3. reanalysing: guovtetjuode
    "guoktatjuohte" Num Attr "guovtetjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Gen "guovtetjuode"phon "200"oldlemma
Probably not cg formatted stuff: 
: 
New surface form: jage
"<jage>"
Using lemma: jahke
No expansion tags in
    "jahke" N Sem/Time Sg Gen <W:0.0> @<ADVL #10->5
Probably not cg formatted stuff: 
: 
New surface form: maŋŋela
"<maŋŋela>"
Using lemma: maŋŋel
No expansion tags in
    "maŋŋel" N Sem/Dummytag Sg Gen <W:0.0> @>N #11->12
Probably not cg formatted stuff: 
: 
New surface form: Kristusa
"<Kristusa>"
Using lemma: Kristus
No expansion tags in
    "Kristus" OLang/UND N Prop Sem/Mal Sg Gen <W:0.0> @P< #12->12
Probably not cg formatted stuff: 
: 
New surface form: riegádime
"<riegádime>"
Using lemma: riegádibme
No expansion tags in
    "riegádibme" N Sem/Dummytag Gram/NomAct Pl Nom <W:0.0> @<SUBJ #13->5
New surface form: .
"<.>"
Using lemma: .
No expansion tags in
    "." CLB <W:0.0> #14->2
Probably not cg formatted stuff: 
:\n
Probably not cg formatted stuff: 

Or, which is almost the same:

$ echo 200 | hfst-lookup ~/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol -q
200 guoktatjuodát   0,000000
200 guoktatjuohte   0,000000
$ echo guoktatjuodát+Num+Sg+Gen | hfst-lookup ~/github/giellalt/lang-smj/tools/tts/generator-gt-norm.hfstol -q
guoktatjuodát+Num+Sg+Gen    guoktatjuodát+Num+Sg+Gen+?  inf
$ echo guoktatjuodát | hfst-lookup ~/github/giellalt/lang-smj/tools/tts/analyser-gt-norm.hfstol -q
guoktatjuodát   guoktatjuodát+A+Ord+Attr    0,000000
guoktatjuodát   guoktatjuodát+A+Ord+Sg+Nom  0,000000
$ echo guoktatjuohte+Num+Sg+Gen  | hfst-lookup ~/github/giellalt/lang-smj/tools/tts/generator-gt-norm.hfstol -q
guoktatjuohte+Num+Sg+Gen    guoktjuode  0,000000
guoktatjuohte+Num+Sg+Gen    guoktatjuode    0,000000
guoktatjuohte+Num+Sg+Gen    guoktetjuode    0,000000
guoktatjuohte+Num+Sg+Gen    guovtetjuode    0,000000
$ echo guoktjuode | hfst-lookup ~/github/giellalt/lang-smj/tools/tts/analyser-gt-norm.hfstol -q
guoktjuode  guoktatjuohte+Num+Pl+Nom    0,000000
guoktjuode  guoktatjuohte+Num+Sg+Gen    0,000000
guoktjuode  guoktatjuohte+Num+Sg+Ill+Attr   0,000000

etc.

snomos commented 1 year ago

So in the debug version it all looks good, except we could throw away some stuff, and we need to restrict the normaliser a bit, to not generate four variants of the same morphosyntactic form. I have commented the relevant parts below:

"<200>"
Expanding because of Arab
Using lemma: 200
1. looking up normaliser
2.a Using normalised form: guoktatjuodát
2.b regenerating lookup: guoktatjuodát+Num+Sg+Gen
3. Couldn't regenerate, reanalysing lemma: guoktatjuodát
    "guoktatjuodát" A Ord Sg Nom "guoktatjuodát"phon "200"oldlemma

guoktatjuodát should be thrown away, since 'A' does not match 'Num'. The following normalised string is what we want:

2.a Using normalised form: guoktatjuohte
2.b regenerating lookup: guoktatjuohte+Num+Sg+Gen
3. reanalysing: guoktjuode
    "guoktatjuohte" Num Pl Nom "guoktjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Gen "guoktjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Ill Attr "guoktjuode"phon "200"oldlemma
3. reanalysing: guoktatjuode
    "guoktatjuohte" Num Pl Nom "guoktatjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Gen "guoktatjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Ill Attr "guoktatjuode"phon "200"oldlemma
3. reanalysing: guoktetjuode
    "guoktatjuohte" Num Pl Nom "guoktetjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Gen "guoktetjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Ill Attr "guoktetjuode"phon "200"oldlemma
3. reanalysing: guovtetjuode
    "guoktatjuohte" Num Attr "guovtetjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Gen "guovtetjuode"phon "200"oldlemma

These forms are all good, but we need to restrict the normaliser so that it only generates the form we want (we need @ilm024 to decide which one). Alternatively, we generate all of them, but give them variant tags in the reanalysis, with enough information to select among them using CG in the next step. That way we can generate forms that fit with the rest of the text, provided there are clues in the rest of the text as to which version/style to pick. If we go this route, we still need to designate one variant as the default, probably tagged v1, and select that if no other information is given.

We need to end up with one variant only, but that does not need to happen in the normaliser step. The normaliser should probably give a warning, though, in cases where there are several alternative outputs with the exact same analysis. That is, in the example above, we should get a warning for all four Num Sg Gen variants, since there are no variant tags to differentiate them, and the normaliser should return only the first one in this case. If the normaliser returns several forms with different tags in the reanalysis, it should return them all.
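
The proposed behaviour (warn when several forms share the exact same analysis and keep only the first; keep everything that differs in tags) could look roughly like this in Python. It is a sketch of the proposal only: `pick_variants` is a hypothetical helper, "first" simply means the order in which the generator returned the forms, and the `v2` tag in the second example is invented for illustration.

```python
def pick_variants(readings):
    """Group (tags, form) pairs by tags, keeping one form per distinct analysis.

    Returns the kept readings and a list of warning messages for analyses
    that had more than one surface form.
    """
    by_tags = {}
    for tags, form in readings:
        by_tags.setdefault(tags, []).append(form)

    kept, warnings = [], []
    for tags, forms in by_tags.items():
        if len(forms) > 1:
            warnings.append(
                f"{len(forms)} forms share the analysis '{tags}': {', '.join(forms)}"
            )
        kept.append((tags, forms[0]))  # first form in generator order
    return kept, warnings


# The four genitive variants from the trace above, without variant tags.
untagged = [
    ("Num Sg Gen", "guoktjuode"),
    ("Num Sg Gen", "guoktatjuode"),
    ("Num Sg Gen", "guoktetjuode"),
    ("Num Sg Gen", "guovtetjuode"),
]
kept, warnings = pick_variants(untagged)
print(kept)           # [('Num Sg Gen', 'guoktjuode')]
print(len(warnings))  # 1

# With (hypothetical) variant tags every form has its own analysis and is kept.
tagged = [("Num Sg Gen v1", "guoktatjuode"), ("Num Sg Gen v2", "guoktjuode")]
print(pick_variants(tagged))
```

Choosing among tagged variants would then be left to a later CG step, as suggested above.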

What I do not understand is why we end up with:

"<200>"
    "guoktatjuohte" Num Sg Nom "guoktatjuohte"phon "200"oldlemma

i.e. guoktatjuohte and Num Sg Nom, when the reanalysed forms clearly say, for example, guoktatjuode and Num Sg Gen, as in:

3. reanalysing: guoktjuode
    "guoktatjuohte" Num Sg Gen "guoktjuode"phon "200"oldlemma

So although everything is correct, we end up with the wrong form. That looks like a bug somewhere.

The other forms, like Num Pl Nom and Num Sg Ill Attr, should be filtered out because of tag mismatch with the input tags.
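
A small Python sketch of that filter, assuming the expected tags are the original tags with the trigger tag and prefixed tags already removed. The helper name `matching_readings` is hypothetical, and the readings are the reanalyses of guoktjuode from the trace above.

```python
def matching_readings(readings, expected_tags):
    """Keep reanalysed readings whose tag sequence equals the expected one."""
    return [
        (lemma, tags, form) for lemma, tags, form in readings
        if tags.split() == expected_tags
    ]


reanalysed = [
    ("guoktatjuohte", "Num Pl Nom", "guoktjuode"),
    ("guoktatjuohte", "Num Sg Gen", "guoktjuode"),
    ("guoktatjuohte", "Num Sg Ill Attr", "guoktjuode"),
]
print(matching_readings(reanalysed, ["Num", "Sg", "Gen"]))
# [('guoktatjuohte', 'Num Sg Gen', 'guoktjuode')]
```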

flammie commented 1 year ago

Current version should throw away all tag strings that don't match (with ; in debug mode).

snomos commented 1 year ago

Thanks, I just tested it. This is what I get:

echo 'Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.' \
| ./tools/tts/modes/trace-smj-normaliser.mode
"<200>"
    "200" Num Arab Sg Gen "200>"MIDTAPE <W:0.0> SELECT:1388:Arab MAP:1357:>nNum @>N #9->10 SETPARENT:866:SetModToN
;   "200" Num Arab Err/Orth Ess "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;   "200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;   "200" Num Arab Err/Orth Sg Com "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;   "200" Num Arab Sg Ela Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;   "200" Num Arab Sg Ill Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;   "200" Num Arab Sg Ine Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;   "200" Num Arab Sg Nom "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:3397
;   "200" Num Sem/ID "200"MIDTAPE <W:0.0> SELECT:1388:Arab

So no conversion anymore. I then run it on just 200, with more details:

echo 200 | hfst-tokenise -g tools/tts/tokeniser-tts-cggt-desc.pmhfst | \
  egrep '(^"| Gen )' | \
divvun-normaliser -v -a tools/tts/analyser-gt-norm.hfstol \
-g tools/tts/generator-gt-norm.hfstol -n tools/tts/transcriptor-gt-desc.hfstol -t Arab
Being verbose.
Surface analyser set to: tools/tts/analyser-gt-norm.hfstol
Normaliser set to: tools/tts/transcriptor-gt-desc.hfstol
Generator set to: tools/tts/generator-gt-norm.hfstol
Deep analyser set to: 
Tags set to: Arab 
Reading files: 
* tools/tts/transcriptor-gt-desc.hfstol
* tools/tts/generator-gt-norm.hfstol
* tools/tts/analyser-gt-norm.hfstol
* 
expanding tags: 
New surface form: 200
"<200>"
Expanding because of Arab
Using lemma: 200
1. looking up normaliser
2.a Using normalised form: guoktatjuodát
2.b regenerating lookup: guoktatjuodát+Num+Sg+Gen+MIDTAPE
3. Couldn't regenerate, reanalysing lemma: guoktatjuodát
;   "guoktatjuodát" A Ord Attr "guoktatjuodát"phon "200"oldlemma NORMALISER_REMOVE:notgenerated
;   "guoktatjuodát" A Ord Sg Nom "guoktatjuodát"phon "200"oldlemma NORMALISER_REMOVE:notgenerated
2.a Using normalised form: guoktatjuohte
2.b regenerating lookup: guoktatjuohte+Num+Sg+Gen+MIDTAPE
3. Couldn't regenerate, reanalysing lemma: guoktatjuohte
;   "guoktatjuohte" Num Sg Nom "guoktatjuohte"phon "200"oldlemma NORMALISER_REMOVE:notgenerated
    "200" Num Arab Sg Gen "200>"MIDTAPE <W:0.0>

For whatever reason it is not able to generate. I then tried the analysis and generation steps with the FSTs used by the normaliser:

echo guoktatjuohte | hfst-lookup -q tools/tts/analyser-gt-norm.hfstol                                                                                         
guoktatjuohte   guoktatjuohte+Num+Sg+Nom    0.000000

echo guoktatjuohte+Num+Sg+Nom | hfst-lookup -q tools/tts/generator-gt-norm.hfstol                                                                                                                               
guoktatjuohte+Num+Sg+Nom    guoktjuohte 0.000000
guoktatjuohte+Num+Sg+Nom    guoktatjuohte   0.000000
guoktatjuohte+Num+Sg+Nom    guoktetjuohte   0.000000

echo guoktatjuohte+Num+Sg+Gen | hfst-lookup -q tools/tts/generator-gt-norm.hfstol                                                                                                                               
guoktatjuohte+Num+Sg+Gen    guoktjuode  0.000000
guoktatjuohte+Num+Sg+Gen    guoktatjuode    0.000000
guoktatjuohte+Num+Sg+Gen    guoktetjuode    0.000000
guoktatjuohte+Num+Sg+Gen    guovtetjuode    0.000000

No problems whatsoever. So the question is: why can't the generator generate when used in the normaliser, when there are no problems when it is used directly on the command line?

snomos commented 1 year ago

It looks as if MIDTAPE is interfering with the generation - is the right string being sent to the generator? Cf.:

2.b regenerating lookup: guoktatjuohte+Num+Sg+Gen+MIDTAPE

snomos commented 1 year ago

If so, that would explain why the generation does not go through 🙂
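
To make the suspected failure concrete, here is a hedged Python sketch of extracting only the morphological tags from a CG reading line before building the generator string. It is not the parser used in libdivvun or VislCG3; the names `morph_tags` and `generator_lookup` are invented, and the skip rules (quoted strings such as "200>"MIDTAPE, angle-bracket tags, @ functions, # dependencies, and rule traces containing a colon) are assumptions based on the output shown in this thread. It also assumes that secondary quoted strings contain no spaces.

```python
import re

# A reading line: optional ';', a quoted lemma, then whitespace-separated tokens.
READING = re.compile(r'^\s*;?\s*"(?P<lemma>[^"]*)"\s*(?P<rest>.*)$')


def morph_tags(reading_line):
    """Return (lemma, morphological tags) for one CG reading line."""
    match = READING.match(reading_line)
    if match is None:
        raise ValueError(f"not a CG reading: {reading_line!r}")
    tags = []
    for token in match.group("rest").split():
        # Skip "200>"MIDTAPE, <W:0.0>, @>N, #9->10 and traces like SELECT:1388:Arab.
        if token.startswith(('"', "<", "@", "#")) or ":" in token:
            continue
        tags.append(token)
    return match.group("lemma"), tags


def generator_lookup(new_lemma, reading_line, target_tags):
    """Build the generator input from the new lemma and the cleaned tags."""
    _, tags = morph_tags(reading_line)
    kept = [t for t in tags if "/" not in t and t not in target_tags]
    return "+".join([new_lemma, *kept])


line = '    "200" Num Arab Sg Gen "200>"MIDTAPE <W:0.0> @>N #9->10'
print(generator_lookup("guoktatjuohte", line, {"Arab"}))
# guoktatjuohte+Num+Sg+Gen
```

If the quoted string with its MIDTAPE suffix is not skipped, MIDTAPE ends up as an extra tag in the lookup string, which matches the failing `guoktatjuohte+Num+Sg+Gen+MIDTAPE` lookup in the trace.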

flammie commented 1 year ago

Ah yes, that is true, but it touches on the topic we talked about last week: there are different CG parsers all over the codebase, and this one was not particularly good at interpreting tags.

snomos commented 1 year ago

OK. Perhaps we should have just one codebase for parsing CG, or possibly use code from the VislCG3 code?

snomos commented 1 year ago

Fixed in https://github.com/divvun/libdivvun/commit/d0647bf3f655c7eddffe768af626d65d3438fe2d:

echo 'Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.' | \
./tools/tts/modes/trace-smj-normaliser.mode
...
"<200>"
    "guoktatjuohte" Num Sg Gen "guoktjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Gen "guoktatjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Gen "guoktetjuode"phon "200"oldlemma
    "guoktatjuohte" Num Sg Gen "guovtetjuode"phon "200"oldlemma
;   "200" Num Arab Err/Orth Ess "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;   "200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;   "200" Num Arab Err/Orth Sg Com "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;   "200" Num Arab Sg Ela Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;   "200" Num Arab Sg Ill Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;   "200" Num Arab Sg Ine Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;   "200" Num Arab Sg Nom "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:3397
;   "200" Num Sem/ID "200"MIDTAPE <W:0.0> SELECT:1388:Arab

Great! We can move on to the next bug 🙂

ilm024 commented 1 year ago

We want Gen as "guoktatjuode". Guoktjuode does not work, since there is no compound part in front of it. This is on my "to do" list, but I need help, as I cannot manage it myself. "Guokte" is already tagged with Use/Marg; can't these be avoided automatically?

snomos commented 1 year ago

Yes, that should happen automatically. It requires some reorganisation, but it will be sorted out.

snomos commented 1 year ago

It now happens automatically, based on the tags you have added, @ilm024 🙂

echo 'Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.' | \
./tools/tts/modes/trace-smj-normaliser.mode
...
"<sámijda>"
    "sábme" N Sem/Hum_Lang Pl Ill "sábme>Q1jda"MIDTAPE <W:0.0> @<ADVL #7->5
: 
"<suláj>"
    "sulla" N Sem/Dummytag Pl Com "sulla>Q1j"MIDTAPE <W:0.0> @<ADVL #8->5
: 
"<200>"
    "guoktatjuodát" A Ord Attr "guoktatjuodát"phon "200"oldlemma
    "guoktatjuodát" A Ord Sg Nom "guoktatjuodát"phon "200"oldlemma
    "guoktatjuohte" Num Sg Gen "guoktatjuode"phon "200"oldlemma
: 
"<jage>"
    "jahke" N Sem/Time Sg Gen "jahke>Q1"MIDTAPE <W:0.0> @<ADVL #10->5
: 
"<maŋŋela>"
    "maŋŋel" N Sem/Dummytag Sg Gen "maŋŋela"MIDTAPE <W:0.0> @>N #11->12

@flammie as for the ordinal form that shows up, it looks like a step backwards (a regression) - you had solved that problem earlier, hadn't you? That is, ignore generated forms whose POS does not match the part of speech we send in, isn't that right? So what is happening here?

flammie commented 1 year ago

I think they were previously thrown away because of the generation failure; as a fix for #36 we now use the lemma form from the transcriptor regardless. Maybe I can just match the tags against this form as well...

snomos commented 1 year ago

Having tested with the newest build of libdivvun, I can confirm that things work as they should:

echo 'Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.' | \
./tools/tts/modes/smj-normaliser.mode
[...]
"<suláj>"
    "sulla" N Sem/Dummytag Pl Com "sulla>Q1j"MIDTAPE <W:0.0> @<ADVL #8->5
: 
"<200>"
    "guoktatjuohte" Num Sg Gen "guoktatjuode"phon "200"oldlemma
: 
"<jage>"
    "jahke" N Sem/Time Sg Gen "jahke>Q1"MIDTAPE <W:0.0> @<ADVL #10->5
: 
"<maŋŋela>"
    "maŋŋel" N Sem/Dummytag Sg Gen "maŋŋela"MIDTAPE <W:0.0> @>N #11->12

No more ambiguous forms, only the one we want 😄