giellalt / lang-mns

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Mansi language
https://giellalt.uit.no
GNU Lesser General Public License v3.0

The analyser introduces an +Ex/V tag and the tokeniser does not remove the “+” sign before tag #4

Open rueter opened 3 months ago

rueter commented 3 months ago
cd lang-mrj
make distclean
./autogen.sh && ./configure --enable-tokenisers --enable-morpher 
make 

For some odd reason the analyzer introduces an +Ex/... tag, which is something lang-sms, for example, does not do. +Ex/... tags are brought in by the tokeniser.

hfst-lookup src/fst/analyser-gt-norm.hfstol 
> ӹлӹмӹжӹм
ӹлӹмӹжӹм    ӹлӓш+Ex/V+Der+Der/мЫ+Pass+Prc+A+Sg+Acc+PxSg3+So/PC  0,000000

The tokeniser cannot handle the previously introduced +Ex/V tag. Should the +Ex/V tag be showing up in the analyzer at all?

echo 'ӹлӹмӹжӹм' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst 
"<ӹлӹмӹжӹм>"
    "ӹлӓш"+Ex/V Der Der/мЫ Pass Prc A Sg Acc PxSg3 So/PC <W:0.0>
Trondtr commented 3 months ago

The tag is not declared in root.lexc, that is the problem. I did it now -- git pull. I did it for mrj and mns, there may be other lgs where it is needed.

rueter commented 3 months ago

The tag is not declared in root.lexc, that is the problem. I did it now -- git pull. I did it for mrj and mns, there may be other lgs where it is needed.

This does NOT SOLVE the issue in mrj, where the result of:

echo 'ӹлӹмӹжӹм' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst

is still:

"<ӹлӹмӹжӹм>"
    "ӹлӓш"+Ex/V Der Der/мЫ Pass Prc A Sg Acc PxSg3 So/PC <W:0.0>
Trondtr commented 3 months ago

OK, this seems to be the thing: the filter for removing the tag exists, but it is not added to the Makefile. I will have a look.

uit-mac-443 lang-mns (main)$ grep "rename-POS_before_Der-tags" ../lang-sme/src/fst/Makefile.am
                filters/rename-POS_before_Der-tags.hfst        \
        .o. @\"filters/rename-POS_before_Der-tags.hfst\"      \
                    filters/rename-POS_before_Der-tags.%      \
            .o. @\"filters/rename-POS_before_Der-tags.$*\"      \
                    filters/rename-POS_before_Der-tags.%      \
            .o. @\"filters/rename-POS_before_Der-tags.$*\"      \
                filters/rename-POS_before_Der-tags.hfst      \
        .o. @\"filters/rename-POS_before_Der-tags.hfst\"      \
        .o. @\"filters/rename-POS_before_Der-tags.hfst\"                  \
uit-mac-443 lang-mns (main)$ grep "rename-POS_before_Der-tags" src/fst/Makefile.am
(nothing)
Trondtr commented 3 months ago

Hmm, it wasn't that easy. The filter was missing from the mns Makefile.am, but not from the mrj one:

uit-mac-443 lang-mrj (main)$ grep "rename-POS_before_Der-tags" src/fst/Makefile.am
                filters/rename-POS_before_Der-tags.hfst        
           @\"filters/rename-POS_before_Der-tags.hfst\"      \
                filters/rename-POS_before_Der-tags.hfst        
           @\"filters/rename-POS_before_Der-tags.hfst\"      \
                filters/rename-POS_before_Der-tags.hfst        
           @\"filters/rename-POS_before_Der-tags.hfst\"      \
                    filters/rename-POS_before_Der-tags.$(1) 
               @\"filters/rename-POS_before_Der-tags.$(1)\" \

It thus seems I am not on the right track after all. Stay tuned.

snomos commented 3 months ago

The tag is not declared in root.lexc, that is the problem. I did it now -- git pull. I did it for mrj and mns, there may be other lgs where it is needed.

The tag is automatically created, and should not be added to root.lexc.

snomos commented 3 months ago
cd lang-mrj
make distclean
./autogen.sh && ./configure --enable-tokenisers --enable-morpher 
make 

For some odd reason the analyzer introduces an +Ex/... tag, which is something lang-sms, for example, does not do. +Ex/... tags are brought in by the tokeniser.

lang-sms should do it, as should all languages with a productive derivational system. It has to be added manually for each language, though, IIRC.

The change from POStag to Ex/POStag is done to avoid issues with disambiguation: CG does not care about tag positions, so if a tag string contains first a +V and then an +A tag, both rules for verbs and adjectives will be triggered. By automatically changing all non-final POS tags to the +Ex/xxx format, only the POS tag of the last derivation will be considered by the CG rules, which is exactly what you want in 99% of the cases.
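
To make the renaming concrete, here is a minimal Python sketch of the idea. It is not the actual filter (that is a compiled FST, filters/rename-POS_before_Der-tags); the POS_TAGS set, the function name and the unrenamed input string are assumptions made for illustration.

```python
# Illustrative sketch only, not the GiellaLT filter itself.
# POS_TAGS is an assumed, incomplete set of part-of-speech tags.
POS_TAGS = {"N", "V", "A", "Adv", "Pron", "Num"}


def rename_nonfinal_pos(analysis: str) -> str:
    """Rewrite every POS tag except the last one as Ex/<POS>."""
    lemma, *tags = analysis.split("+")
    # Positions of all POS tags in the tag sequence.
    pos_positions = [i for i, tag in enumerate(tags) if tag in POS_TAGS]
    # Every POS tag but the final one belongs to an earlier derivation step.
    for i in pos_positions[:-1]:
        tags[i] = "Ex/" + tags[i]
    return "+".join([lemma, *tags])


# Assumed form of the analysis before the renaming is applied.
raw = "ӹлӓш+V+Der+Der/мЫ+Pass+Prc+A+Sg+Acc+PxSg3+So/PC"
print(rename_nonfinal_pos(raw))
# → ӹлӓш+Ex/V+Der+Der/мЫ+Pass+Prc+A+Sg+Acc+PxSg3+So/PC
```

After the renaming only +A is left as a plain POS tag, so CG rules written for verbs no longer match this reading.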

hfst-lookup src/fst/analyser-gt-norm.hfstol 
> ӹлӹмӹжӹм
ӹлӹмӹжӹм  ӹлӓш+Ex/V+Der+Der/мЫ+Pass+Prc+A+Sg+Acc+PxSg3+So/PC  0,000000

The tokeniser cannot handle the previously introduced +Ex/V tag. Should the +Ex/V tag be showing up in the analyzer at all?

This is a separate issue:

echo 'ӹлӹмӹжӹм' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst 
"<ӹлӹмӹжӹм>"
  "ӹлӓш"+Ex/V Der Der/мЫ Pass Prc A Sg Acc PxSg3 So/PC <W:0.0>

All tags should automatically be converted to the CG format, where each + is replaced with a space. Since this does not happen, it might be that the +Ex/V tag is not a real tag (a multichar symbol), just a string of individual letters. I would try to figure out exactly where and what is converting the +V to +Ex/V, and see if there is a bug there somewhere.
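
The difference between a real tag and a string of individual letters can be illustrated with a small Python sketch. It assumes a greedy longest-match split against a set of declared multichar symbols; the function name and the symbol sets are hypothetical, and this is not how one would inspect an actual HFST transducer.

```python
def split_symbols(text: str, multichar: set[str]) -> list[str]:
    """Greedy longest-match split of text into symbols."""
    # Try longer declared symbols first; ignore empty strings.
    by_length = sorted((m for m in multichar if m), key=len, reverse=True)
    symbols = []
    i = 0
    while i < len(text):
        # Fall back to a single character when no declared symbol matches.
        match = next((m for m in by_length if text.startswith(m, i)), text[i])
        symbols.append(match)
        i += len(match)
    return symbols


print(split_symbols("+Ex/V+Der", {"+Ex/V", "+Der"}))
# → ['+Ex/V', '+Der']
print(split_symbols("+Ex/V+Der", {"+Der"}))
# → ['+', 'E', 'x', '/', 'V', '+Der']
```

In the second call +Ex/V is not a symbol of its own, so a rule that looks for the single symbol +Ex/V would never match it.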

flammie commented 1 month ago

The tokeniser-disamb-gt-desc tokeniser uses the tags found in analyser-disamb-gt-desc to generate the relabeling rules in tools/tokenisers/filters/, and analyser-disamb-gt-desc does not contain +Ex tags.
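
A rough Python sketch of why a tag missing from that inventory keeps its "+" sign. The function and the tag sets are hypothetical stand-ins for the generated relabeling rules, not the real build step.

```python
def to_cg(analysis: str, known_tags: set[str]) -> str:
    """Turn '+Tag' into ' Tag', but only for tags present in known_tags."""
    lemma, *tags = analysis.split("+")
    out = f'"{lemma}"'
    for tag in tags:
        # A tag without a relabeling rule keeps its leading "+".
        out += (" " if tag in known_tags else "+") + tag
    return out


analysis = "ӹлӓш+Ex/V+Der+Der/мЫ+Pass+Prc+A+Sg+Acc+PxSg3+So/PC"
# Assumed tag inventory taken from an analyser that has no +Ex tags.
tags_without_ex = {"Der", "Der/мЫ", "Pass", "Prc", "A", "Sg", "Acc", "PxSg3", "So/PC"}

print(to_cg(analysis, tags_without_ex))
# → "ӹлӓш"+Ex/V Der Der/мЫ Pass Prc A Sg Acc PxSg3 So/PC
print(to_cg(analysis, tags_without_ex | {"Ex/V"}))
# → "ӹлӓш" Ex/V Der Der/мЫ Pass Prc A Sg Acc PxSg3 So/PC
```

The first call reproduces the shape of the tokeniser output reported above (without the weight); the second shows the expected result once Ex/V is part of the inventory the rules are generated from.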

snomos commented 4 weeks ago

@rueter The solution is thus to ensure that the tag-renaming script is also applied to the analyser-disamb-gt-desc file; you probably have to add some local, language-specific compilation steps for that to happen. The same changes should also be applied to the other analysers; see how this is done in the Sámi languages.