giellalt / bugzilla-dummy

Bugs in Xerox tools: wâw sequences, multichar symbols ending in Obj (Bugzilla Bug 1739) #323

Closed albbas closed 7 years ago

albbas commented 10 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 1739

Date: 2013-11-14T16:27:28+01:00 From: Trond Trosterud <> To: Trond Trosterud <> CC: sjur.n.moshagen

Last updated: 2016-12-18T13:40:57+01:00

albbas commented 10 years ago

Comment 8678

Date: 2013-11-14 16:27:28 +0100 From: Trond Trosterud <>

To repeat:

Add the following to the bottom of stems/nouns.lexc:

xîxa INDECL "Trond testing" ;
xixâ INDECL "Trond testing" ;
xêxa INDECL "Trond testing" ;
xâxa INDECL "Trond testing" ;
xôxa INDECL "Trond testing" ;
xâxi INDECL "Trond testing" ;
wâki INDECL "Trond testing" ;
xâwi INDECL "Trond testing" ;
wâwo INDECL "Trond testing" ;

Then, test:

wawi wawi wâwi+N+IN+Sg

wawo wawo wâwo+N+IN+Sg

wâwo wâwo wâwo +?

wâwi wâwi wâwi +?

xâwi xâwi xâwi+N+IN+Sg

xawi xawi xâwi+N+IN+Sg

The strange thing: the spellrelax rule â (->) a in the src/orthography/spellrelax.regex file should treat every a alike (as a candidate for â), but it does so only when the a is not placed between two w's.

So, what happens?
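For reference, the behaviour one would expect from an optional rule like â (->) a can be sketched in a few lines of Python. This is only an illustration of the intended mapping, not how xfst or hfst compile the rule; the function name spellrelax_variants is made up for this sketch.

```python
import unicodedata
from itertools import product


def spellrelax_variants(lexical, relaxed=None):
    """Return every surface spelling an optional rule such as â (->) a should accept.

    Hypothetical helper for illustration only; it models the expected result,
    not the finite-state compilation done by xfst or hfst.
    """
    if relaxed is None:
        relaxed = {"\u00e2": "a"}  # â may optionally be written as plain a
    lexical = unicodedata.normalize("NFC", lexical)
    # Each character offers itself, plus its relaxed spelling if it has one.
    choices = [(ch, relaxed[ch]) if ch in relaxed else (ch,) for ch in lexical]
    return sorted({"".join(combo) for combo in product(*choices)})


for stem in ("x\u00e2wi", "w\u00e2wi", "w\u00e2wo"):
    print(stem, "->", spellrelax_variants(stem))
```

Under this model both the â and the a spelling are acceptable for every stem, regardless of the neighbouring letters, which is what the test output above shows for xâwi but not for wâwi and wâwo.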

albbas commented 10 years ago

Comment 8824

Date: 2013-12-30 17:20:53 +0100 From: Trond Trosterud <>

This is a Xerox bug; cf. this report:

On Nov 15, 2013, at 9:46 AM, Trosterud Trond trond.trosterud@uit.no wrote:

An unexpected issue turned up when debugging a problem in our Plains Cree analyser. It turns out that xfst is not able to analyse words with the sequence wâw, but other sequences are OK. When we then add a spellrelax â (->) a at the bottom, it works fine. When the same code is run through hfst, it behaves as expected.

Here we show output without spellrelax (with the spellrelax, wawa and wewa would have been recognised as wâwa and wêwa, respectively).

tf-hsl-m0016:crk ttr000$ hfst-lookup src/analyser-gt-desc.hfst

pisiw pisiw pisiw+N+AN+Sg 0.000000

wâwa wâwa wâwa+N+AN+Sg 0.000000

wêwa wêwa wêwa+N+AN+Sg 0.000000

^C

tf-hsl-m0016:crk ttr000$ lookup src/analyser-gt-desc.xfst

LEXICON LOOK-UP

pisiw pisiw pisiw+N+AN+Sg

wâwa wâwa wâwa +?

wêwa wêwa wêwa+N+AN+Sg

To repeat: xfst -e "read lexc ctest.lexc" up wêwa up wâwa

The source files are available online as well, here:

http://giellatekno.uit.no/doc/lang/crk/PlainsCreeDocumentation.html

We also have a similar case for Russian, but one not containing flag diacritics. Here is output from our testbench:

YAML test 25: ./N-Ж_Ф-железа_gt-norm.yaml + analyser-gt-norm.hfst - PASS

YAML test 25: ./N-Ж_Ф-железа_gt-norm.yaml + analyser-gt-norm.xfst - FAIL

To rerun with more details, please triple-click, copy and paste the following:

pushd /Users/ttr000/main/langs/rus/test/src/morphology; /opt/local/bin/python3.2 /Users/ttr000/main/gtcore/scripts/morph-test.py -c -i -S xerox --app /Users/ttr000/bin/lookup --gen ./../../../src/generator-gt-norm.xfst --morph ./../../../src/analyser-gt-norm.xfst ./N-Ж_Ф-железа_gt-norm.yaml; popd

The point here is that the same test passes for hfst but fails for xfst (the normal case is of course that the two FSTs agree on the verdict). What happens is that some twolc rule essentially moves stress (exchanges é with e etc.) to the first syllable of the word. This works, but not when there is a ë involved. Cf.:

http://giellatekno.uit.no/doc/lang/rus/RussianDocumentation.html

The test file N-Ж_Ф-железа_gt-norm.yaml can then be found under Source files: ... yaml, and the rest of the code under source files. I will not go further into the details, as the case is not as clear as the Plains Cree one (at least in our understanding of it, it involves both lexc and twolc, although the problem might well be linked to lexc only).

Our situation now is that we run the hfst and xerox tools in parallel (as the test outcome shows), and we are not dependent upon the xerox tools working. They do work for our core languages (the Saami languages), but now and then we thus stumble upon problems like these (cf. the Komi capitalisation issue some while ago).
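A minimal sketch of such a parallel check, in Python, assuming the two tools' outputs have already been captured as text. The column layout is an assumption based on the transcripts in this report (tab-separated fields, with +? marking a failed lookup and an optional weight column for hfst); parse_lookup and disagreements are hypothetical helper names, not part of morph-test.py.

```python
def parse_lookup(text):
    """Map each input word to its set of analyses (empty set = no analysis).

    Assumes tab-separated lines: word, analysis, optional extra column.
    A failed lookup is assumed to show up as '+?' either at the end of the
    analysis field or as a field of its own.
    """
    results = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) < 2:
            continue  # blank lines, banners such as "LEXICON LOOK-UP"
        word, analysis = fields[0], fields[1]
        analyses = results.setdefault(word, set())
        failed = analysis.endswith("+?") or "+?" in fields[2:]
        if not failed:
            analyses.add(analysis)
    return results


def disagreements(first, second):
    """Return the words that exactly one of the two analysers recognises."""
    words = set(first) | set(second)
    return sorted(w for w in words if bool(first.get(w)) != bool(second.get(w)))


# Sample data copied from the transcripts in this report.
HFST_OUT = (
    "pisiw\tpisiw+N+AN+Sg\t0.000000\n"
    "wâwa\twâwa+N+AN+Sg\t0.000000\n"
    "wêwa\twêwa+N+AN+Sg\t0.000000\n"
)
XFST_OUT = (
    "pisiw\tpisiw+N+AN+Sg\n"
    "wâwa\twâwa\t+?\n"
    "wêwa\twêwa+N+AN+Sg\n"
)

print(disagreements(parse_lookup(HFST_OUT), parse_lookup(XFST_OUT)))
```

With the sample data, the only word the two analysers disagree on is the one containing the wâw sequence.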

A plausible outcome of this is a slow migration to the hfst tools, but the superior documentation (The Book!), and the habit in our fingers will certainly make the xerox tools relevant for years to come as well.

Thanks for your bug report. Ask me after a few weeks about whether there is already a fix. --

albbas commented 10 years ago

Comment 9328

Date: 2014-04-21 21:24:36 +0200 From: Trond Trosterud <>

New letter sent.

albbas commented 9 years ago

Comment 10421

Date: 2015-03-26 11:32:58 +0100 From: Sjur Nørstebø Moshagen <>

Here is another strange Xerox bug:

When multichar symbols are made optional (e.g. for generation), symbols ending in the character sequence 'Obj' (without the quotes, with the exact capitalisation as shown) can't be used as part of an input string.

I have checked all possible things in our own source code, like missing multicharacter declarations, non-breaking spaces, missing colons, etc., but everything seems fine. And there are no such problems with any of the other similar tags as far as I can tell.

The test data below is from SMA, with the relevant tag manually changed between different make & test runs to see what is working and what is not:

$ lookup -q src/generator-gt-norm.xfst

Windows+N+Prop+Sg+Nom Windows+N+Prop+Sg+Nom Windows

Windows+N+Prop+Sem/Abj+Sg+Nom Windows+N+Prop+Sem/Abj+Sg+Nom Windows

YouTube+N+Prop+Sem/Obj+Sg+Nom YouTube+N+Prop+Sem/Obj+Sg+Nom YouTube+N+Prop+Sem/Obj+Sg+Nom +?

YouTube+N+Prop+Sg+Nom YouTube+N+Prop+Sg+Nom YouTube

$ lookup -q src/generator-gt-norm.xfst

Windows+N+Prop+Sg+Nom Windows+N+Prop+Sg+Nom Windows

Windows+N+Prop+Sem/obj+Sg+Nom Windows+N+Prop+Sem/obj+Sg+Nom Windows

$ lookup -q src/generator-gt-norm.xfst

Windows+N+Prop+Sg+Nom Windows+N+Prop+Sg+Nom Windows

Windows+N+Prop+Sem/Object+Sg+Nom Windows+N+Prop+Sem/Object+Sg+Nom Windows
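The inputs used in these runs can also be listed systematically. The following Python sketch only builds the input strings; tag_variants is a hypothetical helper, and since the tag was changed in the source between builds, each variant would still have to be tested against the build that actually contains it.

```python
def tag_variants(stem, leading_tags, trailing_tags, candidates):
    """Build one generator input without the optional tag, then one per candidate.

    Hypothetical helper for illustration; it does not call lookup itself.
    """
    head = stem + "".join(leading_tags)
    tail = "".join(trailing_tags)
    inputs = [head + tail]  # baseline: optional tag left out
    for tag in candidates:
        inputs.append(head + tag + tail)
    return inputs


inputs = tag_variants(
    "Windows",
    ["+N", "+Prop"],
    ["+Sg", "+Nom"],
    ["+Sem/Abj", "+Sem/Obj", "+Sem/obj", "+Sem/Object"],
)
print("\n".join(inputs))
```

Comparing the baseline with each variant isolates the tag spelling as the only thing that changes between runs.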

The bug seems to be restricted to the lookup tool, cf. the following:

Inspect: 1 @U.Cap.Obl@ W i n d o w s 0:+N 0:+Prop 0:+Sem/Obj @U.Cap.Obl@ 0:+Sg 0:+Nom @D.CmpOnly.FALSE@ @D.CmpPref.TRUE@ @D.NeedNoun.ON@ --> Level 18 (final)

Inspect: 1 @U.Cap.Obl@ W i n d o w s 0:+N 0:+Prop @U.Cap.Obl@ 0:+Sg 0:+Nom @D.CmpOnly.FALSE@ @D.CmpPref.TRUE@ @D.NeedNoun.ON@ --> Level 17 (final)

xfst[1]: up Windows+N+Prop+Sg+Nom Windows

xfst[1]: up Windows+N+Prop+Sem/Obj+Sg+Nom Windows

xfst[1]: down Windows Windows+N+Prop+Sem/Obj+Attr Windows+N+Prop+Sem/Obj+Sg+Nom Windows+N+Prop+Attr Windows+N+Prop+Sg+Nom

$ lookup -q src/generator-gt-norm.xfst

Windows+N+Prop+Sem/Obj+Sg+Nom Windows+N+Prop+Sem/Obj+Sg+Nom Windows+N+Prop+Sem/Obj+Sg+Nom +?

Windows+N+Prop+Sg+Nom Windows+N+Prop+Sg+Nom Windows

Further, it seems restricted to (one of) the latest version(s) of lookup only:

$ lookup -v

lookup 2.5.19 (2.25.11)

Using an older version of lookup, everything works as expected:

$ ~/Downloads/xerox.100211/lookup -v

lookup 2.5.14 (2.14.10)

$ ~/Downloads/xerox.100211/lookup -q src/generator-gt-norm.xfst

Windows+N+Prop+Sg+Nom Windows+N+Prop+Sg+Nom Windows

Windows+N+Prop+Sem/Obj+Sg+Nom Windows+N+Prop+Sem/Obj+Sg+Nom Windows

albbas commented 7 years ago

Comment 11866

Date: 2016-12-15 20:58:10 +0100 From: Trond Trosterud <>

This is a Xerox bug, and given the state of affairs, I suggest a WONTFIX.

albbas commented 7 years ago

Comment 11881

Date: 2016-12-18 13:40:57 +0100 From: Trond Trosterud <>

Thus, consensus on a WONTFIX.