sorlok / waitzar

Automatically exported from code.google.com/p/waitzar

Possible re-encoding of Zawgyi wordlist might be helpful #70

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
#WZ 1.5, dictionary issue

Consider:
ႏ ံ ွ = nhan

According to UTN 11, "dot above" should occur after "leg back". This issue
was brought up before, but at the time UTN 11 was becoming obsolete. It has
since been revised, and the new version is very clean.

From Zawgyi's point of view, it is harmless to switch the previous word to:
ႏ ွ ံ = nhan
...they render the same way. Several other combining marks have a similar
property.
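
A minimal sketch of that one reordering, in Python. The code points are assumptions read off the example above (U+1036 for "dot above", U+103D for the mark this issue calls "leg back"), and `reorder_dot_above` is a hypothetical helper, not part of WaitZar. It only handles this single pair; the other combining marks with a similar property would each need their own rule.

```python
import re

# Assumed code points, taken from the "nhan" example in this issue.
DOT_ABOVE = "\u1036"  # "dot above"
LEG_BACK = "\u103d"   # the mark this issue calls "leg back"

# Matches "dot above" immediately followed by "leg back".
_OUT_OF_ORDER = re.compile(DOT_ABOVE + LEG_BACK)


def reorder_dot_above(word: str) -> str:
    """Move "dot above" after an immediately following "leg back"."""
    return _OUT_OF_ORDER.sub(LEG_BACK + DOT_ABOVE, word)
```

Running the helper on an entry that is already in the new order leaves it unchanged, so it could be applied to the whole wordlist in one pass.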

Of course, some things must remain out-of-order. Besides "a" and "ya-yit",
there are subtle entries which just look bad when rendered in "proper"
order. I don't see this as an issue, since Zawgyi is inherently
incompatible with Unicode 5.1.

However, there are some benefits to bringing our dictionary more in line
with UTN 11:
 1) Naive sorting algorithms designed for Unicode 5.1 won't break so badly
on Zawgyi text typed in WZ. (Consider: Excel)
 2) Zawgyi-One has almost no research done on it, so we might as well set a
standard, for legacy research purposes.

Moreover:
 1) We've got to make some changes to the model anyway, before we can
introduce fully-compliant letter-based typing. Might as well clean up the
model before we release it again; it shouldn't affect anybody if we do
proper testing.

So this is more of a WZ 1.8 release goal.

Original issue reported on code.google.com by seth.h...@gmail.com on 16 Mar 2009 at 6:06

GoogleCodeExporter commented 9 years ago
Note: 1.8 will require a Unicode-based encoding internally, so this bug is moot.

However, converting from/to Unicode brings up the issue of encoding order, so
I'm leaving this open until we validate our converter.

Original comment by seth.h...@gmail.com on 24 Nov 2009 at 2:56

GoogleCodeExporter commented 9 years ago
Note from 1.8 super-bug:
------------------------

I'm removing bug 70; Unicode is used internally, but models can still maintain 
their own scratch encoding.

1.8's big contribution in this regard was converting Burglish to Unicode. I'll 
save WaitZar's wordlist for 1.9, since we'll be touching up the WZ wordlist 
anyway for 1.9.

Original comment by seth.h...@gmail.com on 18 Aug 2010 at 7:04

GoogleCodeExporter commented 9 years ago
More info:
Part of the reason we're not updating is that some words can appear both ways. E.g., မွဴး and မႉး both appear in our wordlist. By Myanmar's own spelling rules, these two should be equivalent, but there was a lot of uncertainty in the original voting.

Since 1.9 will only be a partial re-vote (khyit -> chit, etc.), we'll be able 
to spend more time hunting down experts to get a final word on the equivalency 
issues. Otherwise, simple round-trip conversions like ZG->UNI->ZG will fail.
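
A rough sketch of how such round-trip failures could be listed, in Python. `find_round_trip_failures` is a hypothetical helper, and `to_unicode` / `to_zawgyi` stand in for whatever converter functions are actually used; none of these names come from WaitZar.

```python
from typing import Callable, Iterable, List, Tuple


def find_round_trip_failures(
    words: Iterable[str],
    to_unicode: Callable[[str], str],
    to_zawgyi: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (original, round_tripped) pairs where ZG->UNI->ZG changes the word."""
    failures = []
    for word in words:
        round_tripped = to_zawgyi(to_unicode(word))
        if round_tripped != word:
            failures.append((word, round_tripped))
    return failures
```

If two Zawgyi spellings convert to the same Unicode string, the converter back to Zawgyi can only return one of them, so the other spelling will show up in this list.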

Original comment by seth.h...@gmail.com on 4 Oct 2010 at 2:56