himselfv / wakan

Japanese and Chinese learning tool with dictionary

Encoding problems based on OS language #266

Closed himselfv closed 6 years ago

himselfv commented 9 years ago

Original report by Anonymous.

Originally reported on Google Code with ID 266

Using the newest 1.90 version of your excellent product, I found that Wakan tends to
somehow interact with the underlying Windows OS in such a way, that the program gets
limited by the OS's language.

Here's what I mean: I was using the German version of Windows 7, but decided to make
a clean install, of which there's no German version available. So when I was using
the German version, everything was fine, even in older versions of Wakan.

However, after switching to the English version and trying your new version, I found
that Wakan does not recognize the German letters "ä, ö, ü" anymore. This is important,
because in the Wadoku dictionary it would of course list German translations. Whenever
one of the above letters is supposed to be shown in dictionary or even elsewhere it
is replaced by a small black dot and a relatively wide space. When I try to add that
entry into vocabulary it cuts off at the first point where this happens. So the vocab
basically gets cut.

If you proceed anyway the word in dictionary will be shown as you put it into your
vocab, so it will be blue and the German Wadoku text will be cut off at these points.

I found someone talking about all kinds of strange characters appearing on Japanese
PCs, when you try to run Wakan. So my best guess would be that Wakan somehow takes
its range of accepted letters from the underlying OS.

None of this, btw, affects the Japanese or Chinese. Those can be seen just fine.

Thanks.

Reported by supermarkus420 on 2014-12-15 18:30:26

himselfv commented 9 years ago

Original comment by Anonymous.

Here's an example:

虫 [むし] insect, bug, cricket, moth, worm <pop,n> ——edict2; [1], Insekt, Wurm, Gew ——wadoku

Now as you see it cuts off in Wadoku. You only see "Gew". It is supposed to say "Gewürm".
I copied it as Text Line, so even there it cuts off.

Reported by supermarkus420 on 2014-12-15 18:55:46

himselfv commented 9 years ago

Original comment by Anonymous.

Here is a probably related Issue, so I will post it here:

I will show you two entries copied as Text Line:

つきましては (See 就いては) in line with this, therefore <conj,polite,kanji> ——edict2; daraus
resultierend, deshalb, (h ——wadoku (the "h" is the product of a cutoff: it's supposed
to say "höfl.", the "polite" marker)

This is the first: copied when the word is only in dictionary and not put in your vocab.

つきましては 1. <conj>(See ????) in line with this, therefore (polite; kanji); 2. daraus
resultierend, deshalb, (h <grammatikalische Ausdrücke>

Now it is put into my vocab and it replaces the Kanji with 4 Question Marks.

Okay, that is all. Thanks for your efforts so far and going forward. I really do appreciate
it. Your work is outstanding!

Reported by supermarkus420 on 2014-12-15 19:01:43

himselfv commented 9 years ago

Both observations confirmed, will investigate (but ETA after new year). Thank you for
the kind words!

Reported by himselfv on 2014-12-18 13:56:20

himselfv commented 9 years ago

Original comment by Anonymous.

No Prob. Thanks for improving this program as much as you did and still pressing on.
Let's hope this one will provide some people with an awesome tool for learning Japanese.
It really is genuinely good and awe-inspiring for you to work on this in your spare
time!

Considering we're closing in on the holiday season, Happy Holidays and New Year!

Reported by supermarkus420 on 2014-12-19 11:06:37

himselfv commented 9 years ago

Original comment by Anonymous.

Since it seems to be related: the issue I found that is roughly equivalent to this one is Issue 201.

You mentioned using the Japanese locale to try to reproduce it, but I really think it is deeply
connected to the actual language the OS is using. So you could never reproduce the
weird text on Japanese PCs with any other language. If that one gets fixed and Wakan
is somehow decoupled from the OS language, a lot of problems that potential users around
the world could experience would be wiped out right there. That would be a major boon.

Reported by supermarkus420 on 2014-12-19 11:27:42

himselfv commented 9 years ago

Original comment by Anonymous.

It kinda strikes me as odd now. I have been learning about the difference between Unicode
programs and the old stuff and I thought that Unicode Wakan should work fine on any
system. However, it does not.

Observation:

This problem is not a locale thing at all; it's deeply rooted in the OS language.
- On German Win 7 it displays the German letters fine.
- On English Win 7 not so much.
- On Japanese Win 7 all the characters of everything (menu etc.) get converted into
a jumbled mess, kinda like when you open an ANSI .txt document with German letters
on a PC that has the Japanese locale set (the Japanese locale recognizes standard English letters
(like they all do), but not German ones).
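
The ANSI text-file comparison can be reproduced outside Wakan. This is a minimal sketch, assuming Windows-1252 as the German ANSI code page and cp932 as the Japanese one (neither is named in the thread); `misread` is a hypothetical helper written only for this illustration.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one legacy code page, then decode it with another."""
    raw = text.encode(written_as)
    # errors="replace" substitutes U+FFFD for bytes the second code page
    # cannot interpret, instead of raising an exception.
    return raw.decode(read_as, errors="replace")


# Plain ASCII is identical in both code pages, so it survives the round trip.
plain = misread("Insekt, Wurm", "cp1252", "cp932")

# The single byte for "ü" in cp1252 is not a valid character on its own in
# cp932, so the text is damaged from that position onward.
garbled = misread("Gewürm", "cp1252", "cp932")

print(plain)
print(ascii(garbled))
```

It illustrates the analogy only and makes no claim about what Wakan does internally.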

Conclusion: There might be a problem with the Unicode side of things, maybe Unicode
implementation is faulty and falls back to the standard character set of the OS it's
running on?

This one's noticeable whenever you're leaving the "corridor of English". I also may be
able to explain why you cannot reproduce it. I guess your PC is Russian language and
the only languages you're using in Wakan are Russian and English. Now, like all of
them, the Russian-language Windows recognizes the standard English alphabet just fine
and there's no problem, because the Russian alphabet gets recognized as well.

So basically this is a problem whenever you are working outside of the language of
your own OS.  

Reported by supermarkus420 on 2015-08-09 10:00:28

himselfv commented 6 years ago

The second part ("See ????") is the same as Issue #254 and will be fixed with Issue #220.

The first part is interesting and should be checked. Additional info seems to be in support of #201 even though it's closed now.

himselfv commented 6 years ago

I've found the problem with #201 so that part is fixed now. Only the "does not recognize the German letters 'ä, ö, ü' anymore" part remains for this issue.

himselfv commented 6 years ago

Seems to be caused by the 3-byte JIS X 0212 characters (EUC encoding) in the German portions of the entries. These are not parsed properly by the decoder. (To be fair, they are not parsed properly by many editors.) Confirmed that this is at least part of the problem: these characters do indeed show up broken in Wakan.
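
A quick way to see such a sequence, as a rough sketch: Python's built-in `euc_jp` codec covers JIS X 0212, so encoding a German word with it shows the 3-byte form. The helper name `jisx0212_offsets` is made up for this example, and the sample word is taken from the report above rather than from the dictionary file itself.

```python
SS3 = 0x8F  # EUC-JP "single shift 3": introduces a JIS X 0212 character


def jisx0212_offsets(data: bytes) -> list:
    """Return the byte offsets of SS3 lead bytes in well-formed EUC-JP data.

    A plain scan is enough because 0x8F never occurs as a trailing byte
    in EUC-JP (trailing bytes are always in 0xA1..0xFE).
    """
    return [i for i, b in enumerate(data) if b == SS3]


sample = "Gewürm".encode("euc_jp")

print(sample.hex(" "))           # raw bytes, for inspection
print(jisx0212_offsets(sample))  # position of the sequence encoding "ü"
```

A decoder that only knows 1-byte ASCII and 2-byte JIS X 0208 has nothing sensible to do with the byte at that offset, which would be consistent with text breaking right after "Gew".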

Temporary solution: switched to using UTF-8 version of wadokujt by default. The same entries now look fine.

Proper solution: add JIS X 0212 support to the decoder. Will consider this now.

himselfv commented 6 years ago

JIS X 0212 wiki article: https://en.wikipedia.org/wiki/JIS_X_0212

Documentation: https://www.itscj.ipsj.or.jp/iso-ir/159.pdf

All JIS X 0212 sequences in EUC start with 0x8F, followed by two bytes, each in the standard EUC byte range 0xA1–0xFE.

To implement this, the decoder needs to recognize this lead byte, consume the 2 following bytes, and pass them to the table conversion routine. Encoding is the same process in reverse.
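
The lead-byte handling could look roughly like this. It is a sketch in Python, not Wakan's decoder: `split_euc_jp` and `decode_euc_jp` are hypothetical names, the table conversion step is delegated to Python's `euc_jp` codec as a stand-in for a real lookup table, and the trailing-byte check for half-width kana is looser than the standard requires.

```python
SS2 = 0x8E  # half-width katakana: one trailing byte
SS3 = 0x8F  # JIS X 0212: two trailing bytes


def is_trail(b: int) -> bool:
    """True for bytes in the standard EUC range 0xA1..0xFE."""
    return 0xA1 <= b <= 0xFE


def split_euc_jp(data: bytes) -> list:
    """Split EUC-JP bytes into one bytes object per character."""
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            size = 1                      # ASCII
        elif lead == SS3:
            size = 3                      # JIS X 0212: lead byte + 2 bytes
        elif lead == SS2 or is_trail(lead):
            size = 2                      # kana or JIS X 0208
        else:
            raise ValueError(f"unexpected byte 0x{lead:02X} at offset {i}")
        unit = data[i:i + size]
        if len(unit) < size or not all(is_trail(b) for b in unit[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        units.append(unit)
        i += size
    return units


def decode_euc_jp(data: bytes) -> str:
    # Stand-in for the table conversion routine: each unit is looked up
    # separately, here through Python's own codec.
    return "".join(unit.decode("euc_jp") for unit in split_euc_jp(data))


raw = "虫 Gewürm".encode("euc_jp")
print([len(u) for u in split_euc_jp(raw)])
print(decode_euc_jp(raw))
```

Splitting first and converting second keeps the 0x8F case a small addition to the byte scanner, separate from whichever mapping table is used.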

I don't think I want to do this at the moment so long as there's a UTF-8 solution for wadokuJT, but I'm still considering this.

himselfv commented 6 years ago

WadokuJT problems have been solved by using the UTF-8 version of wadokuJT by default. I'm not against adding support for JIS X 0212, but:

  1. It's rarely used.

  2. It's some work.

I'll leave it for now, until the need arises.