himselfv / wakan

Japanese and Chinese learning tool with dictionary

Pop creep #267

Open himselfv opened 9 years ago

himselfv commented 9 years ago

Original report by Anonymous.

Originally reported on Google Code with ID 267

This one isn't a problem with Wakan, but with Edict2.

I don't know why, but Edict2 overuses the "pop" tag to the point that almost every
word is marked with it.

Case in point: try typing in the kanji 一日.
You will see various entries with differing readings for that word. I agree that
"tsuitachi" and "ichinichi" are pretty popular, but "hitoe" or "ichijitsu"?

This has really become unbearable. For the longest time I avoided it by just downloading
the newest "old" Edict version that is compiled on the Wakan website. But Edict2 IS
supposed to be superior.

If there is a way to deal with this problem, I would appreciate it. Maybe getting rid
of the "pop" tag and just ordering by word frequency would be the best (though it would
still favor Edict2 over other dictionaries that do not have frequency information).
This matters because you can choose to "prefer popular words", an option which is moot
at this point.
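
A rough sketch of what ordering by frequency alone could look like, written in Python
rather than in Wakan's own code. The `Result` fields and the sample ranks are made up
for illustration; they are not Wakan's data model or real Edict2 numbers.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    word: str
    reading: str
    source: str                # which dictionary the result came from
    freq_rank: Optional[int]   # 1 = most frequent; None = no frequency data

def sort_by_frequency(results):
    """Order results by frequency rank only; the pop tag plays no role."""
    # Results without a rank sort last. This is the bias mentioned above:
    # dictionaries without frequency data always end up below Edict2.
    return sorted(results, key=lambda r: (r.freq_rank is None, r.freq_rank or 0))

results = [
    Result("一日", "いちじつ", "edict2", 9000),       # hypothetical rank
    Result("一日", "ついたち", "other-dict", None),   # source without ranks
    Result("一日", "いちにち", "edict2", 120),        # hypothetical rank
]
ordered = [r.reading for r in sort_by_frequency(results)]
```

Giving unranked results a neutral default rank instead of sorting them last would be
one way to soften that bias, but the default would be arbitrary.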

Reported by supermarkus420 on 2014-12-15 18:38:42

himselfv commented 9 years ago

Original comment by Anonymous.

A bit related to this problem: Edict2 in general (though this applies to Edict as well)
is actually very bad at showing you reasonable translations or dictionary information.
There are so many words whose freq info puts them on par with other words that are in
truth much more commonly used. And of course in Edict2 they have the pop tag as well:

For example, "ige" is further up the freq list for 以下 than "ika", even though "ige"
is so rarely used in standard language.

The same problem exists with Wadoku: it compiles all the possible meanings for a single
verb, even though most of these meanings do not apply to the verb by itself but only in
combination with other words (idiomatic or not).

Sorry it's a bit of a rant here. On the upside: Edict2 and Wadoku are at least "comprehensive
enough".

Reported by supermarkus420 on 2014-12-15 19:09:08

himselfv commented 9 years ago
Is that really Edict2's fault? Here's what my copy has for 一日:
一日(P);1日(P) [いちにち(P);いちじつ;ひとひ(ok);ひとえ(ok)] /(n) (1) (See ついたち・1) first day of the month/(2) one day/(P)/EntL1576260X/
一日(P);1日(P);朔日;朔 [ついたち(P);さくじつ(朔日);いっぴ(一日)] /(n) (1) (usu. 一日 or 1日) first day of the month/(2) (ついたち only) (arch) first ten days of the lunar month/(P)/EntL2225040X/
月立ち;月立;朔;一日 [つきたち] /(n) (1) (arch) first day of the month/(2) first ten days of the lunar month/EntL2225030X/

So the way I see it, 一日 and 1日 are marked as popular written forms, and いちにち and
ついたち as popular readings. Then the whole entry is marked as popular.
Perhaps if the entry is popular as a whole, but has parts which are not marked (P),
we should consider popular only the parts which are marked? So basically, ignore the
"popular as a whole" flag and only look at the flags for particular kanji/kana? Is there
anything which contradicts this (e.g. entries popular as a whole but with none of their
kanji/kana marked P)?
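
To make the idea concrete, here is a rough Python sketch (not Wakan's actual parser)
that reads one Edict2 line and reports only the written forms and readings carrying
their own (P) tag, leaving the entry-level /(P)/ flag aside. All function names are made
up for illustration, and reading restrictions such as さくじつ(朔日) are simply treated
as non-(P) tags.

```python
import re

# "text(tag)(tag)" -> the text plus the parenthesised tags that follow it
FORM = re.compile(r"^([^()]+)((?:\([^()]*\))*)$")

def split_forms(field):
    """Turn '一日(P);1日(P)' into [('一日', {'P'}), ('1日', {'P'})]."""
    forms = []
    for part in field.split(";"):
        part = part.strip()
        m = FORM.match(part)
        if m is None:                      # unexpected shape: keep it untagged
            forms.append((part, set()))
            continue
        tags = set(re.findall(r"\(([^()]*)\)", m.group(2)))
        forms.append((m.group(1), tags))
    return forms

def parse_line(line):
    """Split an Edict2 line into written forms, readings and the entry flag."""
    head, _, body = line.partition(" /")
    m = re.match(r"^(\S+)(?:\s+\[(.*)\])?$", head.strip())
    return {
        "written": split_forms(m.group(1)),
        "readings": split_forms(m.group(2)) if m.group(2) else [],
        "entry_pop": "/(P)/" in "/" + body,   # the "popular as a whole" flag
    }

def popular_forms(entry):
    """Trust only per-form (P) tags; ignore the entry-level flag."""
    written = [t for t, tags in entry["written"] if "P" in tags]
    readings = [t for t, tags in entry["readings"] if "P" in tags]
    return written, readings

def contradicts(entry):
    """True for an entry that is popular as a whole but has no (P) form."""
    written, readings = popular_forms(entry)
    return entry["entry_pop"] and not (written or readings)

line = ("一日(P);1日(P) [いちにち(P);いちじつ;ひとひ(ok);ひとえ(ok)] "
        "/(n) (1) (See ついたち・1) first day of the month/(2) one day/(P)/EntL1576260X/")
entry = parse_line(line)
print(popular_forms(entry), contradicts(entry))
```

Running `contradicts` over a whole dictionary file would be one way to check whether
any entry is popular as a whole while none of its kanji/kana is marked (P).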

Reported by himselfv on 2014-12-18 14:06:48

himselfv commented 9 years ago

Original comment by Anonymous.

You could be on to something. But first let me copy the first three entries for 一 (the
copy function is superb for stamping out bugs AND sharing your vocab! Nice!)

These are three distinct entries that are marked as "pop" and that rank highest in
Freq mode (since I have Freq mode on):

一 [ひ] 1. (esp. ひと) (See 一・いち・1) one <pop,pref>; 2. (when counting aloud, usu. ひ or
ひい) one <pop,num> ——edict2
一 [ひと] 1. (esp. ひと) (See 一・いち・1) one <pop,pref>; 2. (when counting aloud, usu. ひ or
ひい) one <pop,num> ——edict2
一 [ひい] 1. (esp. ひと) (See 一・いち・1) one <pop,pref>; 2. (when counting aloud, usu. ひ or
ひい) one <pop,num> ——edict2

Now, what I am confused about is probably a bit difficult to articulate:

Those are the first three entries for the kanji 一, so they would be "the most popular",
but "hi" or "hii"?? Even when counting, I usually hear "ichi", not "hii".

Edict2 displays differing readings for the same word as separate entries.
And you are right: it seems that they get marked as pop because their translations
("one") are popular.
But some or even most of the readings are not popular and shouldn't be shown to pretty
much anyone other than the Japanese enthusiast, because they just aren't ever used.
Sometimes they sound outright slurred or dialectal.
It also seems that Edict2 just shares Freq and Pop information by virtue of these "esp."
markers, which tell you that the entry "hi" is popular, especially "hito" for "one" and
"hi" or "hii" for counting "one". In this case there should really only be one entry
that shows you that "hito" is somewhat popular for "one" (even though it really isn't;
"ichi" is). And it should rigorously filter out anything that is obscure, or at least
not mark it as pop.

I am no programmer, but I have thought of the following possible solution: consolidate
all these entries into one entry and then show only the really popular readings and
exactly what word they each stand for. Then you could have an "advanced mode" where
you could hover over the entry and see other, more specific and obscure usages, translations
and readings.
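
The consolidation idea above could be sketched roughly like this in Python. This is only
an illustration, not how Wakan stores results: the row layout is invented, and the
per-reading popularity flags are guesses, since Edict2 as shown above marks all three
rows as pop.

```python
def consolidate(rows):
    """Merge rows that share the same written form and gloss into one entry."""
    merged = {}  # dicts keep insertion order, so the first-seen order survives
    for written, reading, gloss, reading_is_pop in rows:
        entry = merged.setdefault(
            (written, gloss),
            {"written": written, "gloss": gloss, "readings": []},
        )
        entry["readings"].append((reading, reading_is_pop))
    return list(merged.values())

def visible_readings(entry, advanced=False):
    """Normal mode shows popular readings only; advanced mode shows them all."""
    return [r for r, pop in entry["readings"] if advanced or pop]

rows = [
    # (written, reading, gloss, reading_is_pop) -- flags are illustrative guesses
    ("一", "ひ", "one", False),
    ("一", "ひと", "one", True),
    ("一", "ひい", "one", False),
]
entries = consolidate(rows)
normal = visible_readings(entries[0])
advanced = visible_readings(entries[0], advanced=True)
```

With these made-up flags the three rows collapse into a single entry, normal mode lists
only ひと, and advanced mode still exposes ひ and ひい.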

Now let me just say that none of this is Wakan's fault. It really is Edict2. And I
trust it would be a mess. Sadly both Edict2 and Wadoku really seem to be kind of lazily
compiled, all things considered. However, if there are some ways or algorithms that
could limit this, then the signal-to-noise ratio would be much improved. This would lessen
frustration and speed up meaningful learning.

Those are my thoughts about this. Thx as always.

Reported by supermarkus420 on 2014-12-19 11:49:12

himselfv commented 9 years ago

Original comment by Anonymous.

OK. Let me analyze this further today, using the same examples:
一 [ひ] 1. (esp. ひと) (See 一・いち・1) one <pop,pref>; 2. (when counting aloud, usu. ひ or
ひい) one <pop,num> ——edict2
一 [ひと] 1. (esp. ひと) (See 一・いち・1) one <pop,pref>; 2. (when counting aloud, usu. ひ or
ひい) one <pop,num> ——edict2
一 [ひい] 1. (esp. ひと) (See 一・いち・1) one <pop,pref>; 2. (when counting aloud, usu. ひ or
ひい) one <pop,num> ——edict2

Observations:

- 「ひ」 and 「ひい」 are unusual and almost never used in today's standard language
- They still get marked as <pop> and have a high frequency because they get mistakenly
associated with [ひと] through this (esp. ひと) marker
- So they share the parameters of [ひと], but show up as separate entries, confusing
people and cluttering dictionary and kanji compound search results
- The second meaning, counting aloud, is not <pop> at all in today's language. Today,
you would count by saying 「いち」. But the second meaning harkens back to 「ひ」 and 「ひい」
having their own entries.
- I also mentioned this one: 「以下」, which can be pronounced both 「いか」 and, MUCH more
rarely, 「いげ」. However, the rare pronunciation is also marked <pop>, probably because it
gets associated with the popular reading, just as a deviation. This is fine, but then it
should not have a separate entry.

Conclusion:

- IMHO all of these entries are useless. Even [ひと]. You say 「いち」 and only use [ひと]
as part of compounds today. Of course, if we include dialects or older language, you
could reasonably find all of these, and that's the reason why I suggest having them as
an option.
- Advanced Mode: When you enable the option in the settings, the dictionary lets you
hover over entries and shows you all these outdated readings, meanings etc.
- Normal Mode: Only [ひと] is shown as what it is today, a compound-only word. You wouldn't
show the others, or at least not mark them as <pop>, as that's the main problem. All
this <pop> is making the tag useless.

OK, that is all. I think Edict2 has gone overboard. It's really sad that this dictionary
has to be so inferior to e.g. The Wisdom (Japanese-English Dictionary). It's not inferior
in that it is limited, but it goes overboard by giving basic verbs every meaning under
the sun that they could have (even when a meaning only arises by combining the word
with another one), which makes the word ordering useless.

Or maybe it is us who have to upgrade our dictionary results to be more compatible
with the new Edict2? No matter what the root cause, we can make some huge gains here
for everyone, beginner or advanced, who will use this program in the future.

Reported by supermarkus420 on 2015-08-09 14:40:46