wareya / nazeka_epwing_converter

NOT READY YET - Command line tool for converting zero-epwing dictionaries to nazeka json dictionaries

Kenkyusha support #2

Open epistularum opened 5 years ago

epistularum commented 5 years ago

I'll re-open this here then:

Something I've always wished for from any pop-up dictionary is good support for the kenkyusha waei dictionary. The best support at the moment is still rikaisama, and a lot of people stick to firefox 57/waterfox/palemoon... for the sole purpose of rikaisama and its regex deletion.

It's a good occasion to improve on rikaisama's EPWING feature (especially for kenkyusha, due to its large number of examples and whatnot).

For the moment, with rikaisama and its regex feature we can clean up the definition field, BUT in the process we're forced to delete all the examples. I've been trying to find a fix using regex but I'm not competent enough to make a breakthrough. Here's what I've figured out so far.

Example of an entry:

まにあう【間に合う】 ローマ(maniau)
1 〔時間に遅れない〕 be in time 《for…》.
▲7 時の列車に間に合う catch [make] the 7 o'clock train
・締め切りに間に合う meet the deadline
・開演に間に合う arrive before curtain time
▲9 時の札幌行きに間に合うように空港に着いた. I arrived in time for the nine o'clock flight to Sapporo.
・「間に合うかな」「走っても間に合いそうにないね」 "Will we be in time?"―"It doesn't look like we'll be in time even if we run."
2 〔役に立つ〕 answer [serve, suit, meet] the purpose; be useful; be serviceable; be of use [service]; be good enough; 〔十分である〕 be enough; 〔用意ができる〕 be ready; 〔必要をみたす〕 meet the requirements; serve the [one's] turn [need].
▲「費用はどのぐらいかな」「5 万もあれば間に合うよ」 "And what is the expense?"―"Fifty-thousand yen should cover it."
・これだけあれば丸 1 年は間に合う. This will last us [see us through] one whole year. | This will be enough for a whole year.

Here, all lines starting with "▲" or "・" are examples, and all lines starting with one of the characters in the bracketed class below are definitions:

Regular expression that matches everything that is not a definition : \n[^″*〖〈《⇒=➡【〔(〜A-Za-z0-9].*

Regular expression that matches definitions+one line below : \n[″*〖〈《⇒=➡【〔(〜A-Za-z0-9].*\n.*

The perfect result should look like this (keeping one example for each definition):

まにあう【間に合う】 ローマ(maniau)
1 〔時間に遅れない〕 be in time 《for…》.
▲7 時の列車に間に合う catch [make] the 7 o'clock train
2 〔役に立つ〕 answer [serve, suit, meet] the purpose; be useful; be serviceable; be of use [service]; be good enough; 〔十分である〕 be enough; 〔用意ができる〕 be ready; 〔必要をみたす〕 meet the requirements; serve the [one's] turn [need].
▲「費用はどのぐらいかな」「5 万もあれば間に合うよ」 "And what is the expense?"―"Fifty-thousand yen should cover it."
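
The "one example per definition" filtering described above can be sketched in Python. This is only an illustration under stated assumptions, not how rikaisama or the converter works: `keep_first_example` is a hypothetical helper, the definition test reuses the character class from the regexes above, example lines are assumed to start with "▲", "►" or "・", and the first line is assumed to be the headword.

```python
import re

# Character class taken from the regexes in the comment above: a line that
# starts with one of these characters is treated as a definition.
DEFINITION_START = re.compile(r"[″*〖〈《⇒=➡【〔(〜A-Za-z0-9]")

# Assumption: example lines start with one of these markers.
EXAMPLE_MARKERS = ("▲", "►", "・")


def keep_first_example(entry: str) -> str:
    """Keep the headword, every definition, and the first example after each definition."""
    lines = entry.split("\n")
    kept = lines[:1]        # assumption: the first line is always the headword
    example_taken = False   # examples before any definition: keep only the first
    for line in lines[1:]:
        if line.startswith(EXAMPLE_MARKERS):
            if not example_taken:
                kept.append(line)
                example_taken = True
        elif DEFINITION_START.match(line):
            kept.append(line)
            example_taken = False
        # Any other line (neither definition nor example) is dropped.
    return "\n".join(kept)
```

Walking the lines once with a flag avoids the problem of a single regex having to both delete and preserve examples.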

tldr : adding support for kenkyusha while keeping only 1 example for each definition would be a godsend for anyone learning Japanese and will finally achieve a breakthrough in the dictionary pop up app nonsense.

epistularum commented 5 years ago

I don't really have any programming skill but I can do some basic regex so if dictionaries are parsed using regex somewhere I'd love to help!!!

wareya commented 5 years ago

The kenkyuusha/wadai converter here right now has three modes: no stripping, light stripping, and heavy stripping.

Light stripping looks like this:

    "1 〔時間に遅れない〕 be in time 《for…》.",
    "►7 時の列車に間に合う catch [make] the 7 o'clock train",
    "►9 時の札幌行きに間に合うように空港に着いた. I arrived in time for the nine o'clock flight to Sapporo.",
    "2 〔役に立つ〕 answer [serve, suit, meet] the purpose; be useful; be serviceable; be of ┏use [service]; be good enough; 〔十分である〕 be enough; 〔用意ができる〕 be ready; 〔必要をみたす〕 meet the requirements; serve ┏the [one's] turn [need].",
    "►「費用はどのぐらいかな」「5 万もあれば間に合うよ」 \"And what is the expense?\"―\"Fifty-thousand yen should cover it.\""

And heavy stripping looks like this:

    "1 〔時間に遅れない〕 be in time 《for…》.",
    "2 〔役に立つ〕 answer [serve, suit, meet] the purpose; be useful; be serviceable; be of ┏use [service]; be good enough; 〔十分である〕 be enough; 〔用意ができる〕 be ready; 〔必要をみたす〕 meet the requirements; serve ┏the [one's] turn [need]."

I was going to add a "one example per sense" mode instead of a "primary examples only" (light stripping) mode, but for some reason I decided that the "main example" vs "subexample" distinction was good enough, since some words are defined entirely via a first example and then go on to have an additional example:

(heavy stripping)

  "r": "まんまん",
  "s": [
    "満々"
  ],
  "l": [
    "►満々たる full of 《ambition》; brimming with 《vigor》; filled with 《courage》"
  ]

(light stripping)

  "r": "まんまん",
  "s": [
    "満々"
  ],
  "l": [
    "►満々たる full of 《ambition》; brimming with 《vigor》; filled with 《courage》",
    "►挑戦者は闘志満々だった. The challenger was full of spirit."
  ]
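
As a rough sketch of the three modes described in this comment (not the converter's actual code, which parses entries algorithmically), the behaviour on an already-split list of lines could look like the following. `strip_examples` and the mode names are hypothetical, and "►" and "・" are assumed to be the only example markers.

```python
MAIN_EXAMPLE = "►"  # first example of a block
SUB_EXAMPLE = "・"   # further examples in the same block


def strip_examples(lines: list[str], mode: str = "light") -> list[str]:
    """Filter example lines from one entry.

    mode "none":  keep everything
    mode "light": drop "・" subexamples, keep "►" examples
    mode "heavy": drop all examples
    """
    if mode == "none":
        return list(lines)
    if mode == "light":
        dropped = (SUB_EXAMPLE,)
    elif mode == "heavy":
        dropped = (MAIN_EXAMPLE, SUB_EXAMPLE)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    kept = [line for line in lines if not line.startswith(dropped)]
    if not kept:
        # The entry had nothing but examples: keep the first one so that
        # the entry is not left empty.
        kept = lines[:1]
    return kept
```

With the two まんまん lines shown above as input, "heavy" returns only the first "►" line and "light" returns both, matching the outputs in this comment.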

wareya commented 5 years ago

ps: dictionary entries are parsed algorithmically so that the converter can do things like stop parsing once the definition starts listing related words instead of senses/examples of the word itself (kenkyuusha is guilty of this).

epistularum commented 5 years ago

> "main example" vs "subexample"

What do you mean by that? Are you referring to examples starting with "►" and examples starting with "・"? Since I see that the first entry has two examples and the second has only one I assume that you are doing just that, keeping only the example sentences starting with "►". Kenkyusha refers to these as "blocks", where an example starting with "►" marks the start of a new block. But the sentence starting with "►" is not in fact a "main example" or a better definition; it's merely a delimiter for what Kenkyusha calls "blocks".

As defined by Kenkyusha:

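用例の始まりに. 1 つの見出し語について用例がいくつかの「ブロック」に分かれているときは (凡例 2 本文 (1)-(3) 参照) 各ブロックの始まりごとにこの記号を繰り返した.

(Roughly: "At the start of examples. When the examples for a single headword are divided into several 'blocks' (see Explanatory Notes 2, Main Text (1)-(3)), this symbol is repeated at the start of each block.")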

I may be completely wrong but I have a hard time grasping the meaning of Kenkyusha's explanation. I've asked a native speaker's opinion on this matter and his take on it is that every block represents a particular use case.

You've already done some impressive work! Do you have discord/irc/whatever so we could discuss this more easily, if you don't mind?

wareya commented 5 years ago

> What do you mean by that? Are you referring to examples starting with "►" and examples starting with "・"?

Yes.

> Since I see that the first entry has two examples and the second has only one I assume that you are doing just that, keeping only the example sentences starting with "►".

I only do that for entries that don't actually have a definition, only examples. I gave two examples: the first shows only the ・ examples stripped vs. all examples stripped; the second shows all ► examples present vs. a single ► example left unstripped because the entry had no actual definition.

> I may be completely wrong but I have a hard time grasping the meaning of Kenkyusha's explanation.

I am considering the example at the beginning of a block of examples to be the "main" example of that block even though the text you quoted doesn't call it such, because each block of examples tends to have a clear theme, and the blocks tend not to have the kind of consistent length or complexity that would indicate that they're completely arbitrary. If nothing else, it should be perfectly fine to consider the first example in a block of examples to be the main example in that block.

Kenkyuusha doesn't really follow its own formatting guidelines all the time anyway; there are plenty of entries whose formatting is inconsistent with how it should be.

> I've asked a native speaker's opinion on this matter and his take on it is that every block represents a particular use case.

Yeah, you can say that.

> You've already done some impressive work! Do you have discord/irc/whatever so we could discuss this more easily, if you don't mind?

I try to avoid realtime conversations about stuff like this, sorry about that.

epistularum commented 5 years ago

> I only do that for entries that don't actually have a definition, only examples. I gave two examples: the first shows only the ・ examples stripped vs. all examples stripped; the second shows all ► examples present vs. a single ► example left unstripped because the entry had no actual definition.

A single example sentence is not reliable enough to explain the meaning of a word: its English translation could be an interpretation (meaning not a direct translation), or even contain an idiom, and so on. That is why I think we cannot rely solely on example sentences to convey the meaning of a word and thus such entries shouldn't appear at all in Nazeka.

I am well aware Kenkyusha has a lot of entries composed only of examples (even for very basic words), but that is why we have to introduce a dictionary fallback feature.

> I am considering the example at the beginning of a block of examples to be the "main" example of that block even though the text you quoted doesn't call it such, because each block of examples tends to have a clear theme, and the blocks tend not to have the kind of consistent length or complexity that would indicate that they're completely arbitrary. If nothing else, it should be perfectly fine to consider the first example in a block of examples to be the main example in that block.

I agree with you that it's safe to assume that the "►" example is the "main" example; I was just pointing out that it's an assumption. Moreover, as you said, the "►" examples do not seem to be arbitrary, but on the other hand Kenkyusha doesn't always follow its own rules, which makes them unreliable. The bottom line is that I think "►" examples should be viewed as "high priority" and "・" examples as "low priority". For instance, let's say we come to the consensus that each definition should have 2 example sentences. The "►" examples should have priority over "・" examples, so if a definition has 2 "►" examples and one "・" example, the two "►" examples should be picked over the "・" example. I'm just spewing out random thoughts; realistically it's really not that important, and parsing only "►" sentences, or even picking randomly for that matter, is perfectly fine.
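
The priority idea in this comment can be sketched as a small selection function. This is a hypothetical illustration only: `pick_examples` and its `limit` parameter are not part of the converter, and it assumes the examples for one definition have already been collected into a list.

```python
HIGH_PRIORITY = "►"  # assumption: every other example counts as low priority


def pick_examples(examples: list[str], limit: int = 2) -> list[str]:
    """Pick up to `limit` examples, preferring "►" examples over "・" examples."""
    # Rank indices so that "►" examples come first; sorted() is stable, so
    # examples of equal priority stay in their original order.
    ranked = sorted(
        range(len(examples)),
        key=lambda i: not examples[i].startswith(HIGH_PRIORITY),
    )
    # Restore document order among the chosen examples.
    chosen = sorted(ranked[:limit])
    return [examples[i] for i in chosen]
```

A "・" example is only picked when there are fewer than `limit` "►" examples for that definition.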

On another topic, I see that your take on example sentences is to keep all the "►" ones. Is there any instance where keeping all of those greatly clutters the output? For instance having 10 "►" examples. Is there any instance where a definition contains only "・"? I'm using an epwing viewer so it's hard to batch check for such instances, do you have a txt version of kenkyusha for easier regex parsing?

> I try to avoid realtime conversations about stuff like this, sorry about that.

No worries.

epistularum commented 5 years ago

If I figured out regex strings for other dictionaries, would that help you at all? Honestly regex is the only thing I'm half good at when it comes to programming, so I would like to at least be able to help with that.

wareya commented 5 years ago

> That is why I think we cannot rely solely on example sentences to convey the meaning of a word and thus such entries shouldn't appear at all in Nazeka.

It tends to work well enough. You're not going to learn the "true meaning" of a word from a dictionary anyway, and I think it's worth including them for the few people who will use nazeka with jmdict definitions completely disabled.

The idea behind filtering out most of the examples is just to reduce how many there are anyway. I don't want to have to build and collect blocks and decide which sentences to keep algorithmically; making it "the first example in each block" is good enough for me. There's always the mode that doesn't filter out any examples at all, after all.

> Is there any instance where keeping all of those greatly clutters the output? For instance having 10 "►" examples.

Yes.

> Is there any instance where a definition contains only "・"?

If there are then I probably output the first "・" example.

> I'm using an epwing viewer so it's hard to batch check for such instances, do you have a txt version of kenkyusha for easier regex parsing?

You can run it through foosoft's zero-epwing manually to get something usable for that. That's the first step in the nazeka dictionary conversion process anyway.
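
Once the entry texts are available in some usable form, the two batch checks asked about above (entries crowded with "►" examples, and entries whose only examples are "・" ones) could be sketched as below. This is a hypothetical helper: `find_outliers` expects (heading, text) pairs, and loading those pairs from zero-epwing's output is deliberately left out because the layout of that output is not shown in this thread.

```python
from collections import Counter
from typing import Iterable

MARKERS = ("►", "・")  # assumption: the example markers used in the dumped text


def count_markers(entry_text: str) -> Counter:
    """Count how many lines of one entry start with each example marker."""
    counts = Counter()
    for line in entry_text.split("\n"):
        for marker in MARKERS:
            if line.startswith(marker):
                counts[marker] += 1
    return counts


def find_outliers(entries: Iterable[tuple[str, str]], max_main: int = 10):
    """Return (crowded, only_sub): headings of entries with at least
    `max_main` "►" examples, and headings of entries that have "・"
    examples but no "►" example at all."""
    crowded, only_sub = [], []
    for heading, text in entries:
        counts = count_markers(text)
        if counts["►"] >= max_main:
            crowded.append(heading)
        if counts["・"] > 0 and counts["►"] == 0:
            only_sub.append(heading)
    return crowded, only_sub
```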

> If I figured out regex strings for other dictionaries, would that help you at all?

Nah, I'm writing the converters as algorithms anyway and all I really have to do is keep adding/changing filtering rules until it does what I want.

epistularum commented 5 years ago

> > That is why I think we cannot rely solely on example sentences to convey the meaning of a word and thus such entries shouldn't appear at all in Nazeka.

> It tends to work well enough. You're not going to learn the "true meaning" of a word from a dictionary anyway, and I think it's worth including them for the few people who will use nazeka with jmdict definitions completely disabled.

I understand what you mean, even though I think relying on only one dictionary is a mistake. Would it be easy to incorporate a toggle option that gets rid of entries that only contain examples? Either in the epwing_converter itself or in Nazeka. It's really not that important, but if it's easy to implement, why not.

> > Is there any instance where a definition contains only "・"?

> If there are then I probably output the first "・" example.

So if we have:

word
definition 1
・example 1
・example 2

It would output

word
definition 1
・example 1

? Or are you talking about entries with no definition, only example sentences?

> Nah, I'm writing the converters as algorithms anyway and all I really have to do is keep adding/changing filtering rules until it does what I want.

If I can help in any way, tell me.

wareya commented 5 years ago

> Or are you talking about entries with no definition, only example sentences?

This.