wareya / Spark-Reader

A tool to assist non-native speakers in reading Japanese
GNU General Public License v3.0

Support the same sqlite dictionaries as JGlossator #4

Open SpongebobSquamirez opened 5 years ago

SpongebobSquamirez commented 5 years ago

JGlossator has a number of sqlite dictionaries, and also allows importing a few dictionaries (including monolingual ones). Could you allow the same dictionaries to be ported to Spark reader? I'm particularly interested in accent information and monolingual definitions.

wareya commented 5 years ago

I'm sure this is possible, but this program is basically abandoned. The current recommendation is to use a clipboard grabber and a mouseover dictionary that supports extra dictionaries.

SpongebobSquamirez commented 5 years ago

I didn't realize the project was abandoned. It doesn't seem to be a very bad thing, and I think I was referred here by a friend who says he either knows you or has a mutual acquaintance, so I assumed it was still supported at least.

Do you know of any such mouseover dictionaries for desktop? The best thing I know of is the aforementioned JGlossator, but it doesn't hover over programs like this one does, which is pretty handy.

I'm asking because I made an extension to Capture2Text that plays well with this kind of clipboard detector dictionary thing (as Capture2Text itself does).

wareya commented 5 years ago

Most mouseover dictionaries are browser addons, like yomichan, rikaichamp, and my own nazeka. If you don't want to go with that kind of setup, then I guess jglossator or spark reader are fine, but things on this end of the learning tool spectrum tend not to be maintained well.

Spark reader is abandoned because the original developer moved on from it, partly over bad architectural/design decisions and partly from just wanting to do other things, or at least that's what I gather. I'm not about to fully maintain it myself.

LaurensWeyn commented 5 years ago

Original Spark Reader dev here (this project is a fork of the original).

I have this bad tendency to want to do major refactors of old projects and then not completely finish them. Spark Reader went through something like that, though not as bad as it usually is.

The refactored branch works, but it's not exactly stable enough for a public release. Being the developer, I still use it myself and 'work around' the issues almost subconsciously and fix whatever I really need when I need it, but I've been too lazy to actually fix all the little bugs for another public release, so it's kind of sitting there.

I also feel like the current build is "good enough" or mostly feature complete/functional.

My Japanese is getting to a point where I don't need to look up every second word so I've been trying to read without Spark Reader as well, and I have less free time with a full time job, so my interest in maintaining it has dropped. However, I'm also soon going to finally attend formal Japanese classes this year so who knows, maybe I'll pick it up again in a few months.

Anyway, I do have interest in adding support for more dictionaries, especially since I've found Edict dictionaries rather underwhelming (at least, the ones I found). If you can find me some dictionaries you're interested in, and documentation on how they're formatted (though it should be self-explanatory given some sample sqlite dictionary files), I'll look into adding support.

Don't count on it coming out in the next few weeks (or maybe even at all), but I at least still have interest in working on it. New dictionary support could be some nice motivation to resume development.

wareya commented 5 years ago

Random exposition dump. The typical way of handling epwing dictionaries right now is to convert them to json with zero-epwing, then again convert them to an application-specific format. Yomichan uses a zip file with a collection of json files in it, but I'm not sure about the specific format. Yomichan's format supports tags. Nazeka uses a single json file with a format that looks something like this:

[
  {
    "r": "つるぎ",
    "s": [
      ""
    ],
    "l": [
      "line one of a definition",
      "line two of a definition",
      "etc",
      "last line of a definition"
    ]
  },
  {
    "r": "こだいしゅ",
    "s": [
      "古代種",
      "古代しゅ"
    ],
    "l": [
      "the ancients"
    ]
  },
// and so on
]

Unlike yomichan, nazeka's format doesn't support tags. It depends on jmdict being present to be able to disinflect verbs/adjectives to find root forms to look up in the json dictionary.
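
For anyone wanting to consume a file in that shape, here is a minimal Python sketch of loading it and building a lookup table. It assumes that "r" is the reading, "s" is the list of spellings (with an empty string for kana-only entries), and "l" is the list of definition lines; that is only how the example above reads, not a documented spec. The `build_index` name is made up for illustration, and disinflection is left out entirely.

```python
import json
from collections import defaultdict

# Sample data in the shape shown above (valid JSON, so no trailing comment).
SAMPLE = """
[
  {"r": "つるぎ", "s": [""], "l": ["line one of a definition", "last line of a definition"]},
  {"r": "こだいしゅ", "s": ["古代種", "古代しゅ"], "l": ["the ancients"]}
]
"""

def build_index(entries):
    """Map every reading and every non-empty spelling to the entries that use it.

    Assumption: "r" = reading, "s" = spellings, "l" = definition lines.
    """
    index = defaultdict(list)
    for entry in entries:
        keys = {entry["r"], *entry.get("s", [])}
        for key in keys:
            if key:  # skip the empty spelling used by kana-only entries
                index[key].append(entry)
    return index

entries = json.loads(SAMPLE)
index = build_index(entries)
print(index["古代種"][0]["l"])
```

A lookup tool would first reduce the hovered text to candidate root forms and then check each candidate against `index`.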

Part-of-speech information in epwing dictionaries is usually quite bad, or has inconsistent formatting, but that's an unrelated issue.

The main gotcha here is that zero-epwing doesn't (can't) convert 外字 to unicode, because they're just references to tiny pixelated images, so you have to dig up a conversion table made by someone else or dump the images (with zero-epwing) and make such a table on your own. Yomichan Import has some tables, and so does nazeka's epwing converter.
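
A rough Python sketch of applying such a 外字 (gaiji) conversion table to converted text. It assumes the converter leaves placeholders like `{{w_1234}}` or `{{n_1234}}` in the output; check the marker syntax against what your converter actually emits. The table entries here are invented for illustration, since real tables are specific to each dictionary.

```python
import re

# Hypothetical table: gaiji code -> Unicode replacement.
# Real tables are per dictionary; these two entries are made up.
GAIJI_TABLE = {
    "w_0001": "①",
    "n_0002": "é",
}

# Assumed placeholder syntax: {{w_NNNN}} (wide) or {{n_NNNN}} (narrow).
PLACEHOLDER = re.compile(r"\{\{([nw]_\d+)\}\}")

def replace_gaiji(text, table, missing="〓"):
    """Swap each placeholder for its table entry, or `missing` if unknown."""
    return PLACEHOLDER.sub(lambda m: table.get(m.group(1), missing), text)

print(replace_gaiji("a{{w_0001}}b{{w_9999}}", GAIJI_TABLE))
```

Using a visible fallback character for unknown codes makes it easy to spot which entries still need to be added to the table.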

SpongebobSquamirez commented 5 years ago

JGlossator (linked above) has a detailed description of their sqlite format on their webpage.

I realized, though, after re-reading the manual (I swear I looked at it before!) that Spark Reader supports EPWING, and all of my dictionaries are actually (originally) in EPWING form, so I can just use those. (Yes, I'm an idiot.)

I was primarily interested in the NHKアクセント辞典, 新明解国語辞典, and 大辞林(三省堂).

wareya commented 5 years ago

Converting epwing dictionaries to a new format has the benefit that you can actually structure them properly for automated lookup. This requires different logic for each unique dictionary you want to convert though.
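
To illustrate what "different logic for each dictionary" can look like, here is a hedged Python sketch with one parser per dictionary behind a small registry. The heading shape it handles (a reading followed by spellings in 【】, separated by ・) is an assumption for illustration only; "example-dictionary", `parse_bracket_heading`, and `convert` are hypothetical names, and real dictionaries will each need their own rules.

```python
import re

# Assumed heading shape: reading, then optional spellings inside 【】 split by ・
HEADING = re.compile(r"^(?P<reading>[^【]+)(?:【(?P<spellings>[^】]*)】)?")

def parse_bracket_heading(heading):
    """Split a heading like 'よみ【表記・表記】' into (reading, spellings)."""
    m = HEADING.match(heading)
    if m is None:
        raise ValueError(f"unrecognized heading: {heading!r}")
    reading = m.group("reading").strip()
    spellings = m.group("spellings")
    # Kana-only entries get a single empty spelling, as in the JSON example.
    return reading, spellings.split("・") if spellings else [""]

# One parser per source dictionary; the key is a made-up name.
PARSERS = {"example-dictionary": parse_bracket_heading}

def convert(dictionary_name, heading, body):
    """Turn one raw entry into the r/s/l shape shown earlier in the thread."""
    reading, spellings = PARSERS[dictionary_name](heading)
    return {"r": reading, "s": spellings, "l": body.splitlines()}

print(convert("example-dictionary", "こだいしゅ【古代種・古代しゅ】", "the ancients"))
```

Supporting another dictionary then means writing one more parser function and registering it, without touching the rest of the pipeline.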