themoeway / yomitan

Japanese pop-up dictionary browser extension. Successor to Yomichan.
https://chromewebstore.google.com/detail/yomitan/likgccmbimhjbgkjambclfkhldnlhbnn
GNU General Public License v3.0

Deinflection data format update #581

Closed toasted-nutbread closed 5 months ago

toasted-nutbread commented 6 months ago

I'm messing around with the idea of updating the structure of the deinflection file to support a few things:

  1. Better clarity - things like #547 would be a bit more explicitly defined rather than having to manually define it using bitflags magic.
  2. More generalized - should be more generalized for other languages and use less Japanese-specific naming.
  3. Internationalization - names/descriptions can have different variations provided for other languages.
  4. Extensibility - Internationalization features and new rules can eventually be imported into a single deinflector. This will require changes to the deinflector code obviously, but the intent is to make the source data format more conducive for this.
  5. Cleaner code - less manual definition of bitflags will be needed; the bitflags can be automatically generated from the input file(s).
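As a rough illustration of point 5, here is a minimal TypeScript sketch of deriving the bitflags from the rule names instead of writing them by hand. The names `generateRuleFlags` and `combineFlags` are hypothetical and are not existing deinflector functions; the only assumption about the input is that it provides a list of unique rule names.

```typescript
// Hypothetical helpers (not part of the current deinflector code).
type RuleFlags = Record<string, number>;

// Assigns one bit per rule name, in input order.
function generateRuleFlags(ruleNames: string[]): RuleFlags {
    // JavaScript bitwise operators work on 32-bit integers; capping at 31
    // rules keeps every generated flag a positive number.
    if (ruleNames.length > 31) {
        throw new Error('Too many rules for a single 32-bit mask');
    }
    const flags: RuleFlags = {};
    for (let i = 0; i < ruleNames.length; ++i) {
        const ruleName = ruleNames[i];
        if (Object.prototype.hasOwnProperty.call(flags, ruleName)) {
            throw new Error('Duplicate rule name: ' + ruleName);
        }
        flags[ruleName] = 1 << i;
    }
    return flags;
}

// Combines several rule names into a single mask.
function combineFlags(flags: RuleFlags, ruleNames: string[]): number {
    let mask = 0;
    for (const ruleName of ruleNames) {
        if (!Object.prototype.hasOwnProperty.call(flags, ruleName)) {
            throw new Error('Unknown rule name: ' + ruleName);
        }
        mask |= flags[ruleName];
    }
    return mask;
}

const exampleFlags = generateRuleFlags(['v1', 'v5', 'vk', 'vs', 'vz', 'adj-i']);
const exampleVerbMask = combineFlags(exampleFlags, ['v1', 'v5']);
```

With something like this, the data file only ever mentions rule names, and the numeric values become an implementation detail of whatever loads the file.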

So here's somewhat of a preview of what might work:

{
    "rules": {
        "v1": {
            "name": "Ichidan verb",
            "partsOfSpeech": ["v1"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "一段動詞"
                }
            ],
            "subRules": ["v1d", "v1p"]
        },
        "v1d": {
            "name": "Ichidan verb, dictionary form",
            "partsOfSpeech": ["v1"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "一段動詞、辞書形"
                }
            ]
        },
        "v1p": {
            "name": "Ichidan verb, progressive or perfect form",
            "partsOfSpeech": ["v1"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "一段動詞、進行形または完了形"
                }
            ]
        },
        "v5": {
            "name": "Godan verb",
            "partsOfSpeech": ["v5"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "五段動詞"
                }
            ]
        },
        "vk": {
            "name": "Kuru verb",
            "partsOfSpeech": ["vk"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "来る動詞"
                }
            ]
        },
        "vs": {
            "name": "Suru verb",
            "partsOfSpeech": ["vs"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "する動詞"
                }
            ]
        },
        "vz": {
            "name": "Zuru verb",
            "partsOfSpeech": ["vz"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "ずる動詞"
                }
            ]
        },
        "adj-i": {
            "name": "Adjective with i ending",
            "partsOfSpeech": ["adj-i"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "形容詞"
                }
            ]
        },
        "iru": {
            "name": "Intermediate -iru endings for progressive or perfect tense",
            "partsOfSpeech": []
        }
    },
    "transforms": [
        {
            "name": "-ba",
            "description": "Conditional",
            "i18n": [
                {
                    "language": "ja",
                    "name": "ば",
                    "description": "仮定形"
                }
            ],
            "variants": [
                {"suffixIn": "ければ", "suffixOut": "い", "rulesIn": [], "rulesOut": ["adj-i"]},
                {"suffixIn": "えば", "suffixOut": "う", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "けば", "suffixOut": "く", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "げば", "suffixOut": "ぐ", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "せば", "suffixOut": "す", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "てば", "suffixOut": "つ", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "ねば", "suffixOut": "ぬ", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "べば", "suffixOut": "ぶ", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "めば", "suffixOut": "む", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "れば", "suffixOut": "る", "rulesIn": [], "rulesOut": ["v1", "v5", "vk", "vs", "vz"]}
            ]
        }
    ]
}
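
To make the intent of `suffixIn`, `suffixOut`, `rulesIn`, and `rulesOut` a bit more concrete, here is a minimal TypeScript sketch of a single deinflection step over the variants of one transform. The field names come from the preview above; `VariantSketch`, `Candidate`, and `applyTransform` are hypothetical names. The rule check is an assumption: a candidate with no rules is treated as unconstrained, and otherwise it must share at least one rule with `rulesIn`.

```typescript
// Field names mirror the preview JSON; type and function names are hypothetical.
interface VariantSketch {
    suffixIn: string;
    suffixOut: string;
    rulesIn: string[];
    rulesOut: string[];
}

interface Candidate {
    term: string;
    rules: string[];   // rules this candidate may satisfy; empty means unconstrained
    reasons: string[]; // names of the transforms applied so far
}

// A subset of the "-ba" variants from the preview.
const baVariants: VariantSketch[] = [
    {suffixIn: 'ければ', suffixOut: 'い', rulesIn: [], rulesOut: ['adj-i']},
    {suffixIn: 'けば', suffixOut: 'く', rulesIn: [], rulesOut: ['v5']},
    {suffixIn: 'れば', suffixOut: 'る', rulesIn: [], rulesOut: ['v1', 'v5', 'vk', 'vs', 'vz']}
];

function hasSuffix(text: string, suffix: string): boolean {
    return text.length >= suffix.length && text.slice(text.length - suffix.length) === suffix;
}

function sharesRule(a: string[], b: string[]): boolean {
    for (const rule of a) {
        if (b.indexOf(rule) >= 0) { return true; }
    }
    return false;
}

// Applies every matching variant once and returns the new candidates.
function applyTransform(source: Candidate, transformName: string, variants: VariantSketch[]): Candidate[] {
    const results: Candidate[] = [];
    for (const variant of variants) {
        if (!hasSuffix(source.term, variant.suffixIn)) { continue; }
        // Assumption: only a candidate with no rules skips the rulesIn check.
        if (source.rules.length > 0 && !sharesRule(source.rules, variant.rulesIn)) { continue; }
        const stem = source.term.slice(0, source.term.length - variant.suffixIn.length);
        results.push({
            term: stem + variant.suffixOut,
            rules: variant.rulesOut,
            reasons: source.reasons.concat([transformName])
        });
    }
    return results;
}

const baResults = applyTransform({term: '食べれば', rules: [], reasons: []}, '-ba', baVariants);
```

For 食べれば only the れば variant matches, giving 食べる. For an input like 高ければ both the ければ and れば variants match, giving 高い and 高ける; the expectation is that a later dictionary lookup, filtered by `rulesOut`, would discard the candidate that is not a real word.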

A few notes:

Thoughts on naming:

Overall, I'm not sure what the best naming strategy is for everything in here, so I'm open to suggestions. Primarily, I'm not sure if "rule" is a good name for how it's being used here. The same goes for "variants", but I couldn't immediately think of anything clearer. I tried to avoid having both "rule" and "reason", since I think the two can be easily confused. So some of the current types I'm looking at for the raw JSON file would be something like:

  • Transformation
  • TransformationVariant
  • TransformationRule

Again, please share any thoughts on alternative ways to name these.
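
For reference while discussing names, the raw JSON could be typed roughly as follows in TypeScript, using the candidate names Transformation, TransformationVariant, and TransformationRule. This is only a sketch that mirrors the preview; `I18nEntry`, `DeinflectionData`, and `getLocalizedName` are hypothetical names added for illustration, and which fields are optional is an assumption.

```typescript
// Hypothetical shapes mirroring the preview JSON; optionality is assumed.
interface I18nEntry {
    language: string;
    name: string;
    description?: string;
}

interface TransformationRule {
    name: string;
    partsOfSpeech: string[];
    i18n?: I18nEntry[];
    subRules?: string[];
}

interface TransformationVariant {
    suffixIn: string;
    suffixOut: string;
    rulesIn: string[];
    rulesOut: string[];
}

interface Transformation {
    name: string;
    description?: string;
    i18n?: I18nEntry[];
    variants: TransformationVariant[];
}

interface DeinflectionData {
    rules: Record<string, TransformationRule>;
    transforms: Transformation[];
}

// Returns the name for the requested language, falling back to the default name.
function getLocalizedName(item: {name: string; i18n?: I18nEntry[]}, language: string): string {
    for (const entry of item.i18n || []) {
        if (entry.language === language) { return entry.name; }
    }
    return item.name;
}

// A trimmed-down copy of the preview data.
const sampleData: DeinflectionData = {
    rules: {
        'v1': {
            name: 'Ichidan verb',
            partsOfSpeech: ['v1'],
            i18n: [{language: 'ja', name: '一段動詞'}],
            subRules: ['v1d', 'v1p']
        },
        'iru': {
            name: 'Intermediate -iru endings for progressive or perfect tense',
            partsOfSpeech: []
        }
    },
    transforms: [
        {
            name: '-ba',
            description: 'Conditional',
            i18n: [{language: 'ja', name: 'ば', description: '仮定形'}],
            variants: [
                {suffixIn: 'ければ', suffixOut: 'い', rulesIn: [], rulesOut: ['adj-i']}
            ]
        }
    ]
};

const sampleJapaneseName = getLocalizedName(sampleData.rules['v1'], 'ja');
```

Whatever names end up being chosen, the nesting stays the same: one transform owns several variants, and each variant refers to rules by key.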

Related links:

Casheeew commented 6 months ago

> Thoughts on naming:
>
> Overall, I'm not sure what the best naming strategy is for everything in here, so I'm open to suggestion. Primarily, I'm not sure if "rule" is a good name for how it's being used here. Similar for "variants", but I couldn't immediately think of anything that is more clear. I tried to avoid having both "rule" and "reason" since I think the two can be easily confused. So some of the current types I'm looking at for the raw JSON file would be something like:
>
>   • Transformation
>   • TransformationVariant
>   • TransformationRule
>
> Again please provide any thoughts on alternate ways to name these.
>
> Related links:

I think a good name should convey that a Transformation is the larger unit and a TransformationVariant the smaller one. I was pretty confused about what TransformationVariant was supposed to be when I first read the deinflector code (I didn't think of it as a subclass of transformations).

Some names I can think of:

  • Transformation and AtomicTransformation (inspired by Rust and C++; "atomic" also signals that this is the smallest possible case, so the relationship can be easily inferred from Transformation, TransformationChain, etc.)
  • TransformationGroup and Transformation
  • Transformation and TransformationCase