themoeway / kaikki-to-yomitan

Yomitan-compatible dictionaries from Wiktionary data
https://github.com/themoeway/kaikki-to-yomitan/releases

Dictionary structure #55

Open daxida opened 1 month ago

daxida commented 1 month ago

I'm sorry if this is not the right place to ask.

I recently found this repository via wiktextract and would like to do something similar for another API. I browsed for a while but could not find a description of the JSON entries that are used, like this one here.

I understand that the dictionary Yomitan expects looks something like this:

[
  [
    word,
    "",  # what is this?
    "v vt", # some grammar tag but not sure about the difference with the next one.
    "v",
    0, # what is this?
    list of translations,
    0, # what is this?
    "" # what is this?
  ],
  etc.
]

Could you give me some headers? I've also tried the Yomitan repo but I could not find much information about it. Maybe it's a standard dictionary format that I'm unaware of?

StefanVukovic99 commented 1 month ago

It's described by a schema in Yomitan: https://github.com/themoeway/yomitan/blob/master/ext/data/schemas/dictionary-term-bank-v3-schema.json. Some of the fields are pretty much obsolete, I'd say. There are other schemas in that folder for the IPA, index.json, etc.
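
For a quick sanity check before loading a dictionary, here is a minimal Python sketch of the shape that schema describes: each entry is an array of exactly 8 items. The helper name `check_term_entry` and the position labels are hypothetical, and the type list is an assumption based on a reading of the v3 schema, so this does not replace validating against the schema file itself.

```python
def check_term_entry(entry):
    """Return a list of problems found in one term bank entry (empty if none)."""
    if not isinstance(entry, list) or len(entry) != 8:
        return ["entry must be an array of exactly 8 items"]

    # Assumed types per position, based on a reading of the v3 term bank schema.
    expected = [
        ("term", str),
        ("reading", str),
        ("definition tags", (str, type(None))),
        ("rule identifiers", str),
        ("score", (int, float)),
        ("definitions", list),
        ("sequence number", int),
        ("term tags", str),
    ]

    problems = []
    for position, (value, (name, types)) in enumerate(zip(entry, expected), start=1):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            problems.append(
                f"item {position} ({name}) has unexpected type {type(value).__name__}"
            )
    return problems


sample = ["word", "", "v vt", "v", 0, ["translation"], 0, ""]
print(check_term_entry(sample))
```

An empty list means the entry has the expected shape; it says nothing about whether the tags or rule identifiers are meaningful.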

Just to make sure you're not doing more than you need, this is converting something other than kaikki, i.e. this is separate from https://github.com/tatuylonen/wiktextract/discussions/651?

daxida commented 1 month ago

Thank you for the link.

Unfortunately, I'm still having trouble parsing that schema. Is it obvious from the schema which field maps to which of the lines I annotated above?

And thank you for your concern: this is a separate matter. I didn't mention it before because I was afraid of being instantly dismissed for being off topic. I'm toying with the idea of making a Yomitan-compatible dictionary like yours from a website called LingQ.

Their entries are very simple in comparison to that schema:

{
  "pk": 459243703,
  "url": "https://www.lingq.com/api/v3/el/cards/459243703/",
  "term": "εκφώνησής",
  "fragment": "διαδικασία προγραμματισμού της εκφώνησής σας, να",
  "importance": 0,
  "status": 0,
  "extended_status": null,
  "last_reviewed_correct": null,
  "srs_due_date": "2023-09-12T08:26:23.907721",
  "notes": "",
  "audio": null,
  "words": [
    "εκφώνησής"
  ],
  "tags": [],
  "hints": [
    {
      "id": 129173102,
      "locale": "en",
      "text": "of reading (aloud)",
      "term": "εκφώνησής",
      "popularity": 2,
      "is_google_translate": true,
      "flagged": false
    }
  ],
  "transliteration": {
    "latin": [
      "ekfonisis"
    ]
  },
  "gTags": [],
  "wordTags": [],
  "readings": {

  },
  "writings": [
    "εκφώνησής",
    "εκφωνησης"
  ]
}

There are some fields, like fragment (a sort of example sentence), that I'm still not sure where to put.

StefanVukovic99 commented 1 month ago

Here's some more details:

[
  "居住者",
  "きょじゅうしゃ",
  "n",
  "",
  604,
  [
    "resident",
    "inhabitant"
  ],
  1717870,
  "P news"
]

[Image: screenshot from 2024-06-03 22-21-12]

  1. The term/expression/headword
  2. The "reading" - in Japanese, this is the term in kana, used to disambiguate between pronunciations. In other languages it can be used in a similar way, or to display the term with optional diacritics, e.g. in Latin and Farsi. [Images: latin occido, farsi]
  3. Definition tags - these are abbreviations referring to the full tags defined in tag_bank_1.json. They can be about the part of speech, but also usage qualifiers (rare, archaic, vulgar...), field (law, biology, astronomy...), or region (British/American and such). When you click on them, the full tag name is shown.
  4. Rule identifiers - these refer to the conditions defined in the "transforms" (aka deinflections) file for that language (see english-transforms.js) and help make deinflection more precise. If a language has no deinflection yet, they are unnecessary.
  5. Score - basically vestigial IMO; frequency dictionaries have made it obsolete for sorting.
  6. Array of definitions. These can be simple strings, but also "structured" definitions (HTML-like, which lets you make fancy definitions; you might want to use this to format your example sentences and whatnot) and "deinflection definitions" (redirects to another dictionary entry, which can be used for conjugated forms, alternate written forms...).
  7. Sequence number -
    {
    "type": "integer",
    "description": "Sequence number for the term. Terms with the same sequence number can be shown together when the \"resultOutputMode\" option is set to \"merge\"."
    },

    idk really, probably safe to ignore, just set it to 0

  8. Term tags - these also refer to the tag_bank, but they are supposed to relate to the term, not the definition (see the first image). I don't think these see much use these days.
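
Putting the eight positions together, here is a rough Python sketch of how the LingQ card quoted earlier in the thread might be turned into one term bank entry. Everything about the mapping is an assumption rather than an established convention: the helper name `card_to_term_entry` is hypothetical, the Latin transliteration is reused as the "reading", the hint texts become plain-string definitions, and the fragment is appended as a structured-content definition wrapped in a single `div`.

```python
import json

# A trimmed copy of the LingQ card quoted earlier in this thread.
card = {
    "term": "εκφώνησής",
    "fragment": "διαδικασία προγραμματισμού της εκφώνησής σας, να",
    "hints": [{"locale": "en", "text": "of reading (aloud)"}],
    "transliteration": {"latin": ["ekfonisis"]},
}


def card_to_term_entry(card, include_fragment=True):
    """Build one 8-item term bank entry from a LingQ-style card (hypothetical helper)."""
    # Position 6: plain-string definitions taken from the hints.
    definitions = [hint["text"] for hint in card.get("hints") or []]

    # Assumption: show the fragment as a structured-content definition.
    fragment = card.get("fragment")
    if include_fragment and fragment:
        definitions.append({
            "type": "structured-content",
            "content": {"tag": "div", "content": fragment},
        })

    # Assumption: reuse the first Latin transliteration as the "reading".
    latin = (card.get("transliteration") or {}).get("latin") or []
    reading = latin[0] if latin else ""

    return [
        card["term"],  # 1. term
        reading,       # 2. reading
        "",            # 3. definition tags (none)
        "",            # 4. rule identifiers (no deinflection)
        0,             # 5. score
        definitions,   # 6. definitions
        0,             # 7. sequence number
        "",            # 8. term tags (none)
    ]


entry = card_to_term_entry(card)
# A term bank file is a JSON array of such entries.
print(json.dumps([entry], ensure_ascii=False, indent=2))
```

Passing `include_fragment=False` keeps the definitions as plain strings only, which is the simplest form to start with.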