tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

[error] derived terms are skipped in hierarchical entries (like Proto-Indo-European) #148

Closed alexchandel closed 1 year ago

alexchandel commented 2 years ago

Unlike in living languages, the derived terms of Proto-Indo-European words are given in a hierarchical list rather than a flat list. But wiktextract/kaikki only picks up the "deepest" entry. For example, in the PIE term sek-, the first derived term (reproduced below) should be "sek-eh₂-yé-ti" and the second "sekajō", but these are skipped in favor of "secō". These intermediate derived terms should not be skipped over.

alexchandel commented 2 years ago

@kristian-clausal Where in the code is this handled?

kristian-clausal commented 2 years ago

I assume it's not, at least for this kind of list. We're prioritizing living and attested languages, so most peculiarities in proto-language and Reconstruction: category entries are just run through as if they were normal entries in a normal Wiktionary article, hence why they're handled badly.

You probably should not rely on data in Reconstruction: entries generated by wiktextract. This may change in the future, but we have a lot on our plates still with everything else and creating or interweaving all the code needed specifically to parse Reconstruction: entries is a lot.

jmviz commented 2 years ago

@alexchandel In case you're interested in running wiktextract locally to get this information, I have a fork that outputs basic data for Descendants and PIE Derived terms/Extensions sections. It outputs an array of objects, one per line in the list, each with data like wiktextract's etymology_templates/etymology_text. A depth key records the nesting level of each line. Since the objects are in the same order as the lines in the wikitext, you can recover the full tree structure by tracking the depth while iterating through the objects. Here's what the beginning of the output for sek- looks like:

"descendants": [
    {
      "depth": 1,
      "tags": [
        "derived"
      ],
      "templates": [
        {
          "args": {
            "1": "ine-pro",
            "2": "",
            "3": "*sek-eh₂-yé-ti"
          },
          "expansion": "*sek-eh₂-yé-ti",
          "name": "l"
        },
        {
          "args": {
            "1": "ine-pro",
            "2": "",
            "3": "*sek-h₁-yé-ti"
          },
          "expansion": "*sek-h₁-yé-ti",
          "name": "l"
        }
      ],
      "text": "*sek-eh₂-yé-ti or *sek-h₁-yé-ti"
    },
    {
      "depth": 2,
      "templates": [
        {
          "args": {
            "1": "itc-pro",
            "2": "*sekajō"
          },
          "expansion": "Proto-Italic: *sekajō",
          "name": "desc"
        }
      ],
      "text": "Proto-Italic: *sekajō"
    },
    {
      "depth": 3,
      "templates": [
        {
          "args": {
            "1": "la",
            "2": "secō"
          },
          "expansion": "Latin: secō",
          "name": "desc"
        },
        {
          "args": {},
          "expansion": "(see there for further descendants)",
          "name": "see desc"
        }
      ],
      "text": "Latin: secō (see there for further descendants)"
    },
    {
      "depth": 1,
      "tags": [
        "derived"
      ],
      "templates": [
        {
          "args": {
            "1": "ine-pro",
            "2": "*skey-",
            "3": "*sk-éy-ti",
            "pos": "*éy-present"
          },
          "expansion": "*sk-éy-ti (*éy-present)",
          "name": "l"
        }
      ],
      "text": "*sk-éy-ti (*éy-present)"
    },
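
The depth tracking described above could be sketched roughly as follows. This is only an illustration, not code from the fork: `build_tree`, the `children` key, and the trimmed `sample` list are hypothetical names introduced here, and the sketch assumes each object carries at least the `depth` and `text` keys shown in the output.

```python
def build_tree(lines):
    """Nest a flat, depth-annotated list of lines into a tree.

    Hypothetical helper: assumes each item has "depth" and "text" keys
    and that items appear in the same order as in the wikitext.
    """
    roots = []
    stack = []  # (depth, node) pairs forming the current chain of ancestors
    for line in lines:
        node = {"text": line["text"], "children": []}
        depth = line["depth"]
        # Discard entries at the same depth or deeper; they cannot be parents.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        if stack:
            stack[-1][1]["children"].append(node)
        else:
            roots.append(node)
        stack.append((depth, node))
    return roots


# Trimmed copy of the output above, keeping only the keys the sketch needs.
sample = [
    {"depth": 1, "text": "*sek-eh₂-yé-ti or *sek-h₁-yé-ti"},
    {"depth": 2, "text": "Proto-Italic: *sekajō"},
    {"depth": 3, "text": "Latin: secō (see there for further descendants)"},
    {"depth": 1, "text": "*sk-éy-ti (*éy-present)"},
]

tree = build_tree(sample)
```

With this input, `tree` holds two top-level nodes, and the Latin line ends up nested under the Proto-Italic line, which is itself nested under the first PIE derived term. Storing the depth next to each node on the stack keeps the nesting sensible even if a list happens to skip a level.
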
alexchandel commented 2 years ago

Would be nice to merge this fork. "Descendants," "Extensions," "Derived terms" are all standard sections according to Wiktionary's entry layout guidelines.