tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

Should senses be nested under etymologies? #117

Closed Manishearth closed 2 years ago

Manishearth commented 2 years ago

One thing I'm noticing is that the "senses" and "etymologies" entries are stored separately. This seems a bit odd: Wiktionary organizes senses by etymology, so shouldn't each sense be listed together with its corresponding etymology?

I may be misunderstanding what's going on here.

Manishearth commented 2 years ago

Actually, I guess this matters most for Chinese characters: the senses should nest under their etymology, or at least be associated with numbered etymologies.

Manishearth commented 2 years ago

(In Chinese, different etymologies are quite often entirely different words.)

Manishearth commented 2 years ago

Oh, I think I see how this is handled now. Each etymology gets split into its own separate dictionary entry. Good to know; this is exactly the desired behavior!

There are two topleve

{"senses": [{"raw_glosses": ["(dialectal Mandarin, Cantonese, dialectal Gan, Hakka, dialectal Wu, Xiang) to not have; to not exist"], "examples": [{"text": "你冇女朋友?梗系唔信啦!……或者有。你唔知啫。 [Cantonese, simp.]From: 陳慧嫻, 紅茶館nei⁵ mou⁵ neoi⁵ pang⁴ jau⁵? gang² hai⁶ m⁴ seon³ laa¹!...... waak⁶ ze² jau⁵. nei⁵ m⁴ zi¹ ze¹. [Jyutping]You don't have a girlfriend? Of course I don't believe you! ... Maybe you do have one, but you just don't know it.", "ref": "你冇女朋友?梗係唔信啦!……或者有。你唔知啫。 [Cantonese, trad.]", "type": "example"}], "categories": ["Cantonese Chinese", "Cantonese terms with quotations", "Gan Chinese", "Hakka Chinese", "Mandarin Chinese", "Wu Chinese", "Xiang Chinese"], "tags": ["Cantonese", "Gan", "Hakka", "Mandarin", "Wu", "Xiang", "dialectal"], "glosses": ["to not have; to not exist"]}, {"raw_glosses": ["(dialectal Mandarin, Cantonese, Xiang) have not; did not (do something) (indicating non-completion of a verb)"], "examples": [{"text": "我冇同佢講。 / 我冇同佢讲。 [Cantonese] ― ngo⁵ mou⁵ tung⁴ keoi⁵ gong². [Jyutping] ― I did not tell him.", "type": "example"}, {"text": "佢几个月都冇嚟。 [Cantonese, simp.]keoi⁵ gei² go³ jyut⁶ dou¹ mou⁵ lai⁴. [Jyutping]He hasn't come for a few months.", "ref": "佢幾個月都冇嚟。 [Cantonese, trad.]", "type": "example"}], "categories": ["Cantonese Chinese", "Cantonese terms with usage examples", "Mandarin Chinese", "Xiang Chinese"], "tags": ["Cantonese", "Mandarin", "Xiang", "dialectal"], "glosses": ["have not; did not (do something) (indicating non-completion of a verb)"]}, {"raw_glosses": ["(dialectal Cantonese, Nanning Pinghua) not (negator)"], "categories": ["Cantonese Chinese", "Nanning Pinghua"], "tags": ["Cantonese", "dialectal"], "glosses": ["not (negator)"]}], "pos": "character", "head_templates": [{"name": "head", "args": {"1": "zh", "2": "Han characters"}, "expansion": "冇"}, {"name": "zh-hanzi", "args": {}, "expansion": "冇"}], "categories": ["Chinese Han characters", "Chinese adjectives", "Chinese hanzi", "Chinese lemmas", "Chinese terms with IPA pronunciation", "Chinese verbs", "Kenny's testing category 2", "Requests for native script for Saek terms"], "etymology_text": "From 無 (MC mɨo, “to not have”), fused with 有 (MC ɦɨu^X, “to have”) (Schuessler, 2007).", "etymology_templates": [{"name": "ltc-l", "args": {"1": "無", "2": "to not have"}, "expansion": "無 (MC mɨo, “to not have”)"}, {"name": "ltc-l", "args": {"1": "有", "2": "to have"}, "expansion": "有 (MC ɦɨu^X, “to have”)"}, {"name": "zh-ref", "args": {"1": "Schuessler, 2007"}, "expansion": "Schuessler, 2007"}], "sounds": [{"ipa": "/mo¹¹/"}], "word": "冇", "lang": "Chinese", "lang_code": "zh"}
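
Since each etymology ends up as its own entry, anyone who wants the Wiktionary-style nesting back can regroup entries by word and etymology text. Below is a minimal Python sketch of that idea, not part of wiktextract itself. It assumes the output is read as one JSON object per line and relies only on fields visible in the entry above (`word`, `lang_code`, `etymology_text`, `senses`, `glosses`). The function name `group_by_etymology` and the placeholder etymology strings and second gloss in `sample` are made up for illustration.

```python
import json
from collections import defaultdict


def group_by_etymology(lines):
    """Collect glosses per (word, lang_code), keyed by etymology text.

    `lines` is any iterable of JSON strings, one entry per string
    (assumed input format; adapt if your data is stored differently).
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        key = (entry.get("word"), entry.get("lang_code"))
        # Entries without an etymology are grouped under the empty string.
        etymology = entry.get("etymology_text", "")
        for sense in entry.get("senses", []):
            grouped[key][etymology].extend(sense.get("glosses", []))
    return grouped


# Hypothetical, heavily abbreviated entries: the etymology strings and the
# second gloss are placeholders, not real Wiktionary data.
sample = [
    json.dumps({
        "word": "冇", "lang_code": "zh",
        "etymology_text": "etymology A (placeholder)",
        "senses": [{"glosses": ["to not have; to not exist"]}],
    }),
    json.dumps({
        "word": "冇", "lang_code": "zh",
        "etymology_text": "etymology B (placeholder)",
        "senses": [{"glosses": ["placeholder gloss"]}],
    }),
]

for (word, lang_code), etymologies in group_by_etymology(sample).items():
    for etymology, glosses in etymologies.items():
        print(word, lang_code, etymology, glosses)
```

Grouping on the etymology text keeps the sketch independent of any numbering scheme; if the data carries a more reliable identifier for each etymology, that would be a better key.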