soshial / xdxf_makedict

XDXF — an open and free dictionary format, that stores word articles in a structural and semantic way. The most convertible format

[i18n] Support of non-european languages and non-latin scripts #36

Open soshial opened 5 years ago

soshial commented 5 years ago

Here is a list of proposals:

1. Writing systems and scripts

@k-sl wrote:

<k>: This element must support defining different scripts/writing systems for the same "key phrase". This is different from different spellings, in that the user should be allowed to choose in the DS (dictionary software) settings which script they prefer, and the DS should display only the chosen script rather than repeating every key phrase as many times as there are scripts. The DS may (possibly should) display the other script(s) with the definition text, as it does with transcription, etymology, etc. If the user searches correctly for a word in a different script than the one they chose, the entry should be displayed with the word in the chosen script as headword.

An obvious example is Chinese: it is common to have all entries in both main variants, simplified and traditional Chinese, which, with the current format, means all entries are doubled in the DS, and a user will have to sift through the simplified entries even if they only read traditional Chinese (and vice versa). The same is true for any other language that can be written in more than one script/writing system (either because different areas speaking the same language use different scripts, or because it used to be written in a different script and the dictionary includes both variants).

The dictionary needs to have both simplified and traditional Chinese headwords; you need to be able to look up a word in either of the two standards, regardless of which variant is used for the definitions. You also need to be able to see the characters used in the alternative standard when looking up a word. Your suggestion would mean that all entries for which simplified and traditional characters are the same would be repeated and that, when looking up a word, the reader would have no way to know how the word is written in the other standard. Besides, most of what I'm describing already works fine in XDXF: I just add both <k> tags to each article in my Chinese dictionaries, and I've been using them like this for years. The problem is that there is no way to define which is which. This should be defined semantically, so that the DS can show which is which, hide one if the reader wants to do so, and show the preferred version first, in all Chinese dictionaries.

Proposed solution: We allow putting <k> with and without a specification, which language or script or country variant this <k> is:

<k xml:lang="zh-Hans">词典</k>
<k xml:lang="zh-Hant">詞典</k>
<k xml:lang="zh-Latn-pinyin" type="transliteration">Zhōngguó</k>
<k xml:lang="zh-Latn-wadegile" type="transliteration">Chung1-kuo2</k>
<k xml:lang="zh-Latn-pinyin" type="indexable_as">zhongguo</k>
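
A rough sketch of how a DS could use these attributes to show only the user's preferred script, assuming Python's standard xml.etree.ElementTree. The helper name pick_headword and its fallback rule are made up for illustration and are not part of XDXF:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

ARTICLE = """
<ar>
  <k xml:lang="zh-Hans">词典</k>
  <k xml:lang="zh-Hant">詞典</k>
</ar>
"""


def pick_headword(ar, preferred):
    """Return (headword, alternatives) for the user's preferred tag.

    Hypothetical helper: falls back to the first <k> when no <k>
    carries the preferred xml:lang value.
    """
    # Skip <k> elements that carry a type (transliteration, indexable_as).
    keys = [k for k in ar.findall("k") if k.get("type") is None]
    chosen = next((k for k in keys if k.get(XML_LANG) == preferred), keys[0])
    others = [k.text for k in keys if k is not chosen]
    return chosen.text, others


ar = ET.fromstring(ARTICLE)
headword, alternatives = pick_headword(ar, "zh-Hant")
print(headword, alternatives)
```

With a preference of zh-Hant the traditional form becomes the headword, and the simplified form is still available to show next to the definition.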

How to encode language and scripts? The most reasonable option, and the one that takes the least work, is to use the BCP47 standard to support various writing systems.

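What to do with multilingual dictionaries?

  • we should change lang_to and lang_from to support xml:lang

So I guess we will have to create new <!ELEMENT> inside meta_info. Am I wrong?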

For this reason, I don't think that we need additional tags <tl> (for transliteration) and <pr> (for pronunciation).

soshial commented 5 years ago

The same solution might help us get rid of <opt>. I think introducing <opt> and putting it inside <k> was a mistake. For cases such as:

Maybe a good solution would be to make it look like this:

<k xml:lang="en-US">the United States</k>
<k xml:lang="en-US" type="indexable_as">United States</k>

k-sl commented 5 years ago

Proposed solution: We allow putting <k> with and without a specification, which language or script or country variant this <k> is

I fully agree.

How to encode language and scripts? The most reasonable option, and the one that takes the least work, is to use the BCP47 standard to support various writing systems.

I agree this is probably the best option. I can see three issues:

  1. It looks inelegant and out of place. <k lang="en-US"> looks much better than <k xml:lang="en-US">, especially as we don't have any other standard "xml:" attributes.
  2. It feels redundant: it is not the language that requires specification (it is currently defined in the root tag) but the writing system or locale. It feels unnecessary to tag each keyword of an English dictionary as English just to be able to mark some as US spelling, for example, but the BCP47 standard requires this.
  3. (Much more important) I think this won't cover all possible writing systems.

But:

  1. It is just too much work to define each and every script in the XDXF DTD when essentially it would just mirror the BCP47 standard.
  2. It might eventually be necessary to submit a request to add a script to the standard or to use private-use tags, but this standard is still much more extensive than anything we could come up with by ourselves, and would be a very significant improvement for the XDXF format already as-is.

What to do with multilingual dictionaries?

  • we should change lang_to and lang_from to support xml:lang

Agreed.

So I guess we will have to create new <!ELEMENT> inside meta_info. Am I wrong?

I'm not sure, but I think you're right. Regardless, all other attributes in the root element refer to the format itself (format; revision), while all details of the dictionary content are in meta_info, so it might simply be more logical to move the language definition there anyway, and then this wouldn't be a question anymore. lang_to and lang_from are properties of the dictionary, not of the format.

For this reason, I don't think that we need additional tags <tl> (for transliteration) and <pr> (for pronunciation).

Here I disagree. Transliteration, pronunciation and key-phrases are very different semantically and should be handled differently and generally be displayed differently by the DS. This is only possible if they are defined separately.

(I will be using Chinese as an example, the situation should be similar in other non-phonetic scripts.)

<k> should be the key-phrase itself, as it would appear in a dictionary. One <ar> can have more than one <k> if they are equivalent, such as different scripts. Transliteration is the way you input the word when you're searching and don't know how the word is written (or just because it can be faster to type it than to pick the character, if you know it), or the information on how the word is read, if you know how it is written but not how it is pronounced.

E.g.: the word 词 (zh-Hans) or 詞 (zh-Hant), meaning “word”, is pronounced cí (zh-Latn-pinyin). 词 and 詞 are exactly the same word; cí is just how it is pronounced. It's not a key-phrase, since a dozen other words have the same pronunciation. So if I type 词 the DS should show this word's entry, but if I type cí it should show a list of entries with the same pronunciation, including 瓷 porcelain, 雌 female, 磁 magnetism, 慈 compassion, etc., all of which are pronounced cí. All of these words are <k> headwords; cí is just how they are pronounced.

Phonetic transliteration may also have different notations in common use, so if I type ci2, the DS should understand that as cí, i.e. ci in the second tone. If I simply type ci, it should list all words pronounced ci in any tone, including 次 cì, "time", 刺 cì, "to pierce", and 此 cǐ, "this". If I type nv, it should recognize it as "nü", since v is how "ü" is usually input in Chinese IMEs. All of this is a basic feature of any Chinese DS, and very easy to implement in the DS (all of what I mentioned is fairly regular and common in Chinese input), but only if the pinyin is defined as a transliteration of the <k> headword. You will notice that none of this has to be applied to the headword, which should be indexed as is.
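
To illustrate how regular these rules are, here is a minimal sketch of the matching in Python. It handles single syllables only, and the names parse_syllable, lookup and ENTRIES are invented for the example; a real DS would need syllable segmentation and a proper index:

```python
import re
import unicodedata

# Combining marks used for the four pinyin tones, mapped to tone numbers.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def parse_syllable(text):
    """Split one pinyin syllable into (base, tone); tone is None if absent.

    Accepts tone marks ("cí"), tone digits ("ci2") and the IME
    convention of typing "v" for "ü" ("nv").
    """
    text = text.strip().lower()
    tone = None
    match = re.fullmatch(r"(.+?)([1-5])", text)
    if match:  # trailing tone digit, e.g. "ci2"
        text, tone = match.group(1), int(match.group(2))
    base = []
    for ch in unicodedata.normalize("NFD", text):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]  # tone mark, e.g. the acute in "cí"
        else:
            base.append(ch)
    # Recompose so that u + diaeresis becomes a single "ü" again.
    base = unicodedata.normalize("NFC", "".join(base)).replace("v", "ü")
    return base, tone


def lookup(query, entries):
    """Return headwords whose transliteration matches the query."""
    q_base, q_tone = parse_syllable(query)
    hits = []
    for headword, pinyin in entries:
        base, tone = parse_syllable(pinyin)
        # A query without a tone matches every tone of the same base.
        if base == q_base and q_tone in (None, tone):
            hits.append(headword)
    return hits


# Sample data: (headword, pinyin transliteration) pairs.
ENTRIES = [("词", "cí"), ("瓷", "cí"), ("次", "cì"), ("此", "cǐ"), ("女", "nǚ")]

print(lookup("ci2", ENTRIES))
print(lookup("ci", ENTRIES))
print(lookup("nv", ENTRIES))
```

The normalisation is applied only to the transliteration and to the query; the headword itself is never touched, which is the reason for keeping the two apart.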

These tags are also different visually: <k> and the suggested <tl> should be displayed differently. Most users would expect the headword to be displayed prominently, in a significantly larger font, on top of the article, but not the transliteration.

In short, I believe the transliteration is semantically (very) different from a key-phrase and should be displayed differently by the DS. It is not a headword and should not be defined as such. The transliteration is very important for non-phonetic writing systems, and defining it separately from the headword can make it easier for DS developers to support these languages better.

Transcription and pronunciation are different from transliteration: they are used for languages that already use alphabetic scripts, they're not nearly as important (many dictionaries don't have them), they're not standardised (different dictionaries will have different pronunciations for the same word, even if published in the same country), the DS is not expected to recognize different ways of inputting them (such as v for ü, or ci2 for cí) and, at least in the case of IPA transcription, many (maybe most) users don't even know how to read it. Clearly, they should not be indexed (at least by default), as no one looks up a word by pronunciation in European languages. These two ways of representing pronunciation could possibly be defined by the same element with different attributes, but it should not be <k>: transcription and pronunciation are not headwords and should not be defined as such.

(Note:

  1. <tr>, <pr>, and <tl> are simply possible element names, it's not the names I'm proposing but the existence of these elements.
  2. Of course transliteration (to a phonetic script) and (phonetic) transcription are also ways of representing pronunciation. I'm using the word pronunciation here for the way pronunciation is usually represented in English dictionaries, i.e. pronunciation respelling.)

If I understood this correctly, the point of <opt> is to define how the headword would be sorted in a list, so headwords with articles aren't sorted by the article (correct me if I'm wrong). If so:

  1. I'm not sure it is needed at all. Most multimedia software (such as media players) is expected to ignore articles when listing titles, so it might not be too great a demand to expect the same of the DS developer. The issue I see here is that while this would be trivial for English, it might not be for other languages.
  2. It seems more appropriate to define this as an attribute to the key-phrase, e.g.:
<k xml:lang="en-US" sort_as="United States, the">the United States</k>

I find this much clearer. The DS would list it under U, not T, and if a user started typing "United", they should still easily find this entry.
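
A small sketch of what this would mean on the DS side, in Python with the standard xml.etree.ElementTree. Note that sort_as is only the attribute name suggested above, not an existing XDXF attribute, and xml:lang is left out for brevity:

```python
import xml.etree.ElementTree as ET

ARTICLES = """
<xdxf>
  <ar><k>Uganda</k></ar>
  <ar><k sort_as="United States, the">the United States</k></ar>
  <ar><k>Thailand</k></ar>
  <ar><k>Ukraine</k></ar>
</xdxf>
"""


def sort_key(k):
    """Sort by sort_as when present, else by the headword text itself."""
    return (k.get("sort_as") or k.text).casefold()


keys = ET.fromstring(ARTICLES).iter("k")
ordered = [k.text for k in sorted(keys, key=sort_key)]
print(ordered)
```

The entry keeps "the United States" as its displayed headword but is listed among the U entries, without the DS needing any per-language list of articles.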

[I'm sorry I'm not able to reply in a more timely manner. I am overloaded with work, and it took me several days just to reply to this comment of yours. And it might still not be very clear, you know, "if I had the time, I'd write a shorter reply". I am reading the comments of this repo as they are made, I just need time to reply.]

soshial commented 5 years ago

Choosing between xml:lang vs lang

First, this might be an interesting article about when to use language attributes in XML tags.

Also, I am not sure what the right way is to declare this attribute in our DTD. If I write it as <!ATTLIST k xml:lang CDATA #IMPLIED>, then to a validator xml:lang would be just another CDATA field. Elsewhere it is recommended to write it as <!ATTLIST k xml:lang NMTOKEN #IMPLIED>.

But there must be another way to encode it strictly, not just as some string. Maybe it is possible to import another standard (that contains BCP47 info) or schema and link to it?

UPD. I have asked on SO, maybe someone knows better how to write DTD.
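
Since a DTD attribute type such as CDATA or NMTOKEN cannot express the BCP47 grammar, one option would be to check the values outside the DTD, e.g. in the converter. Below is a minimal sketch in Python; the pattern is a deliberately simplified subset of the grammar (language, optional script, optional region, variants), and the names LANGTAG and parse_tag are hypothetical:

```python
import re

# Simplified BCP 47 shape: language[-script][-region][-variant...].
# Assumption: extlang, extensions, private-use and grandfathered tags
# are deliberately left out of this sketch.
LANGTAG = re.compile(
    r"""
    (?P<language>[A-Za-z]{2,3})
    (?:-(?P<script>[A-Za-z]{4}))?
    (?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)
    """,
    re.VERBOSE,
)


def parse_tag(tag):
    """Return the subtags as a dict, or None if the shape does not match."""
    match = LANGTAG.fullmatch(tag)
    if match is None:
        return None
    parts = match.groupdict()
    parts["variants"] = [v for v in parts["variants"].split("-") if v]
    return parts


print(parse_tag("zh-Hans"))
print(parse_tag("zh-Latn-CN-pinyin"))
```

This only checks the shape of a tag, not whether its subtags are registered. It also shows that the order of subtags matters: the region comes before any variant, so zh-Latn-CN-pinyin is accepted while a tag shaped like zh-Latn-pinyin-CN is rejected.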

k-sl commented 5 years ago

First, this might be an interesting article about when to use language attributes in XML tags.

I read that article before. If I understand it correctly, it's appropriate to use xml:lang when it is meant to define the language of the textual content of the element it is an attribute of. That means it's appropriate for <k> and (the suggested) <tl> tags, etc., but not for defining the languages of the dictionary.

Also, I am not sure what is the right way to encode this tag in our DTD.

You are much more knowledgeable about this than I am, so I'm not going to give opinions about this.

I think an actual example with different writing systems (simplified and traditional Chinese), variant pronunciations (Mainland and Taiwan) and different transliterations (pinyin and bopomofo) would be helpful. This is how the 各个, "every", entry from the Cross-Strait Dictionary looks in GoldenDict in my current conversion:

[Screenshot: 各个]

I've had to add 【陸】 (Mainland) and 【臺】 (Taiwan) before the pronunciations; simplified/traditional characters are not identified.

I believe that in your current proposal (and retaining my insistence on a transliteration element) this would be:

<ar>
   <k xml:lang="zh-Hans">各个</k>
   <k xml:lang="zh-Hant">各個</k>
   <tl xml:lang="zh-Latn-CN-pinyin">ɡèɡè</tl>
   <tl xml:lang="zh-Latn-TW-pinyin">ɡèɡe</tl>
   <tl xml:lang="zh-Bopo-CN">ㄍㄜˋ ㄍㄜˋ</tl>
   <tl xml:lang="zh-Bopo-TW">ㄍㄜˋ ˙ㄍㄜ</tl>
   <def>
      <def>
         <deftext>每一個。</deftext>
         <ex type="exm">
            <exorig>~角落</exorig>
         </ex>
         <ex type="exm">
            <exorig>~團體。</exorig>
         </ex>
      </def>
      <def>
         <deftext>逐一;一個個。</deftext>
         <ex type="exm">
            <exorig>~擊破</exorig>
         </ex>
         <ex type="exm">
            <exorig>將問題~提出討論並解決。</exorig>
         </ex>
      </def>
   </def>
</ar>

Is this right?

This works for me as it allows for script variants and transliterations, as well as regional differences. It is already a huge improvement for languages in non-alphabetic scripts. However, this is what I would prefer (while still using the BCP47-mandated ISO codes):

<ar>
   <k script="Hans">各个</k>
   <k script="Hant">各個</k>
   <tl system="pinyin" region="CN">ɡèɡè</tl>
   <tl system="pinyin" region="TW">ɡèɡe</tl>
   <tl system="Bopo" region="CN">ㄍㄜˋ ㄍㄜˋ</tl>
   <tl system="Bopo" region="TW">ㄍㄜˋ ˙ㄍㄜ</tl>
   <def>
      [...]
   </def>
</ar>

This is not only because it looks much clearer but also because it states plainly what is being defined. zh-Latn-CN-pinyin is harder to interpret, both for a DS and for a human; system="pinyin" region="CN" leaves no question to be asked. But, again, I see the benefit of simply applying BCP47 directly.
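
The two notations carry the same information, so a converter could translate between them. A minimal sketch in Python, where the SYSTEMS table and the function tl_to_tag are hypothetical and cover only the two systems used in this example:

```python
# Hypothetical mapping from the proposed system="..." values to
# BCP 47 (script, variant) pairs; not part of any XDXF draft.
SYSTEMS = {"pinyin": ("Latn", "pinyin"), "Bopo": ("Bopo", None)}


def tl_to_tag(language, system, region=None):
    """Build a BCP 47 tag from <tl system="..." region="..."> attributes."""
    script, variant = SYSTEMS[system]
    # BCP 47 order: language, script, region, variant.
    parts = [language, script, region, variant]
    return "-".join(p for p in parts if p)


print(tl_to_tag("zh", "pinyin", "CN"))
print(tl_to_tag("zh", "Bopo", "TW"))
```

The language code has to come from somewhere else (for example the dictionary's language definition), which is the redundancy point made earlier about tagging every keyword with its language.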

soshial commented 2 years ago

I think that it is important to introduce new tags slowly, since dictionary software is not very fast to accommodate changes. I have 2 solutions:

  1. We put both words and transliteration/romanization variants inside the same <k> tag and leave the logic of showing the transliteration/romanization correctly (based on the xml:lang attribute) to the dictionary software.
  2. We add type="tl" like this: <k type="tl" xml:lang="zh-Latn-CN-pinyin">ɡèɡè</k>, and also add a couple of other types: spelling variant, historical spelling, etc.

Option 1 would look like this:

<ar>
   <k xml:lang="zh-Hans">各个</k>
   <k xml:lang="zh-Hant">各個</k>
   <k xml:lang="zh-Latn-CN-pinyin">ɡèɡè</k>
   <k xml:lang="zh-Latn-TW-pinyin">ɡèɡe</k>
   <k xml:lang="zh-Bopo-CN">ㄍㄜˋ ㄍㄜˋ</k>
   <k xml:lang="zh-Bopo-TW">ㄍㄜˋ ˙ㄍㄜ</k>
   <def>
       ...
   </def>
</ar>

Either way, all <k> will still be shown by the old DS.
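
A rough sketch of the difference between an old and a new DS under option 2, in Python with the standard xml.etree.ElementTree. The type="tl" value is the one proposed above; the variable names are only illustrative:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

ARTICLE = """
<ar>
  <k xml:lang="zh-Hans">各个</k>
  <k xml:lang="zh-Hant">各個</k>
  <k type="tl" xml:lang="zh-Bopo-TW">ㄍㄜˋ ˙ㄍㄜ</k>
</ar>
"""

ar = ET.fromstring(ARTICLE)

# An older DS ignores the unknown attributes and lists every <k>.
old_view = [k.text for k in ar.findall("k")]

# A newer DS could split headwords from transliterations by type.
headwords = [k.text for k in ar.findall("k") if k.get("type") != "tl"]
translits = {k.get(XML_LANG): k.text for k in ar.findall("k[@type='tl']")}

print(old_view)
print(headwords, translits)
```

Under option 1 there is no type attribute, so the same split would have to be inferred from the xml:lang value alone.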