Open soshial opened 5 years ago
The same solution might help us get rid of <opt>: I think introducing <opt> and putting it inside <k> was a mistake. For cases such as:
Maybe a good solution would be making it look like this:
<k xml:lang="en-US">the United States</k>
<k xml:lang="en-US" type="indexable_as">United States</k>
Proposed solution: We allow <k> with or without a specification of which language, script, or country variant this <k> is.
I fully agree.
How to encode language and scripts? The most reasonable option, and the one that takes the least work, is to use the BCP47 standard to support various writing systems.
I agree this is probably the best option. I can see three issues:
<k lang="en-US"> looks much better than <k xml:lang="en-US">, especially as we don't have any other "xml:" standard tags. But:
What to do with multilingual dictionaries?
- we should change lang_to and lang_from to support xml:lang
Agreed.
So I guess we will have to create a new <!ELEMENT> inside meta_info. Am I wrong?
I'm not sure, I think you're right. Regardless, all other attributes in the root element refer to the format itself (format; revision), while all details of the dictionary content are in meta_info, so it might simply be more logical to move the language definition there anyway, and then this wouldn't be a question anymore. lang_to and lang_from are properties of the dictionary, not of the format.
For this reason, I don't think that we need additional tags <tl> (for transliteration) and <pr> (for pronunciation).
Here I disagree. Transliteration, pronunciation and key-phrases are very different semantically and should be handled differently and generally be displayed differently by the DS. This is only possible if they are defined separately.
(I will be using Chinese as an example, the situation should be similar in other non-phonetic scripts.)
<k> should be the key-phrase itself, as it would appear in a dictionary. One <ar> can have more than one <k> if they are equivalent, such as different scripts. Transliteration is the way you input the word when you're searching and don't know how the word is written (or just because it can be faster to type it than to pick the character, if you know it), or the information on how the word is read, if you know how it is written but not how it is pronounced.
E.g.: the word 词 (zh-Hans) or 詞 (zh-Hant), meaning “word”, is pronounced cí (zh-Latn-pinyin). 词 and 詞 are exactly the same word, cí is just how it is pronounced; it's not a key-phrase, a dozen other words have the same pronunciation. So if I type 词 the DS should show this word's entry, but if I type cí it should show a list of entries with the same pronunciation, including 瓷 porcelain, 雌 female, 磁 magnetism, 慈 compassion, etc., all of which are pronounced cí. All of these words are <k> headwords, cí is just how they are pronounced. Phonetic transliteration may also have different notations in common use, so if I type ci2, the DS should understand that as cí, i.e. ci in the second tone. If I simply type ci it should list all words pronounced ci in any tone, including 次 cì, "time", 刺 cì, "to pierce", and 此 cǐ, "this". If I type nv, it should recognize it as "nü", since v is how "ü" is usually input in Chinese IMEs. All of this is a basic feature of any Chinese DS, and very easy to implement in the DS (all of what I mentioned is fairly regular and common in Chinese input), but only if the pinyin is defined as a transliteration of the <k> headword. You will notice that none of what I mentioned has to be applied to the headword, which should be indexed as is.
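To make the lookup behaviour described above concrete, here is a minimal Python sketch of how a DS might normalize pinyin queries. The function names (`normalize_pinyin`, `build_index`, `lookup`) and the sample data are hypothetical, it handles single syllables only, and the 女 nǚ entry is added purely to illustrate the "v" convention; it is not taken from any existing DS.

```python
import unicodedata

# Combining diacritics used for the four pinyin tones, mapped to tone numbers.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def normalize_pinyin(text):
    """Return (toneless_syllable, tone); tone is None when the query gives none.

    Accepts tone marks ("cí"), trailing tone digits ("ci2"), and "v" for "ü".
    """
    text = text.strip().lower().replace("v", "ü")
    tone = None
    if text and text[-1] in "12345":
        tone = int(text[-1])
        text = text[:-1]
    base = []
    for ch in unicodedata.normalize("NFD", text):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            base.append(ch)
    # Recompose so that "u" + combining diaeresis becomes "ü" again.
    return unicodedata.normalize("NFC", "".join(base)), tone


def build_index(entries):
    """Index (headword, pinyin) pairs by their normalized transliteration."""
    index = {}
    for headword, pinyin in entries:
        index.setdefault(normalize_pinyin(pinyin), []).append(headword)
    return index


def lookup(index, query):
    """List headwords matching the query; a toneless query matches any tone."""
    syllable, tone = normalize_pinyin(query)
    return [
        headword
        for (s, t), headwords in index.items()
        if s == syllable and (tone is None or t == tone)
        for headword in headwords
    ]


# Illustrative data only: headwords stay untouched, only the pinyin is normalized.
ENTRIES = [("词", "cí"), ("瓷", "cí"), ("雌", "cí"),
           ("次", "cì"), ("此", "cǐ"), ("女", "nǚ")]
INDEX = build_index(ENTRIES)
```

With this index, `lookup(INDEX, "ci2")` and `lookup(INDEX, "cí")` return the same three entries, `lookup(INDEX, "ci")` returns all five ci entries, and `lookup(INDEX, "nv")` finds 女. None of this normalization touches the headwords themselves, which is only possible when the DS can tell the two kinds of data apart.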
Also, visually these tags are different: <k> and the suggested <tl> should be displayed differently. Most users would expect the headword to be displayed prominently, in a significantly larger font, on top of the article, but not the transliteration.
In short, I believe the transliteration is semantically (very) different from a key-phrase and should be displayed differently by the DS. It is not a headword and should not be defined as such. The transliteration is very important for non-phonetic writing systems, and defining it separately from the headword can make it easier for DS developers to support these languages better.
Transcription and pronunciation are different from transliteration: they are used for languages that already use alphabetic scripts, they're not nearly as important (many dictionaries don't have them), they're not standardised (different dictionaries will have different pronunciations for the same word, even if published in the same country), it's not expected that the DS will recognize different ways of inputting them (such as v for ü, or ci2 for cí) and, at least in the case of IPA transcription, many (maybe most) users don't even know how to read it. Clearly, they should not be indexed (at least by default), as no one looks up a word by pronunciation in European languages. These two ways of representing may possibly be defined by the same element with different attributes, but it should not be <k>: transcription and pronunciation are not headwords and should not be defined as such.
(Note: <tr>, <pr>, and <tl> are simply possible element names; it's not the names I'm proposing but the existence of these elements.)
<opt>
If I understood this correctly, the point of <opt> is to define how the headword would be sorted in a list, so headwords with articles aren't sorted by the article (correct me if I'm wrong). If so:
<k xml:lang="en-US" sort_as="United States, the">the United States</k>
I find this much clearer. The DS would list it under U, not T, and a user who started typing "United" should still easily find this entry.
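As a rough illustration of how a DS could use such an attribute, here is a small Python sketch that sorts headwords by `sort_as` when present and by the displayed text otherwise. Note that `sort_as` is only the attribute suggested in this comment, not part of the current format, and the `<xdxf>` wrapper, the sample entries, and the function names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the first entry carries the suggested sort_as.
SAMPLE = """
<xdxf>
  <ar><k xml:lang="en-US" sort_as="United States, the">the United States</k></ar>
  <ar><k>Uganda</k></ar>
  <ar><k>Thailand</k></ar>
  <ar><k>Uruguay</k></ar>
</xdxf>
"""


def sort_key(k):
    """Use sort_as if given, otherwise fall back to the displayed headword."""
    return k.get("sort_as", k.text or "").casefold()


def sorted_headwords(xml_text):
    """Return the displayed headwords in list order."""
    root = ET.fromstring(xml_text)
    return [k.text for k in sorted(root.iter("k"), key=sort_key)]
```

Here `sorted_headwords(SAMPLE)` places "the United States" between "Uganda" and "Uruguay", while the text shown to the user keeps its article.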
[I'm sorry I'm not able to reply in a more timely manner. I am overloaded with work, and it took me several days just to reply to this comment of yours. And it might still not be very clear, you know, "if I had the time, I'd write a shorter reply". I am reading the comments of this repo as they are made, I just need time to reply.]
xml:lang vs lang
First, it might be an interesting article about when to use language attributes in XML tags.
Also, I am not sure what is the right way to encode this tag in our DTD. If I write it as <!ATTLIST k xml:lang CDATA #IMPLIED>, then for the validator xml:lang would be just another CDATA field. In one place it is recommended to write it as <!ATTLIST k xml:lang NMTOKEN #IMPLIED>.
But there must be another way to encode it strictly, not just as some string. Maybe it is possible to import another standard (that contains BCP47 info) or schema and link to it?
UPD. I have asked on SO, maybe someone knows better how to write DTD.
First, it might be an interesting article about when to use language attributes in XML tags.
I read that article before. If I understand it correctly, it's appropriate to use xml:lang when it is meant to define the language of the textual content of the element it is an attribute of. That means it's appropriate for <k> and (the suggested) <tl> tags, etc., but not to define the languages of the dictionary.
Also, I am not sure what is the right way to encode this tag in our DTD.
You are much more knowledgeable about this than me, so I'm not going to give opinions about this.
I think an actual example with different writing systems (simplified and traditional Chinese), variant pronunciations (Mainland and Taiwan) and different transliterations (pinyin and bopomofo) would be helpful. This is how the 各个, "every", entry from the Cross-Strait Dictionary looks in GoldenDict in my current conversion:
I've had to add 【陸】 Mainland and 【臺】 Taiwan before the pronunciations; simplified/traditional characters are not identified.
I believe in your current proposal (and retaining my insistence on a transliteration element) this would be:
<ar>
  <k xml:lang="zh-Hans">各个</k>
  <k xml:lang="zh-Hant">各個</k>
  <tl xml:lang="zh-Latn-pinyin-CN">ɡèɡè</tl>
  <tl xml:lang="zh-Latn-pinyin-TW">ɡèɡe</tl>
  <tl xml:lang="zh-Bopo-CN">ㄍㄜˋ ㄍㄜˋ</tl>
  <tl xml:lang="zh-Bopo-TW">ㄍㄜˋ ˙ㄍㄜ</tl>
  <def>
    <def>
      <deftext>每一個。</deftext>
      <ex type="exm">
        <exorig>~角落</exorig>
      </ex>
      <ex type="exm">
        <exorig>~團體。</exorig>
      </ex>
    </def>
    <def>
      <deftext>逐一;一個個。</deftext>
      <ex type="exm">
        <exorig>~擊破</exorig>
      </ex>
      <ex type="exm">
        <exorig>將問題~提出討論並解決。</exorig>
      </ex>
    </def>
  </def>
</ar>
Is this right?
This works for me as it allows for script variants and transliterations, as well as regional differences. It is already a huge improvement for languages in non-alphabetic scripts. However, this is what I would prefer (while still using the BCP47-mandated ISO codes):
<ar>
  <k script="Hans">各个</k>
  <k script="Hant">各個</k>
  <tl system="pinyin" region="CN">ɡèɡè</tl>
  <tl system="pinyin" region="TW">ɡèɡe</tl>
  <tl system="Bopo" region="CN">ㄍㄜˋ ㄍㄜˋ</tl>
  <tl system="Bopo" region="TW">ㄍㄜˋ ˙ㄍㄜ</tl>
  <def>
    [...]
  </def>
</ar>
This is because it just looks much clearer, but also because it states plainly what is being defined. zh-Latn-pinyin-CN is harder to interpret both for a DS and for a human; system="pinyin" region="CN" leaves no question to be asked. But, again, I see the benefit of simply applying BCP47 directly.
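For comparison, here is a hedged Python sketch of what a DS would need to do to recover the same script, system, and region information from a single BCP47-style tag. It classifies subtags by their shape only (four letters for a script, two letters or three digits for a region, anything else treated as a variant), which is a simplification of the real grammar, and `split_tag` is a hypothetical helper. One thing worth checking against the standard: as far as I read the BCP47 syntax, the region subtag comes before variants, so the strictly well-formed spelling would be zh-Latn-CN-pinyin; the shape-based approach below accepts either order.

```python
def split_tag(tag):
    """Split a BCP47-style tag into named parts, classifying subtags by shape.

    Simplified on purpose: extended language subtags, extensions, and
    private-use subtags are not handled and would end up under "variants".
    """
    language, *rest = tag.split("-")
    parts = {"language": language.lower(), "script": None,
             "region": None, "variants": []}
    for sub in rest:
        if len(sub) == 4 and sub.isalpha():
            parts["script"] = sub.title()       # e.g. Hans, Hant, Latn, Bopo
        elif (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit()):
            parts["region"] = sub.upper()       # e.g. CN, TW
        else:
            parts["variants"].append(sub.lower())  # e.g. pinyin
    return parts
```

For example, `split_tag("zh-Latn-pinyin-CN")` yields script `Latn`, region `CN`, and variants `["pinyin"]`, which maps directly onto the explicit attributes, so the two notations carry the same information and differ mainly in readability.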
I think that it is important to introduce new tags slowly, since dictionary software is not very fast to accommodate changes. I have 2 solutions:
1. Use only the <k> tag and leave the logic of showing transliteration/romanization correctly (based on the xml:lang attribute) to the dictionary software?
2. Add type="tl", like this: <k type="tl" xml:lang="zh-Latn-pinyin-CN">ɡèɡè</k>. And also add a couple of other types: spelling variant, historical spelling etc.
<ar>
  <k xml:lang="zh-Hans">各个</k>
  <k xml:lang="zh-Hant">各個</k>
  <k xml:lang="zh-Latn-pinyin-CN">ɡèɡè</k>
  <k xml:lang="zh-Latn-pinyin-TW">ɡèɡe</k>
  <k xml:lang="zh-Bopo-CN">ㄍㄜˋ ㄍㄜˋ</k>
  <k xml:lang="zh-Bopo-TW">ㄍㄜˋ ˙ㄍㄜ</k>
  <def>
    ...
  </def>
</ar>
Either way, all <k> elements will still be shown by the old DS.
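If the logic is left to the dictionary software, a newer DS could separate headwords from transliterations by inspecting xml:lang. The Python sketch below assumes, purely for illustration, that a `<k>` whose tag contains the script subtag Latn or Bopo is a transliteration in a Chinese dictionary; that rule, the `split_keys` helper, and the trimmed-down article are my assumptions, not something the format defines.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumption for a Chinese dictionary: these scripts mark transliterations.
TRANSLITERATION_SCRIPTS = {"Latn", "Bopo"}

ARTICLE = """
<ar>
  <k xml:lang="zh-Hans">各个</k>
  <k xml:lang="zh-Hant">各個</k>
  <k xml:lang="zh-Latn-pinyin-CN">ɡèɡè</k>
  <k xml:lang="zh-Latn-pinyin-TW">ɡèɡe</k>
  <k xml:lang="zh-Bopo-CN">ㄍㄜˋ ㄍㄜˋ</k>
  <k xml:lang="zh-Bopo-TW">ㄍㄜˋ ˙ㄍㄜ</k>
</ar>
"""


def split_keys(ar):
    """Return (headwords, transliterations) for one <ar> element."""
    headwords, transliterations = [], []
    for k in ar.iter("k"):
        subtags = k.get(XML_LANG, "").split("-")
        if TRANSLITERATION_SCRIPTS.intersection(subtags):
            transliterations.append(k.text)
        else:
            headwords.append(k.text)
    return headwords, transliterations


HEADWORDS, TRANSLITERATIONS = split_keys(ET.fromstring(ARTICLE))
```

An old DS that ignores xml:lang would simply treat all six `<k>` elements as headwords, which matches the backward-compatibility point above. The catch is that the script-based rule only works for languages whose headwords never use those scripts, which is one argument for an explicit type attribute or a separate element.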
Here is a list of proposals:
1. Writing systems and scripts
@k-sl wrote:
Proposed solution: We allow <k> with or without a specification of which language, script, or country variant this <k> is.
How to encode language and scripts? The most reasonable option, and the one that takes the least work, is to use the BCP47 standard to support various writing systems.
What to do with multilingual dictionaries?
- we should change lang_to and lang_from to support xml:lang and allow us to encode several languages for multilingual dictionaries. This is not possible with <!ATTLIST>, I think. So I guess we will have to create a new <!ELEMENT> inside meta_info. Am I wrong?
For this reason, I don't think that we need additional tags <tl> (for transliteration) and <pr> (for pronunciation).