soshial / xdxf_makedict

XDXF — an open and free dictionary format, that stores word articles in a structural and semantic way. The most convertible format
XDXF format comments, suggestions and needed corrections #30

Open k-sl opened 7 years ago

k-sl commented 7 years ago

@soshial and whoever else is involved in developing the XDXF format standard:

I recently realized that there are a number of great free Chinese and Japanese dictionaries, but unfortunately each is made available in its own specific format, which means it takes a specific tool to read it. This made me start looking for a good dictionary format (preferably XML) that could be used for any language. I found that format in XDXF, which I consider the closest thing we have to an ideal open and global dictionary format standard. As I set out to write a converter for the Chinese-English CC-CEDICT dictionary, I unfortunately also noticed many problems with the format, some of them serious enough to prevent a good dictionary conversion (from non-alphabetic languages), some just minor (or major) inconveniences.

What follows is a series of comments, corrections, criticism and proposals about the XDXF standard.

There are four main points where I think improvement is needed for XDXF to fully achieve its purpose: format (the visual format needs to be completely dropped), file structure (there should actually be two different formats, a flat XML and a package), deeper semantic definition, and better support for non-European and non-alphabetic languages (especially multiple writing systems and transliterations).

It is important that this format be able to represent all information commonly found in a dictionary, be it paper or electronic, and from any language to any language.

(Any XML markup in the following suggestions that is not currently in the XDXF standard is just a suggestion. I am in no way implying that it should be the final version.)


I. Format

What makes XDXF stand out when compared to other formats is the ability to describe a dictionary in a semantic format. That is what XDXF brings to the table that previous dictionary formats cannot compete with. A stardict dictionary converted to visual XDXF may still be technically an improvement, but it'll be barely noticeable and so it doesn't make much sense to go through the trouble of converting it, when that is the most supported format anyway. (The same is true about other visual formats.)

I propose that the visual format be dropped from the XDXF standard; dictionaries in the visual format should be considered obsolete and no longer supported. I understand that, at least in part, the reason for a visual format is that it allows almost seamless conversion from most other dictionary formats. Its discontinuation would make many-to-many converters much harder to write (if possible at all), as all information needs to be parsed. As I argued above, I don't believe being an easy target of conversion is worth much if there isn't a significant improvement of some form, either for the DS maintainers or for the users. I think it reasonable to suggest that DS keep support for the deprecated revision 33 as a way to keep supporting the visual format dictionaries that may be available in the wild.

It is not enough to mention that the visual format is not supported; it cannot be part of the most recent revision.

A beneficial side effect is that this will make the XML definition much clearer, as it won't be defining what in effect are two different formats.

II. File Structure

The structure seems very confusing. On the one hand, it seems to be trying to describe an XML format for a dictionary, in the classical meaning for a dictionary: a list of words/phrases with their corresponding description and, possibly, additional metadata. On the other, it describes what could be called a DS "dictionary package" with things that aren't actually traditionally part of a dictionary, like toolbar icons, images, sound files, a folder structure, etc.

I agree that it is reasonable to try to accommodate both interpretations of what a dictionary is and what an electronic dictionary should be, but a better solution would be to develop a standard for two different formats: (i) a "flat" XML dictionary format containing only the textual data that traditionally constitutes a dictionary and the metadata to describe it, written to a file with an identifying file name and .xdxf as extension; and (ii) a "dictionary package", possibly modeled after epub or opendocument. That is, essentially a zipped archive with an (XML) index file indicating the contents of the package, which must include one or more xdxf files (more than one allowed only when they're related, e.g. the Larousse English-French French-English dictionary would consist of two .xdxf files, the Oxford English Dictionary of only one). Icons, images, and other non-textual content should be part of the package, correctly arranged in folders and indicated in the index. An index of images, for example, would indicate their relative location (by default /images/), their file name and the words/phrases under which they should appear. Textual information that is not part of the dictionary per se but that is traditionally part of dictionaries can also be included in pre-defined XML-formatted files. More on that below.
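As a rough illustration of the package idea described above, the following Python sketch builds such an archive in memory with the standard zipfile module and reads its index back. The file names (index.xml, larousse_en-fr.xdxf, images/cat.png) and the index element and attribute names are assumptions made up for this example; none of them are part of the XDXF standard.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical index file: element and attribute names are illustrative only.
INDEX_XML = """<package>
  <dictionary file="larousse_en-fr.xdxf"/>
  <dictionary file="larousse_fr-en.xdxf"/>
  <image file="images/cat.png" headword="cat"/>
</package>"""


def build_package() -> bytes:
    """Write a tiny package (index, two dictionaries, one image) to a zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("index.xml", INDEX_XML)
        archive.writestr("larousse_en-fr.xdxf", "<xdxf/>")
        archive.writestr("larousse_fr-en.xdxf", "<xdxf/>")
        archive.writestr("images/cat.png", b"placeholder image bytes")
    return buffer.getvalue()


def list_dictionaries(package_bytes: bytes) -> list:
    """Return the .xdxf files named in the index that are really in the archive."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        index = ET.fromstring(archive.read("index.xml"))
        present = set(archive.namelist())
        return [d.get("file") for d in index.findall("dictionary")
                if d.get("file") in present]


dictionaries = list_dictionaries(build_package())
```

A DS could run a check like list_dictionaries when importing a package, so that an index entry pointing at a missing file is caught early.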

In fact, the current dict.xdxf XML file in a folder with a more or less defined name and optionally toolbar icons (for a simple dictionary) is an overly complicated, impractical structure that is not easy to implement. (Imagine any other common file types in a similar structure; MP3, PDF, DOC all with the same name in a folder with toolbar icons... who would want to use them?) In fact, notice how all DS that support XDXF will already gladly accept a simple .xdxf file regardless of its name, as that is much more intuitive. A clear name identifying the dictionary and its edition/version should be recommended for practical reasons, but is in effect unnecessary as the information is already in the metadata.

This format would also allow for including information which is commonly included with dictionaries (both paper and electronic). One important example is conjugation/declension tables; while these aren't part of key phrase definitions and shouldn't be part of the XDXF file itself, they are commonly included as part of dictionaries and should be represented in the XDXF package format. Conjugations should be included in an independent file following an XML conjugations standard (to be developed) and referenced in the index file. The DS can then place a button/link on the entries for which there is a conjugation table, which will display the properly formatted information. See the XML conjugation format of the French conjugation software Verbiste as an example of such a file.

The XML conjugations/declensions can also be used by the DS to recognize, for example, conjugated forms of verbs and display the correct entry (even indicating what form it is).
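To make the lookup idea concrete, here is a minimal Python sketch of how a DS might map an inflected form back to its entry. The in-memory table and its form labels are hypothetical stand-ins for whatever a future conjugation file would contain; this is not the Verbiste format.

```python
# Hypothetical conjugation data: lemma -> {form label: inflected form}.
CONJUGATIONS = {
    "aimer": {"present 1sg": "aime", "present 1pl": "aimons"},
    "finir": {"present 1sg": "finis", "present 1pl": "finissons"},
}


def build_form_index(tables):
    """Map every inflected form to the (lemma, form label) pairs it can represent."""
    index = {}
    for lemma, forms in tables.items():
        for label, form in forms.items():
            index.setdefault(form, []).append((lemma, label))
    return index


FORM_INDEX = build_form_index(CONJUGATIONS)


def resolve(word):
    """Return candidate (lemma, form label) pairs for a looked-up word."""
    return FORM_INDEX.get(word, [])
```

With this table, resolve("aimons") points the reader to the entry for "aimer" and says which form was typed; a word that is not an inflected form simply yields no candidates.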

Some "sub-dictionaries" should also be in their own XML file. For example, some dictionaries include a "name's dictionary" as an annex. The user should be allowed to enable or disable these kinds of "sub-dictionaries" in the DS.

Icons can only be recommended, not required, as they are in no way part of the dictionary. In fact no DS requires icons and few would make any use of them. The icon reference in the current revision seems tailor-made for GoldenDict, and the specifications for a standard shouldn't be intended for any specific DS. Icons should be supported, though, in the dictionary package and for the DS that do make use of them. They should be in the appropriate folder and need to be better defined: what format(s) can be used? Which sizes can/must be present? The icon metadata should also be present in the main index file (possibly unnecessary if defaults are used).

A beneficial side effect of the zipped package format is the enormous size reduction. As XML formats require a constant repetition of opening and closing tags, files are inflated significantly, an inflation that is greatly reduced in a zipped archive. A significant example: the CC-CEDICT dictionary, with 114,959 entries takes 8.4 MB in its original minimalist format; when converted to XDXF it takes 31.4 MB, an almost 4-fold increase in size! A zipped CC-CEDICT file takes 3.3 MB, and the zipped XDXF-converted file only 4.3 MB, a minimal increase over the original file size. In fact, DS should be recommended to import zipped flat XDXF files directly, even when not part of an XDXF package.
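The effect of tag repetition on compressibility can be checked with a short Python sketch. It uses synthetic articles rather than real CC-CEDICT data, and zlib's DEFLATE as a stand-in for a zip archive (zip archives commonly use the same algorithm), so the ratio it prints is only indicative; such uniform synthetic data compresses far better than a real dictionary would.

```python
import zlib

# Synthetic stand-in for a tag-heavy dictionary; not real CC-CEDICT data.
articles = "".join(
    f"<ar><k>word{i}</k><def><deftext>meaning {i}</deftext></def></ar>"
    for i in range(10_000)
)
raw = f"<xdxf><lexicon>{articles}</lexicon></xdxf>".encode("utf-8")

# Compress with DEFLATE at the highest level.
compressed = zlib.compress(raw, level=9)

ratio = len(raw) / len(compressed)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes, ratio: {ratio:.1f}x")
```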

III. XML Structure

1. Root Element

See format argument above.

2. <meta_info>

All elements should be clearly described.

3. Lexicon

IV. Other Comments on XDXF

About transliterations and writing systems: I don't think there is an ISO (or ISO-like) list of these systems; it would however be extremely useful to have an official list of allowed systems. This would make it clearer and easier for DS to handle them. A solution would be an official XDXF list for each of the two, with an official code for each system. It could be done by starting from the official ISO transliterations (that is, one per language) and the Unicode scripts (not the same as writing systems) and then adding others as appropriate.

Information Pages: To allow for the XDXF format to include all information that is traditionally part of a dictionary, I believe it's necessary to include a new element under the root element, something I would call "information pages", to allow for including things like introductions, prefaces, bibliographies, abbreviations, etc. All things that are normally part of a dictionary but aren't allowed in the XDXF standard yet. This element should allow for including the same style tags as the textual definitions for key phrases plus <h#> and <p>. The number of information pages should be very limited, this is not an ebook format.

XDXF Project

Some improvements need to happen with the XDXF project itself:

Related issues: #28, #6, #5

soshial commented 7 years ago

I would love to improve the format. I have several new ideas myself. If you would like to discuss all those privately (to avoid clutter here), PM me in Telegram (WhatsApp is less preferred). But let's discuss each step at a time, okay?

k-sl commented 7 years ago

I agree we should discuss each step at a time, I just didn't want to spam the repository opening a separate issue for each item. I don't use Telegram or WhatsApp, so if you want to discuss these issues privately we'll have to do it old school-style, by email. We can also do it here talking over one item at a time, open to anyone who might want to join in.

soshial commented 7 years ago

Let's separate each issue category into its own issue; I will open all the corresponding issues. Let's start with organizational issues.

I will be able to start working on this in around 2 weeks. What is your email, btw?

k-sl commented 7 years ago

Great, that sounds like a plan. Sorry, I thought you could see my email address on the github emails. Feel free to contact me: aaa2b9ed at opayq.com .

Since we're at it, let me point out two issues I forgot in the "thesis" above:

Sorry for adding more things here.

soshial commented 7 years ago

Hi, @k-sl, I have moved to another country and changed jobs, so I had a lot of stuff to do. I think I will have time in about a month to start working on the new XDXF. So let's keep in touch. Sorry for the long wait, but we will do it. I am counting on your advice!

k-sl commented 7 years ago

I'm leaving this message here so you don't think I disappeared. I didn't before because I was trying to keep this from becoming an IM chat log. You don't have to apologise for anything, you have a job and so do I, whenever you have the time it would be great to work on this and, as long as I can, I will help.

soshial commented 7 years ago

I. Format

I agree that the visual format brings confusion and creates the illusion that any format can be converted to XDXF. We should promote the main idea of convertibility from XDXF to many formats: the dictionary is stored in XDXF and then easily converted to any other format that is needed.

II. File structure

I am surprised that there is nothing about compressing the dictionary file. It should not be obligatory, but it should be encouraged. The disadvantage of compression is that the dictionary file has to be unpacked before the DS can use it, which takes time. For this reason we might use dictzip, which allows random access to word articles. But we need to check that putting several files into dictzip will work.

We also should provide an easy way to download a dictionary with/without all media files. As it was said:

In short, the main content (the dictionary itself) can/should be compressed with dictzip, and the media resources (images, audio, video) can/should be compressed with regular zip (but one needs to be careful about file name encoding in such a zip file).

If xdxf file is put into an archive, of course the archive file can be named more liberally than it is prescribed now.
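A minimal Python sketch of reading a compressed dictionary without unpacking it to disk is given below. It uses plain gzip from the standard library; dictzip output is designed to stay compatible with the gzip file format, but this sketch reads the stream sequentially and does not exercise dictzip's random access. The sample articles use a simplified structure chosen for the example.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Simplified sample content, compressed in memory for the demonstration.
SAMPLE = (
    "<xdxf><lexicon>"
    "<ar><k>cat</k><def><deftext>a small feline</deftext></def></ar>"
    "<ar><k>dog</k><def><deftext>a domestic canine</deftext></def></ar>"
    "</lexicon></xdxf>"
)


def iter_headwords(gz_bytes):
    """Yield headwords from gzip-compressed XDXF, one article at a time."""
    with gzip.open(io.BytesIO(gz_bytes), "rb") as stream:
        for _event, element in ET.iterparse(stream, events=("end",)):
            if element.tag == "ar":
                yield element.findtext("k")
                element.clear()  # release the article once it has been read


headwords = list(iter_headwords(gzip.compress(SAMPLE.encode("utf-8"))))
```

Because iterparse handles one article at a time, memory use stays low even for large files, which matters more for a one-time import than random access does.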

soshial commented 7 years ago

Internationalization

Speaking of transliteration/writing system/regionality, we can use the built-in XML attribute xml:lang, as recommended here. They prescribe using the BCP 47 standard, which includes these examples:

Does this standard cover your case? It looks quite promising to me. We might also tag each <def> or <k> with corresponding xml:lang, since we might also have multilingual dictionaries (e.g. English-Polish-Lithuanian-Latvian dictionary).
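As a sketch of what tagging each <def> with xml:lang could look like in practice, the Python snippet below groups the definitions of a multilingual article by language tag. The article is a made-up example; only the xml:lang attribute and its namespace come from the XML specification.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up multilingual article for illustration.
ARTICLE = """<ar>
  <k xml:lang="en">dictionary</k>
  <def xml:lang="pl"><deftext>słownik</deftext></def>
  <def xml:lang="lt"><deftext>žodynas</deftext></def>
  <def xml:lang="lv"><deftext>vārdnīca</deftext></def>
</ar>"""


def definitions_by_language(article_xml):
    """Group definition texts by the language tag on each <def>."""
    article = ET.fromstring(article_xml)
    return {d.get(XML_LANG): d.findtext("deftext") for d in article.findall("def")}


definitions = definitions_by_language(ARTICLE)
```

A DS could use such a mapping to show only the target languages the reader has enabled.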

k-sl commented 7 years ago

I.

We fully agree.

II.

I'm a professional translator but a complete amateur when it comes to the technical side, so all I can give is my amateurish opinion. Here are two reasons why dictzip might not be the best choice:

  1. Unlike XML, .zip, etc., dictzip is a much more obscure format that most DS creators might not be familiar with or willing to look into. As far as I can see it was used mostly for dict, which is quite obsolete in itself and might mean dictzip isn't so easy to support on some platforms. This could impact the adoption of XDXF, which at the moment is already quite low.
  2. At the moment most DS will import the XDXF file into their own internal format or database, which makes random access useless, as the file is only read once. Of course it would be a good thing to see XDXF-specific DS that read XDXF directly as a native format, but that is not really a problem right now. Maybe this is a problem for a later time?

What I would like to see is an epub-like format: simple, clear, transparent. Of course most html files in an epub are a couple hundred kB and not dozens of MB, which is why I understand your point. But I just don't think this is a huge problem at the moment.

With the kind of format I'm suggesting we can also have a modular approach. All archives can have a manifest file indicating what the archive consists of and what dictionary it belongs to (archives which aren't an actual dictionary can have a different extension, like .xdxe, for "xdxf extension"). Something like this:

<type>extension</type>
<dictionary>Oxford English Dictionary</dictionary>
<filename>oed.xdxf</filename>
<contents>img;sound</contents>
<index>
    <img>img.xml</img>
    <sound>sound.xml</sound>
</index>

The index files will have a list of the files, their type, an optional description, and the headword under which they should appear (preferably by ID). The actual dictionary manifest file would have <type>dictionary</type>, and the contents can indicate only a dictionary if the archive has no media, or a dictionary plus any media that is included in the same archive. People making the archive can decide what to include or not with the dictionary itself. That XML example is a mock-up; I'm not suggesting that should be the final format.
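For illustration, the mock-up manifest can be read with a few lines of Python. Since the mock-up has no single root element, the sketch wraps it in a hypothetical <manifest> element; that wrapper, like the mock-up itself, is an assumption and not a proposed final format.

```python
import xml.etree.ElementTree as ET

# The mock-up wrapped in a hypothetical <manifest> root so that it parses
# as a single XML document.
MANIFEST = """<manifest>
  <type>extension</type>
  <dictionary>Oxford English Dictionary</dictionary>
  <filename>oed.xdxf</filename>
  <contents>img;sound</contents>
  <index>
    <img>img.xml</img>
    <sound>sound.xml</sound>
  </index>
</manifest>"""


def read_manifest(manifest_xml):
    """Return the manifest fields plus any content type lacking an index file."""
    root = ET.fromstring(manifest_xml)
    contents = [c for c in root.findtext("contents", default="").split(";") if c]
    index = root.find("index")
    index_files = {} if index is None else {child.tag: child.text for child in index}
    return {
        "type": root.findtext("type"),
        "dictionary": root.findtext("dictionary"),
        "filename": root.findtext("filename"),
        "index_files": index_files,
        "missing": [c for c in contents if c not in index_files],
    }


manifest = read_manifest(MANIFEST)
```

The "missing" list lets a DS notice when the contents field announces a media type for which the archive names no index file.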

Internationalization

That is a great find! It can be used for the transliterations, alternative spellings and pronunciations. The languages in lang_from="XXX" and lang_to="XXX" should also use the same format, for consistency. So en instead of eng and zh instead of chi.

I'm not seeing some of the Japanese scripts I think should exist. It has kana, but not katakana and hiragana separated. No Kunrei-shiki or Nihon-shiki (unless they are under very strange names), which aren't commonly used in dictionaries but exist, Kunrei-shiki being the official government romanization. Modified Hepburn, which is the most common romanization in dictionaries shows as "Hepburn romanization, Library of Congress method" which is a very US-centric naming.

However, even if not 100% of systems are available (which would be impossible), they seem to have a working proposal submission system, so more can be added. And, as a last resort, the standard supports private-use tags, which we could use as an exception, if needed.

The question is whether to use the xml:lang attribute or to use more readable attributes while still using BCP 47 standard tags. For example:

<tl system="pinyin">Zhōngguó</tl>
<tl system="Bopo">ㄓㄨㄥ ㄍㄨㄛˊ</tl>
<tl system="wadegile">Chung1-kuo2</tl>

Is much more clear and human-readable when the language has already been defined as Chinese. However:

<tl xml:lang="zh-Latn-pinyin">Zhōngguó</tl>
<tl xml:lang="zh-Bopo">ㄓㄨㄥ ㄍㄨㄛˊ</tl>
<tl xml:lang="zh-Latn-wadegile">Chung1-kuo2</tl>

Is more canonical and makes for an easier DTD, but is harder for humans to read and is less clear, as it defines more than the system. The same is valid for the other sections where xml:lang is useful. The article you linked to discusses this issue. I'm putting this question forward but I don't think it is a huge issue. I like it clear but I wouldn't oppose either way.
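The two styles need not exclude each other: a DS could store the canonical tag and derive a short label for display. The Python sketch below shows one way to do that for simple tags of the form language[-Script][-variant...]. It is deliberately simplified and is not a full BCP 47 parser; the function names are made up for this example.

```python
def split_tag(tag):
    """Split a simple language[-Script][-variant...] tag into its parts.

    Deliberately simplified: extended language, region, extension and
    private-use subtags are not handled.
    """
    parts = tag.split("-")
    language, rest = parts[0].lower(), parts[1:]
    script = None
    # A four-letter alphabetic subtag right after the language is a script.
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        script = rest[0].title()
        rest = rest[1:]
    return {"language": language, "script": script,
            "variants": [v.lower() for v in rest]}


def system_label(tag):
    """Pick the most specific part as a short, human-readable system name."""
    parts = split_tag(tag)
    if parts["variants"]:
        return parts["variants"][-1]
    return parts["script"] or parts["language"]
```

For the three tags above this yields "pinyin", "Bopo" and "wadegile", which match the values used in the system attribute examples.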

nikita-moor commented 5 years ago

II. File Structure

Conjugations should be included in an independent XML conjugations standard file (to be developed) and referenced in the index file.

Why not create an independent dictionary with conjugation tables?

III. XML Structure

  1. Lexicon

An obvious example is Chinese: it is common to have all entries in both main variants, simplified and traditional Chinese, which, with the current format, means all entries are doubled in the DS

<k system="simplified">词典</k>
<k system="traditional">詞典</k>

You could produce two separate variants of the dictionary, one for Simplified script and another for Traditional. It's straightforward and does not require special handling by the dictionary shell.

k-sl commented 5 years ago

Why not create an independent dictionary with conjugation tables?

I'm sorry, I didn't understand what you mean by "independent dictionary with conjugation tables".

You could produce two separate variants of the dictionary, one for Simplified script and another for Traditional. It's straightforward and does not require special handling by the dictionary shell.

The dictionary needs to have both simplified and traditional Chinese headwords; you need to be able to look up a word in either of the two standards, regardless of which variant is used for the definitions. You also need to be able to see the characters used in the alternative standard when looking up a word. Your suggestion would mean all entries for which simplified and traditional characters are the same would be repeated and that, when looking up a word, the reader would have no way to know how the word is written in the other standard. Besides, most of what I'm describing already works fine in XDXF: I just add both <k> tags to each article in my Chinese dictionaries and I've been using them like this for years. The problem is there is no way to define which is which, something that should be defined semantically, so the DS can show which is which, hide one if the reader wants to do so, and show the preferred version first, in all Chinese dictionaries.

See, e.g. this example for Cross-strait Dictionary, the dictionary definition is in traditional Chinese, you need to be able to find it through both standards.
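A minimal Python sketch of how a DS might handle this is shown below. It assumes the system attribute on <k> discussed in this thread, which is a proposal and not part of the current standard; the two sample articles are made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the system attribute on <k> is only a proposal.
LEXICON = """<lexicon>
  <ar><k system="simplified">词典</k><k system="traditional">詞典</k>
      <def><deftext>dictionary</deftext></def></ar>
  <ar><k system="simplified">人</k><k system="traditional">人</k>
      <def><deftext>person</deftext></def></ar>
</lexicon>"""


def build_index(lexicon_xml):
    """Map each distinct headword to its article, keeping per-system spellings."""
    index = {}
    for article in ET.fromstring(lexicon_xml).findall("ar"):
        spellings = {k.get("system"): k.text for k in article.findall("k")}
        entry = {"spellings": spellings,
                 "definition": article.findtext("def/deftext")}
        # Identical spellings collapse into one key, so nothing is doubled.
        for headword in set(spellings.values()):
            index[headword] = entry
    return index


index = build_index(LEXICON)
```

Both 词典 and 詞典 lead to the same entry, the entry records which spelling belongs to which system, and 人 is indexed only once.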

nikita-moor commented 5 years ago

I think I am starting to understand your position: you want to add more semantic features to XDXF. However, it's not a semantic storage of lexical information but the final result. Compared to other existing formats, such as DSL (ABBYY Lingvo) or BGL (Babylon), XDXF separates content and styles, in the manner of HTML+CSS. It defines some level of semantics, but only with the aim of correct rendering.

Many of the features you are interested in could be implemented in the TEI format. It's more flexible but also more complicated. It would be wonderful if GoldenDict added support for the TEI format, with automatic XSLT transformation and CSS styles assigned to every dictionary independently. That would be the most powerful way, so dictionary compilers could define any additional elements and control how to show them.

Anyway, it's only my opinion; it would be better to hear from @soshial.

soshial commented 5 years ago

Hey @k-sl. I have awoken from a long slumber =D and I have finished organizational stuff: removed all converter code, its files.

  1. For changelog, usually people use this section on Github: https://github.com/soshial/xdxf_makedict/releases. Let's stick to that, okay? Then, I will delete this CHANGELOG file. Agreed?
  2. Renaming the repository is maybe not the best move at the moment, because XDXF files link to the DTD file in their schema declaration. Should we maybe start using revision numbers in the DTD schema URL?
  3. Listing software that supports XDXF is important, I think. Could you help me fill up the list HERE?

k-sl commented 5 years ago

Hi, @soshial, nice to see you active again. I myself don't currently have much time; much has happened in the meantime. However, I'd like to help as I am able to.

  1. The Github releases section is really meant for software, so you can have a summary of the changes when you share a new release. I don't think that is the most appropriate way to log changes in this project, which is not a piece of software. I would like any DS developer to be able to just open a plain text file and see every change he needs to make to support the most recent revision. Also to download, share, etc., which is made harder if you tie the project to the Github releases page. Essentially I want it to be as clear and easy as possible; we want to make implementing/updating XDXF support as simple as possible. I don't think there is a problem with also using the releases page, either for a summary of the changes or for a full list, but I think there should be a plain text file with a clear, detailed and extensive list of changes in the format.
  2. I don't think you need to rename this repository; you can leave it as is so any DS/tool reliant on it will keep working as before. But I would suggest starting a new repository "XDXF" (or similar) to use as the official repository from now on. Again, to make it easier for DS developers to implement it, we want to make it very clear this is a dictionary format that is independent of any tool or software and that the text in the repository is the official and up-to-date standard definition. A folder in the "xdxf_makedict" repository can create confusion. You could leave a note on this repository saying the makedict tool was discontinued and the official repo for the XDXF standard is soshial/xdxf, for example.
  3. I really only know of the ones I see @nikita-moor already mentioned in a separate issue. I also believe GoldenDict for Android doesn't support XDXF; Alpus does. QTranslate is a translator for Windows but claims to support lookup of XDXF dictionaries. I haven't used it, so I can't confirm.

soshial commented 5 years ago

In answer to the criticism of def and deftext, I created examples for you here: https://github.com/soshial/xdxf_makedict/issues/37

manfred4321 commented 3 years ago

You might want to know about this: there is a full-fledged dictionary exchange format, used mainly by linguistic software from SIL, and probably by many hundreds of linguists to create dictionaries. It's called LIFT, see https://github.com/sillsdev/lift-standard - unfortunately without a bridge to the world of dictionary programs like GoldenDict (this is what I really like about XDXF). The description alone is a 38-page document! But maybe it offers some inspiration for the future of XDXF? And I do hope that one day there will be a LIFT-XDXF converter...

soshial commented 2 years ago

I was thinking of removing the <opt> tag from the standard, because it's very inflexible.

Instead of the current <k><opt>the</opt> United States</k> I was thinking of using a sortby attribute like this: <k sortby="United States">the United States</k>. This attribute will ensure that the United States is sorted close to the words unity and united. This will help avoid a situation where thousands of articles that start with the are sorted/accumulated together.
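A small Python sketch of the intended sorting behaviour follows. The sortby attribute is the proposal under discussion, and case-insensitive comparison with casefold is an assumption made for this example; a real DS might use locale-aware collation instead.

```python
import xml.etree.ElementTree as ET

# Made-up sample; sortby is the attribute proposed above.
ARTICLES = """<lexicon>
  <ar><k>unity</k></ar>
  <ar><k sortby="United States">the United States</k></ar>
  <ar><k>united</k></ar>
  <ar><k>theory</k></ar>
</lexicon>"""


def sorted_headwords(lexicon_xml):
    """Sort headwords by their sortby key, falling back to the headword text."""
    keys = ET.fromstring(lexicon_xml).iter("k")
    return [k.text for k in
            sorted(keys, key=lambda k: k.get("sortby", k.text).casefold())]


headwords = sorted_headwords(ARTICLES)
```

Here "the United States" lands between "united" and "unity" rather than next to "theory".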

Does anyone agree/disagree?

soshial commented 2 years ago

By the way, I have updated the specification, taking into account some of your suggestions, @k-sl and @nikita-moor.

The main changes are: