open-dsl-dict / wiktionary-dict

Offline bilingual dictionaries made using data from Wiktionary

Missing abbreviations #1

Open LRN opened 5 years ago

LRN commented 5 years ago

These dictionaries have sequences that look like [p]<n>[/p] or [p]<v>[/p] and so on. Those are references to other dictionary cards named <n> and <v>, but such cards do not exist. Normally, a DSL dictionary is accompanied by a small dictionary that contains all the abbreviations the main dictionary uses. It has the same name as the main dictionary, but with an _abrv suffix (i.e. en-arb-enwikionary_abrv.dsl), and is usually uncompressed. I would suggest creating such a dictionary. How to make sure the dictionary viewer sees the dictionary and uses it is another matter; creating multiple differently-named copies of the same dictionary (to follow the _abrv naming convention) seems impractical.

dohliam commented 5 years ago

The abbreviations wrapped in <> were not actually meant to be links (you may be thinking of double angle brackets <<>> which create hyperlinks to other cards in DSL dictionaries).

If you look at the source dictionaries that these were converted from, you will see that these sequences were originally wrapped in {}, which would have caused problems for the DSL format, so they were changed to <>, which as far as I know is unspecified.

The idea of adding an abbreviation dictionary is interesting but to my knowledge there is no such list of abbreviations included with the original dictionaries, so someone would have to compile it from scratch.

Apart from the issues you mentioned (I haven't seen any dictionaries that use this format, so I'm not sure how widely it is supported), it seems that the abbreviations used in the dictionaries are all relatively self-explanatory: <n> for noun, <v> for verb, etc. So it may not be necessary to go to the trouble of creating an abbreviation file in any case.

LRN commented 5 years ago

Um...no. I am aware of the <<reference_word>> links (which are functionally equivalent to [ref]reference_word[/ref] links), and they are not the problem. The problem is in [p]reference_word[/p] sequences, such as [p]<n>[/p]. These are called "labels" in DSL parlance. In a DSL viewer a label is a piece of text that shows a tooltip when the user hovers a mouse pointer over it, and the tooltip contains the body of the DSL card (from the abbreviation dictionary) the headword for which is in the [p][/p] tag. That is, for [p]<n>[/p] it would display the text that consists of <n> and has a tooltip that says noun or something. From your explanation it seems that < and > shouldn't really be there either, if the original source contained {n}, meaning a reference to n.

AFAIU, this concept harks back to dictionaries printed on pieces of dead trees, where space was too precious to spell "noun" for every noun, thus the abbreviations (and an abbreviation index somewhere at the end). In practice, DSL dictionaries contain more than n, v, and adj - they also have amer., coll., sport. and other kinds of labels. Thus an abbreviation dictionary becomes a necessity.
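To make the label idea above concrete, here is a minimal Python sketch that collects every label used inside [p]...[/p] tags and renders abbreviation cards for them. The function names, the sample cards, and the expansions are made up for illustration, and the card layout (headword on its own line, tab-indented body) is assumed to be the same as in an ordinary DSL dictionary; a real abbreviation file would also need the usual DSL header lines.

```python
import re
from collections import Counter

# Matches the text between [p] and [/p] on a single line.
LABEL_RE = re.compile(r"\[p\](.*?)\[/p\]")


def collect_labels(dsl_text):
    """Count every distinct label found inside [p]...[/p] tags."""
    return Counter(LABEL_RE.findall(dsl_text))


def render_abbreviation_cards(expansions):
    """Render one card per label: headword line, then a tab-indented body."""
    lines = []
    for label in sorted(expansions):
        lines.append(label)
        lines.append("\t" + expansions[label])
    return "\n".join(lines) + "\n"


# Hypothetical sample cards, not taken from the real dictionaries.
sample = (
    "cat\n"
    "\t[p]<n>[/p] a small domesticated feline\n"
    "run\n"
    "\t[p]<v>[/p] to move quickly\n"
    "dog\n"
    "\t[p]<n>[/p] a domesticated canine\n"
)

labels = collect_labels(sample)
# The expansions have to be written by hand; only the label list is automatic.
cards = render_abbreviation_cards({"<n>": "noun", "<v>": "verb"})
```

Running collect_labels over a whole dictionary gives the full list of labels that would need a card, which is the part that would otherwise have to be compiled by reading the dictionary.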

The common source of DSL dictionaries is LSD dictionaries (i.e. DSL dictionaries compiled to binary format). They can be decompiled (there seems to be software for that). That process (and the distribution of the decompiled dictionaries) is of questionable legality, which is probably why you haven't seen these dictionaries just floating around publicly on the web. Other than that, people just convert other dictionaries to DSL when needed, i.e. when the dictionary viewer they are using understands DSL but does not understand the format a dictionary is in. GoldenDict, for example, is a popular viewer that understands DSL and a few other formats, but does not know how to read things like ".mobi" dictionary books. DSL is, essentially, a big text file (and a few small text files; compression is completely optional), and thus easy to convert to.
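As a rough illustration of "a big text file with optional compression", the Python sketch below reads either a plain .dsl file or a compressed .dsl.dz one. It assumes the .dz file is gzip-compatible (as dictzip output is) and that the text is either UTF-16 with a byte-order mark or UTF-8; read_dsl_text is a hypothetical helper name, not part of any existing tool.

```python
import gzip


def read_dsl_text(path):
    """Read a .dsl or .dsl.dz file and return its contents as a string."""
    # Assumption: a .dz file can be read as ordinary gzip.
    opener = gzip.open if path.endswith(".dz") else open
    with opener(path, "rb") as handle:
        raw = handle.read()
    # Pick the decoding from the byte-order mark; fall back to UTF-8.
    if raw.startswith(b"\xff\xfe") or raw.startswith(b"\xfe\xff"):
        return raw.decode("utf-16")
    return raw.decode("utf-8-sig")
```

Once the text is in memory, converting to or from DSL is ordinary string processing.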

dohliam commented 5 years ago

Thanks for the additional info about references using <>. I think I see now what the problem is.

I'm not sure if this is a new addition to the DSL format or simply something that, as you suggested, is mainly found in binary/commercial dictionaries. These dictionaries were compiled more than 4 years ago, and I don't have access to any commercial dictionaries to compare with. Instead, my reference point for the DSL format has been the sample.dsl file provided by the GoldenDict project.

In any case, the <> syntax is not recognized by the viewer that I use (also GoldenDict), so this has not been a problem so far. The angled brackets were chosen as a way of separating the abbreviations from the rest of the text, not linking them, and this works quite well in this context. I agree of course that for other viewers that do recognize this syntax, it would probably be better to search and replace these with regular brackets (or perhaps escaped square brackets or something else). As you probably know, the dsl.dz compressed format can be easily decompressed using dictunzip and then recompressed with dictzip if desired, for easy editing of individual dictionary files.
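For anyone who wants to do that search and replace on an already decompressed file, a minimal Python sketch could look like the following. The regular expression and the function name are my own illustration rather than anything taken from wikt2dsl; it swaps single angle brackets for round ones and is written to leave <<...>> links alone.

```python
import re

# A single <...> pair with no whitespace inside, not part of a <<...>> link.
ANGLE_LABEL_RE = re.compile(r"(?<!<)<([^<>\s]+)>(?!>)")


def replace_angle_labels(text, left="(", right=")"):
    """Replace <abbr> with (abbr), leaving <<link>> sequences untouched."""
    return ANGLE_LABEL_RE.sub(lambda m: left + m.group(1) + right, text)


# Hypothetical card body for illustration.
converted = replace_angle_labels("[p]<n>[/p] see also <<cat>>")
```

The left and right parameters make it easy to try other delimiters; whether a given viewer treats the result differently would need to be checked in that viewer.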

If you would like to regenerate all of the dictionaries using round/square/other brackets from source you can do so quite easily by modifying the wikt2dsl script and running it on the source directory in this repository.

Note that the original source of these dictionaries is a script that extracts them from the freely-licensed Wiktionary project rather than any collection of existing DSL dictionaries, so the {a.}, {v.} etc. notation is just an artifact of that process rather than a reference or link to other cards or information.