Closed hadingtid closed 2 years ago
To avoid a takedown request and since you can't type Russian anyway, I've instead put together a mock-up file that should allow you to test the issue at play.
I suspect this could be easily solved by incorporating the DefaultStyle.css file into the conversion process (e.g. by allowing the user to specify a file path), since that is what gives the Apple dictionary its essential structure. To illustrate that point, I've included two screenshots of the html file* with and without the css, which show much the same differences that I am seeing in Goldendict.
*Since AppleDict binaries can only be converted on a Mac (and I've seen you say previously you don't have access to one), I've followed the structure of the exported xml data (as would happen if using the 0dedict.py script, for example) and then wrapped it in html tags for display purposes. The .tab file used for conversion contains the same essential data, plus initial metadata and duplicated headwords. The same basic process is also necessary to split the bidirectional bilingual dictionaries that come packaged on Mac (e.g. en-ru-en) into separate one-directional files (i.e. en-ru & ru-en).
Let me know how you get on with that.
We do try to read the DefaultStyle.css file from the AppleDict binary glossary, next to the Contents directory, and put it as res/style.css inside the StarDict glossary.
I can't be sure, but I think GoldenDict might recognize css files in the res directory and apply them.
Is this res/style.css created inside your output StarDict glossary?
Ok. Perhaps that is the first part of the problem, since I first have to convert the binary to tabfile in order to split it into two dictionaries, which keeps all the data but loses the auxiliary files. So it would be ideal to allow detection/specification of the DefaultStyle.css in the active directory (so in my case the main pyglossary directory) to be read together with the other files at conversion. Is this possible to implement? In any case, res/style.css has not been created for any of my conversions (direct AppleDict --> Stardict or via the intermediary .tab file splitting process), and creating the resource manually doesn't seem to have any effect in Goldendict either.
This is the original directory structure:
The .css is nested under Contents/Resources/DefaultStyle.css within the .dictionary bundle:
On a side note, DefaultStyle.css references @namespace d url(http://www.apple.com/DTDs/DictionaryService-1.0.rng). In this thread you mention that the script removes the d:title tag during AppleDict conversion (though presumably not from .tab files). Out of curiosity, could this conceivably be causing problems here?
> since I first have to convert the binary to tabfile in order to split it into two dictionaries, which keeps all the data but loses the auxiliary files. So it would be ideal to allow detection/specification of the DefaultStyle.css in the active directory (so in my case the main pyglossary directory) to be read together with the other files at conversion. Is this possible to implement?
When you convert to test.txt, for example, resource files are created in the test.txt_res directory, which is then read when you convert test.txt to something else.
But when you split the Tabfile, they are not duplicated. If it creates 100 parts, for example, creating 100 copies of the data files would be dumb! And Windows doesn't support symbolic links.
It's also hard to determine which resources are used in each part when we split.
So you have to manage the resources yourself (some symlinks to the test.txt_res directory would solve the problem) or just copy them to the final StarDict yourself.
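Copying the resources by hand can be scripted. The following is only a minimal sketch, not part of PyGlossary: the function name `copy_resources` and the example directory names are hypothetical, and it assumes the resources live in a `test.txt_res`-style directory and belong in a `res` directory inside the StarDict glossary folder, as described above. Whether GoldenDict then applies the stylesheet is a separate question.

```python
import shutil
from pathlib import Path


def copy_resources(src_res: Path, stardict_dir: Path) -> list[Path]:
    """Copy every regular file from src_res into stardict_dir/res.

    Hypothetical helper: src_res is something like test.txt_res,
    stardict_dir is the folder holding the .ifo/.idx/.dict files.
    Returns the list of destination paths that were written.
    """
    dest = stardict_dir / "res"
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for item in sorted(src_res.iterdir()):
        if item.is_file():  # subdirectories are skipped in this sketch
            copied.append(Path(shutil.copy2(item, dest / item.name)))
    return copied


if __name__ == "__main__":
    # Example paths only; adjust to your own layout.
    src = Path("test.txt_res")
    if src.is_dir():
        for path in copy_resources(src, Path("my-stardict")):
            print("copied", path)
```

Copying rather than symlinking keeps the sketch portable, at the cost of duplicating the files for each split part.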
> you mention that the script removes the d:title tag during AppleDict conversion (though presumably not from .tab files). Out of curiosity, could this conceivably be causing problems here?
Yes, that's probably it!
The css file needs to be modified. Doing it in PyGlossary is probably too tricky.
First please try to get it working by modifying the css in the StarDict glossary.
I've sent you the file to test. After examining the .dict files again, it appears the d:title tags are still present, both when converted via tabfile and when converted directly from AppleDict. So I don't understand why the .css needs modifying if the tags stay the same - could you suggest how you think it should be changed? Would the .css stylesheet need to be called within each definition?
When converted from AppleDict to tabfile, res/style.css is created (I mustn't have checked resource setting previously), however when I later convert the tabfile to Stardict (even without changing or splitting the file), the resources folder is not used/duplicated. Manually adding it does not work either.
> however when I later convert the tabfile to Stardict (even without changing or splitting the file), the resources folder is not used/duplicated.
I just tested this to be sure, and it works. Can you give me the full command line / console output?
Okay, so this could be a small part of the problem: there is no body or html tag in entries.
body
{
font-size: 12pt;
font: -apple-system-body;
font-family: -apple-system;
margin-left: 0.9em;
margin-right: 0.9em;
margin-top: 1.0em;
margin-bottom: 1.5em;
color: text;
}
html.apple_client-panel body
{
margin-top: 0em;
}
But the main problem, I guess, is the lack of newlines and indentation (for the numbering). I have prettified the entry for ящурный (the last entry):
<d:entry xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng" id="r_36601" d:title="ящурный" class="entry">
<span class="hwg x_xh0">
<span d:dhw="1" role="text" manifest="я́щурный" class="hw">я́щурный </span>
</span>
<span lexid="b-ru-en0051374.001" class="gramb x_xd0">
<span class="ps x_xdh">adjective </span>
<span lexid="b-ru-en0051374.002" class="semb x_xd1 hasSn">
<span class="gp x_xdh sn ty_label tg_semb">1 </span>
<span class="gr x_xd2">adj of </span>
<span class="xrg x_xd2"><span class="xr">
<a href="x-dictionary:r:r_DWS-015748:com.apple.dictionary.OxfordRussian"
title="я́щур">ящур </a>
</span>
</span></span>
<span lexid="b-ru-en0051374.003" class="semb x_xd1 hasSn">
<span class="gp x_xdh sn ty_label tg_semb">2 </span>
<span class="trg x_xd2"><span class="trans">infected with foot-and-mouth disease </span>
</span>
</span>
</span>
</d:entry>
span tags are normally rendered without anything (like a newline) separating them in the output, unlike div or p. But there are a bunch of span rules in the CSS, so I guess this glossary is changing this behavior with CSS.
This is very ugly and non-standard. I'm not surprised it doesn't work in GoldenDict. Because GoldenDict is not a browser.
For lists and numbering, we should use the ol tag (and the ul tag for bulleted lists), and div and p for paragraphs and sub-sections.
I would suggest you convert to Aard 2 (.slob) format and try Aard2 for Web, because it leaves HTML rendering to your browser.
Or find a new glossary, like FreeDict's eng-rus.
> there is no body or html tag in entries.
This solved it, thanks. I had suspected as much (after all, wrapping those samples as html worked perfectly) but didn't know where to put the tags. So I created a test glossary with every entry wrapped in <!DOCTYPE html><html><head><link rel="stylesheet" href="DefaultStyle.css"></head><body> and </body></html>. I ignored prettification on the assumption that newlines would confuse the tabfile conversion. This is the result (again, sample word такой); it looks pretty similar to the original and, most importantly, is far easier to read:
> This is very ugly and non-standard. I'm not surprised it doesn't work in GoldenDict. Because GoldenDict is not a browser.
None of this seems to be particularly relevant here. I agree it's not 100% ideal (and Goldendict still seems to be overriding several font parameters), however AppleDict has some of the best glossaries out there so I think it's worth ironing them out to at least display in a readable form. Although the stylesheets do vary slightly between dictionaries, as far as I can tell, the underlying entry format is more or less the same so this html wrapping approach should work on all the standard AppleDict binaries. It was a 2 second find and replace, so I imagine it would be relatively straightforward to automate this step at conversion time. Would it be possible to implement this functionality as part of pyglossary?
What did you search & replace?
> What did you search & replace?
I figured the most reliable way for my situation would be to find <d:entry and </d:entry> and replace them with <!DOCTYPE html><html><head><link rel="stylesheet" href="DefaultStyle.css"></head><body><d:entry and </d:entry></body></html> respectively. I wasn't sure how to implement this in script form, so I just used a text editor.
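For reference, the same find and replace could be scripted roughly as follows. This is only a sketch of the manual step described above, not how PyGlossary implements it; the names `wrap_entries` and `wrap_file` and the example file names are hypothetical, and it assumes a UTF-8 tabfile in which each entry sits on a single line.

```python
DOC_HEAD = (
    "<!DOCTYPE html><html><head>"
    '<link rel="stylesheet" href="DefaultStyle.css">'
    "</head><body>"
)
DOC_TAIL = "</body></html>"


def wrap_entries(text: str) -> str:
    """Wrap each <d:entry ...>...</d:entry> in a full HTML document.

    Plain string replacement, mirroring the text-editor approach.
    "<d:entry" never matches the closing tag, which starts with "</".
    """
    text = text.replace("<d:entry", DOC_HEAD + "<d:entry")
    return text.replace("</d:entry>", "</d:entry>" + DOC_TAIL)


def wrap_file(src_path: str, dst_path: str) -> None:
    """Apply wrap_entries line by line so large glossaries fit in memory."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(wrap_entries(line))


if __name__ == "__main__":
    sample = 'word\t<d:entry id="x" class="entry">definition</d:entry>\n'
    print(wrap_entries(sample), end="")
    # To process a real file (example names only):
    # wrap_file("glossary.txt", "glossary-wrapped.txt")
```

Running it twice on the same file would wrap the entries twice, so it is meant to be applied once to a fresh export.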
I pushed a commit.
You can try again with the --read-options 'html_full=True' flag.
With all the AppleDict dictionaries that I converted to Aard2 Slob there were problems with style, because the html_full setting was False. Can we maybe make it True by default? @ilius
This is what it looks like when it's False:
Not all dictionary applications support a full html page/document. I don't think most of them do.
I have converted an AppleDict binary to Stardict. I am noticing considerable formatting differences in the Stardict entries as compared with original dictionary. The entries appear (superficially) to be plain text, however as links are still marked in blue and point to the original entities (rendering them useless at present) I assume that the underlying html has been preserved and so it is perhaps due to the absence from the conversion process of the original css which the apple dictionaries make heavy use of. Compare the following screenshots which illustrate the essential differences:
The entries are of course useable but the readability would be greatly improved if something approaching the original layout could be preserved. I am not so concerned with the pretty formatting effects as I am with newlines for new word senses/subdefinitions, however I find the varying font weights do aid readability too. Of course, not all fonts have the several weight variants of Helvetica so I am not sure whether a 1 to 1 conversion is possible. Nonetheless, I would be grateful for a solution. I am using the command line interface with the prompt dialogues on Mac Catalina.
Cheers