ilius / pyglossary

A tool for converting dictionary files aka glossaries. Mainly to help use our offline glossaries in any Open Source dictionary we like on any modern operating system / device.
GNU General Public License v3.0

AppleDict to Stardict entry formatting #342

Closed hadingtid closed 2 years ago

hadingtid commented 2 years ago

I have converted an AppleDict binary to Stardict, and I am noticing considerable formatting differences between the Stardict entries and the original dictionary. The entries appear (superficially) to be plain text. However, links are still marked in blue and point to the original entities (rendering them useless at present), so I assume that the underlying html has been preserved. The difference is perhaps due to the conversion process leaving out the original css, which the Apple dictionaries make heavy use of. Compare the following screenshots, which illustrate the essential differences:

[Figure 1: AppleDict] [Figure 2: Stardict]

The entries are of course usable, but the readability would be greatly improved if something approaching the original layout could be preserved. I am not so concerned with the pretty formatting effects as I am with newlines for new word senses/subdefinitions, although I find the varying font weights do aid readability too. Of course, not all fonts have the several weight variants of Helvetica, so I am not sure whether a one-to-one conversion is possible. Nonetheless, I would be grateful for a solution. I am using the command line interface with the prompt dialogues on macOS Catalina.

Cheers

hadingtid commented 2 years ago

To avoid a takedown request and since you can't type Russian anyway, I've instead put together a mock-up file that should allow you to test the issue at play.

I suspect this could be easily solved by incorporating the DefaultStyle.css file into the conversion process (e.g. by allowing the user to specify a file path), since that is what gives the Apple dictionary its essential structure. To illustrate that point, I've included two screenshots of the html file* with and without the css, which show pretty much the same differences that I am seeing in Goldendict.

*Since AppleDict binaries can only be converted on a Mac (and I've seen you say previously you don't have access to one), I've followed the structure of the exported xml data (as would happen if using the 0dedict.py script, for example) and then wrapped it in html tags for display purposes. The .tab file used for conversion contains the same essential data, plus initial metadata and duplicated headwords. The same basic process is also necessary to split the bidirectional bilingual dictionaries that come packaged on Mac (e.g. en-ru-en) into separate one-directional files (i.e. en-ru & ru-en).

Attachments: Source.zip, Stardict.zip. Screenshots: [With css] [Without css]

Let me know how you get on with that.

ilius commented 2 years ago

We do try to read the DefaultStyle.css file from the AppleDict binary glossary, next to the Contents directory, and put it as res/style.css inside the StarDict glossary.

I can't be sure, but I think GoldenDict might recognize css files in res directory and apply them.

Is this res/style.css created inside your output StarDict glossary?
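
A quick way to answer this is a minimal Python sketch, assuming the StarDict output is a directory holding the .ifo file with resources in a res subdirectory beside it, as described above. The function name `has_style_css` and the example path are hypothetical and not part of PyGlossary.

```python
from pathlib import Path


def has_style_css(ifo_path: str) -> bool:
    """Return True if res/style.css exists next to the given .ifo file."""
    # Assumption: resources live in a "res" directory beside the .ifo file.
    return (Path(ifo_path).parent / "res" / "style.css").is_file()
```

For example, `has_style_css("output/mydict.ifo")` would report whether the stylesheet was written during conversion.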

hadingtid commented 2 years ago

Ok. Perhaps that is the first part of the problem, since I first have to convert the binary to tabfile in order to split it into two dictionaries, which keeps all the data but loses the auxiliary files. So it would be ideal to allow detection/specification of the DefaultStyle.css in the active directory (so in my case the main pyglossary directory) to be read together with the other files at conversion. Is this possible to implement? In any case, res/style.css has not been created for any of my conversions (direct AppleDict --> Stardict or via the intermediary .tab file splitting process), and creating the resource manually doesn't seem to have any effect in Goldendict either.

This is the original directory structure: [image: Dictionary Directory]

The .css is nested under Contents/Resources/DefaultStyle.css within the .dictionary bundle: [image: Dictionary internal directory]

On a side note, DefaultStyle.css references @namespace d url(http://www.apple.com/DTDs/DictionaryService-1.0.rng). In this thread you mention that the script removes the d:title tag during AppleDict conversion (though presumably not from .tab files). Out of curiosity, could this conceivably be causing problems here?

ilius commented 2 years ago

> since I first have to convert the binary to tabfile in order to split it into two dictionaries, which keeps all the data but loses the auxiliary files. So it would be ideal to allow detection/specification of the DefaultStyle.css in the active directory (so in my case the main pyglossary directory) to be read together with the other files at conversion. Is this possible to implement?

When you convert to test.txt for example, resource files are created in test.txt_res directory, which is then read when you convert test.txt to something else.

But when you split the Tabfile, they are not duplicated. If it creates 100 parts, for example, creating 100 copies of the data files would be dumb! And Windows doesn't support symbolic links. It's also hard to determine which resources are used in each part when we split. So you have to manage the resources yourself (some symlinks to the test.txt_res directory would solve the problem) or just copy them to the final StarDict glossary yourself.
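
Copying the resources by hand could look like the following minimal Python sketch. It assumes the resource directory is named like test.txt_res and that the StarDict glossary expects its resources in a res subdirectory, as mentioned earlier in this thread. The helper name `copy_resources` is hypothetical and not part of PyGlossary.

```python
import shutil
from pathlib import Path


def copy_resources(src_res_dir: str, stardict_dir: str) -> Path:
    """Copy every file from a *_res directory into <stardict_dir>/res."""
    dest = Path(stardict_dir) / "res"
    # dirs_exist_ok (Python 3.8+) merges into an already existing res directory.
    shutil.copytree(src_res_dir, dest, dirs_exist_ok=True)
    return dest
```

Calling `copy_resources("test.txt_res", "my-stardict")` once per split part keeps the copies explicit, at the cost of duplicated files.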

> you mention that the script removes the d:title tag during AppleDict conversion (though presumably not from .tab files). Out of curiosity, could this conceivably be causing problems here?

Yes, that's probably it!

ilius commented 2 years ago

The css file needs to be modified. Doing it in PyGlossary is probably too tricky.

First please try to get it working by modifying the css in the StarDict glossary.

hadingtid commented 2 years ago

I've sent you the file to test. After examining the .dict files again, it appears the d:title tags are still present, both when converted via tabfile and when converted directly from AppleDict. So I don't understand why the .css needs modifying if the tags stay the same - could you suggest how you think it should be changed? Would the .css stylesheet need to be called within each definition?

When converted from AppleDict to tabfile, res/style.css is created (I mustn't have checked resource setting previously), however when I later convert the tabfile to Stardict (even without changing or splitting the file), the resources folder is not used/duplicated. Manually adding it does not work either.

ilius commented 2 years ago

> however when I later convert the tabfile to Stardict (even without changing or splitting the file), the resources folder is not used/duplicated.

I just tested this to be sure, and it works. Can you give me the full command line / console output?

ilius commented 2 years ago

Okay, so this could be a small part of the problem: there is no body or html tag in entries.

body
{
    font-size: 12pt;
    font: -apple-system-body;
    font-family: -apple-system;
    margin-left: 0.9em;
    margin-right: 0.9em;
    margin-top: 1.0em;
    margin-bottom: 1.5em;

    color: text;
}
html.apple_client-panel body
{
    margin-top: 0em;
}

But the main problem, I guess, is the lack of newlines and indentation (for the numbering). I have prettified the entry for ящурный (the last entry):

<d:entry xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng" id="r_36601" d:title="ящурный" class="entry">

<span class="hwg x_xh0">
    <span d:dhw="1" role="text" manifest="я́щурный" class="hw">я́щурный </span>
</span>
<span lexid="b-ru-en0051374.001" class="gramb x_xd0">
    <span class="ps x_xdh">adjective </span>
    <span lexid="b-ru-en0051374.002" class="semb x_xd1 hasSn">
        <span class="gp x_xdh sn ty_label tg_semb">1 </span>
        <span class="gr x_xd2">adj of </span>
        <span class="xrg x_xd2">
            <span class="xr">
                <a href="x-dictionary:r:r_DWS-015748:com.apple.dictionary.OxfordRussian"
                        title="я́щур">ящур </a>
            </span>
        </span>
    </span>
    <span lexid="b-ru-en0051374.003" class="semb x_xd1 hasSn">
        <span class="gp x_xdh sn ty_label tg_semb">2 </span>
        <span class="trg x_xd2">
            <span class="trans">infected with foot-and-mouth disease </span>
        </span>
    </span>
</span>
</d:entry>

span tags are normally rendered without anything (like a newline) separating them in the output, unlike div or p. But there are a bunch of span rules in the CSS, so I guess this glossary is changing this behavior with CSS.

This is very ugly and non-standard. I'm not surprised it doesn't work in GoldenDict. Because GoldenDict is not a browser.

For lists and numbering, we should use the ol tag (and the ul tag for bulleted lists), and div and p for paragraphs and sub-sections.

ilius commented 2 years ago

I would suggest you convert to Aard 2 (.slob) format and try Aard2 for Web, because it leaves HTML rendering to your browser.

Or find a new glossary, like FreeDict's eng-rus.

hadingtid commented 2 years ago

> there is no body or html tag in entries.

This solved it, thanks. I had suspected as much (after all, wrapping those samples as html worked perfectly) but didn't know where to put the tags. So I created a test glossary with every entry wrapped in <!DOCTYPE html><html><head><link rel="stylesheet" href="DefaultStyle.css"></head><body> and </body></html>. I ignored prettification on the assumption newlines will confuse the tabfile conversion. This is the result (again, sample word такой), looks pretty similar to the original and most importantly is far easier to read:

[image: Screenshot 2021-12-01 at 12 06 54]

> This is very ugly and non-standard. I'm not surprised it doesn't work in GoldenDict. Because GoldenDict is not a browser.

None of this seems to be particularly relevant here. I agree it's not 100% ideal (and Goldendict still seems to be overriding several font parameters), however AppleDict has some of the best glossaries out there so I think it's worth ironing them out to at least display in a readable form. Although the stylesheets do vary slightly between dictionaries, as far as I can tell, the underlying entry format is more or less the same so this html wrapping approach should work on all the standard AppleDict binaries. It was a 2 second find and replace, so I imagine it would be relatively straightforward to automate this step at conversion time. Would it be possible to implement this functionality as part of pyglossary?

ilius commented 2 years ago

What did you search & replace?

hadingtid commented 2 years ago

> What did you search & replace?

I figured the most reliable way for my situation would be to find <d:entry and </d:entry> and replace with <!DOCTYPE html><html><head><link rel="stylesheet" href="DefaultStyle.css"></head><body><d:entry and </d:entry></body></html> respectively. I wasn't sure how to implement this in script form, so I just used a text editor.
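
In script form, the same find and replace could be sketched in Python roughly as follows. This is only an illustration of the text-editor step described above, assuming the whole file fits in memory as one string. The names `HTML_PREFIX`, `HTML_SUFFIX` and `wrap_entries` are hypothetical and not part of PyGlossary.

```python
# Wrapper strings taken from the find and replace described above.
HTML_PREFIX = (
    '<!DOCTYPE html><html><head>'
    '<link rel="stylesheet" href="DefaultStyle.css">'
    '</head><body>'
)
HTML_SUFFIX = "</body></html>"


def wrap_entries(text: str) -> str:
    """Wrap every <d:entry ...>...</d:entry> in a full html document."""
    # Put the html/head/body opening in front of each entry start tag.
    text = text.replace("<d:entry", HTML_PREFIX + "<d:entry")
    # Close body and html right after each entry end tag.
    return text.replace("</d:entry>", "</d:entry>" + HTML_SUFFIX)
```

Reading the tabfile, passing its contents through `wrap_entries`, and writing the result back would reproduce the manual edit.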

ilius commented 2 years ago

I pushed a commit. You can try again with --read-options 'html_full=True' flag.

soshial commented 1 year ago

With all AppleDict dictionaries (that I converted to Aard2Slob) there were problems with style, because html_full setting was False. Can we maybe make it True by default? @ilius

This is what it looks like when it's False: [image: Screenshot_20230307_080031_Aard 2]

ilius commented 1 year ago

Not all dictionary applications support a full html page/document. I don't think most of them do.