pgaskin / dictutil

Tools, documentation, and libraries related to Kobo dictionaries.
https://pgaskin.net/dictutil
MIT License

Multiple headwords as separate entries? #19

Closed. thehijacker closed this issue 2 years ago.

thehijacker commented 2 years ago

Hello,

I have made my own dictionary in df format and converted it to a Kobo zip file using PyGlossary. Its author suggested I ask here about the valid entry format that a Kobo reader can process. Original thread:

https://github.com/ilius/pyglossary/issues/356#issuecomment-1023167507

Almost all my df entries look like:

@ entry_name
: \pronunciation\
& alias1
& alias2
<html><i>noun</i> here comes the description

But some can have different meanings:

@ name1
: \aaaa\
<html><i>noun</i> description

@ name2
: \aaaa\
<html><i>adjective</i> different description

Kobo doesn't seem to find these when I look up the word. What is the proper way to define this so Kobo can show both entries? Should I merge them somehow? Or add an alias (&) to them?

I also have some headwords with optional characters inside parentheses. I'm not sure if this works.

@ backward(s)
: \baekwəd(z)\
<html><i>adverb</i> description...

Thank you for all the suggestions.

pgaskin commented 2 years ago

Have you tried this with dictgen instead of PyGlossary? If I understand what you're asking, it should work fine.

Regarding backward(s), you'll have to use variants to match all forms of the word (also note that Kobo does more than just exact matches).

thehijacker commented 2 years ago

I apologize in advance. I did not know df is your "standard" :). I was googling how to make my own dictionary and went from penelope to PyGlossary. On your web page I only looked at how the df format looks.

I tried dictgen and found that my df file still had some issues that PyGlossary accepted but dictgen did not. After I fixed them, I got what seems to be a valid zip file.

c:\Temp\eBooks\dictgen>dictgen-windows.exe en-sl.df
Parsing dictfiles.
Error: input "en-sl.df": dictfile: line 10677: no word after variant specifier (&).

c:\Temp\eBooks\dictgen>dictgen-windows.exe en-sl.df
Parsing dictfiles.
Error: input "en-sl.df": dictfile: line 51548: header info (: or ::) specified within definition content (prepend a space if this was intended to be part of the definition itself).

c:\Temp\eBooks\dictgen>dictgen-windows.exe en-sl.df
Parsing dictfiles.
<html><w><p><a name="A" /><b>A</b></p><var></var><i>abbreviation</i> <i>chemistry</i> argon; absolute; Academy; America(n)</w></html><nil>
Opening output.
Generating dictzip.
  Using image method: optimize and encode as base64 data URL (max_width=1000, max_height=1000, grayscale=false, jpeg_quality=60) (warning: this causes segfaults in the in-book dictionary due to a bug in nickel with firmware versions below 4.20.14601).
Successfully wrote 73947 entries from 1 dictfile(s) to dictzip dicthtml.zip.

Does the output look correct? That HTML entry looks strange :).

So if I understood correctly, I can keep samename1, samename2, samename3 and they should be parsed correctly and displayed on Kobo?

> Regarding backward(s), you'll have to use variants to match all forms of the word (also note that Kobo does more than just exact matches).

This is not clear to me. Do I need to create two entries, like this?

@ backward
& backwards

The source for my dictionary is a set of text files. I wrote my own program to parse them, so I can manipulate the content before assembling the df file. I just want to do it correctly.

Thank you very much.

pgaskin commented 2 years ago

The errors are expected. Note that if the specifier characters are intended to be part of the content and are at the beginning of a line, you need to escape them by putting a space before them.

That HTML entry looks fine, but was it printed with the output? (It shouldn't be doing that.) Does the dictionary work correctly on your Kobo?
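As a rough illustration of that escaping rule, a generator program could run each definition through a small filter before writing it to the df file. This is only a sketch: `escape_definition` and `SPECIFIERS` are hypothetical names, not part of dictutil, and the set of special characters is assumed from the markers that appear in this thread (`@`, `:`, `&`).

```python
# Hypothetical helper, not part of dictutil.
# Assumption: the line-start markers to escape are the ones used in this
# thread (@ headword, : header info, & variant).
SPECIFIERS = ("@", ":", "&")


def escape_definition(text):
    """Prepend a space to every definition line that starts with a specifier."""
    escaped = []
    for line in text.splitlines():
        if line.startswith(SPECIFIERS):
            line = " " + line
        escaped.append(line)
    return "\n".join(escaped)


if __name__ == "__main__":
    print(escape_definition("<html>ratio\n: as in 1:2\nplain line"))
```

Lines that do not start with one of those characters pass through unchanged, so the filter can be applied to every definition unconditionally.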

You are correct about the backwards part.
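A minimal sketch of how a generator could turn a headword such as backward(s) into a plain headword plus a variant, as confirmed above. The function names are hypothetical and the sketch assumes one optional group per headword; with several groups it only emits the shortest and the longest form.

```python
import re

# Matches one parenthesised optional part, e.g. "(s)" in "backward(s)".
OPTIONAL = re.compile(r"\(([^()]*)\)")


def expand_optional(headword):
    """Return (short_form, variants) for a headword with optional letters."""
    short = OPTIONAL.sub("", headword)     # "backward(s)" -> "backward"
    full = OPTIONAL.sub(r"\1", headword)   # "backward(s)" -> "backwards"
    variants = [] if full == short else [full]
    return short, variants


def dictfile_entry(headword, definition):
    """Build the text of one df entry: @ headword, & variants, definition."""
    word, variants = expand_optional(headword)
    lines = ["@ " + word]
    lines += ["& " + v for v in variants]
    lines.append(definition)
    return "\n".join(lines)


if __name__ == "__main__":
    print(dictfile_entry("backward(s)", "<html><i>adverb</i> description..."))
```

Headwords without parentheses come back unchanged with an empty variant list, so the same function can be used for every entry.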

thehijacker commented 2 years ago

First I need to redo the df generation: fix these (x) words so that the headword has no letters in parentheses and the word with the letters included is just an alias.

Regarding the alias words (&): can I use a word as an alias but still give it its own definition? For example:

@ word
& alias_word
: \pronunciation\
<html>html code

@ alias_word
: \pronunciation\
<html>html code

> That HTML entry looks fine, but was it printed with the output? (It shouldn't be doing that.) Does the dictionary work correctly on your Kobo?

Yes, the HTML code was in the output.

I have not yet tested the zip file generated by dictgen on the Kobo; I need to make the df perfect first :). I have only tested the one made with PyGlossary. It worked, but not perfectly: all the words with a number at the end were shown with the number, and Kobo displayed them as a later match rather than a direct one. For example, the word almost was not found, but Kobo displayed almost1, which is in the dictionary.

I need to test this with dictgen. Will words like test1, test2, test3 be found on the Kobo when I highlight the word test? Or must I merge them into one dictionary headword?

thehijacker commented 2 years ago

I have finished with the (x) words. They are now aliases, and there are no more words like this in my df. I have also fixed the blank alias errors and made the aliases unique. dictgen no longer reports errors when generating.

It created a valid zip file and it is working on my Kobo (Libra 2). I still see the HTML code in the output of dictgen. This code is from the first headword in the df file:

@ A
<html><i>abbreviation</i> <i>chemistry</i> argon; absolute; Academy; America(n)

@ A
<html>1 <i>abbreviation</i> first class

Here is the output:

c:\Projekti\KoboDictionaryMaker\KoboDictionaryMaker\bin\Debug>dictgen-windows.exe -o dicthtml-en-sl.zip en-sl.df
Parsing dictfiles.
<html><w><p><a name="A" /><b>A</b></p><var></var><i>abbreviation</i> <i>chemistry</i> argon; absolute; Academy; America(n)</w></html><nil>
Opening output.
Generating dictzip.
  Using image method: optimize and encode as base64 data URL (max_width=1000, max_height=1000, grayscale=false, jpeg_quality=60) (warning: this causes segfaults in the in-book dictionary due to a bug in nickel with firmware versions below 4.20.14601).
Successfully wrote 73947 entries from 1 dictfile(s) to dictzip dicthtml-en-sl.zip.

I also tested this dictionary on my Kobo. It seems to work fine for 99% of words. But words with several meanings, stored as entries that differ only by a number suffix, are not found. For example:

(screenshot: screen_001)

Can you suggest how to solve this properly? Should I merge all these entries into one headword (disguise in my case) and separate the word descriptions as HTML lists?

What would be the best example of a df file that already has all these features and that I could use as a reference?

pgaskin commented 2 years ago

To merge separate entries, you'd need to do that in the tool generating the df file.
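One way that merge step could look in the generating tool is sketched below. Everything here is an assumption for illustration: `merge_entries` is a hypothetical name, entries are taken as (headword, HTML fragment) pairs without the `<html>` prefix, and a trailing number is always treated as a sense index, which would also strip digits from headwords that legitimately end in one.

```python
import re


def merge_entries(entries):
    """Merge (headword, html) pairs whose headwords differ only by a
    trailing number (test1, test2, ...) into one entry per base headword."""
    merged = {}  # base headword -> list of definitions, in input order
    for headword, definition in entries:
        base = re.sub(r"\d+$", "", headword) or headword
        merged.setdefault(base, []).append(definition)

    result = []
    for base, definitions in merged.items():
        if len(definitions) == 1:
            body = definitions[0]
        else:
            items = "".join("<li>" + d + "</li>" for d in definitions)
            body = "<ol>" + items + "</ol>"
        result.append((base, body))
    return result


def to_dictfile(entries):
    """Render merged entries as df text, using the <html> prefix seen above."""
    return "\n\n".join("@ " + w + "\n<html>" + body for w, body in entries)


if __name__ == "__main__":
    sample = [("test1", "<i>noun</i> first"), ("test2", "<i>verb</i> second")]
    print(to_dictfile(merge_entries(sample)))
```

Headwords that occur only once keep their definition as-is, so only the numbered duplicates end up wrapped in a list.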

For examples, look at the GOTDict and Webster's df files (there are tools to generate them in the releases). Those have a few words with multiple separate entries for the same headword, merged entries, images, variants, and other features.

thehijacker commented 2 years ago

Hello. I have finished my dictionary. I merged what could be merged inside the df file, and it looks and works great. Thank you for a great tool.