Dictionary handling changes in 4.24.15672

pgaskin commented 3 years ago

I will need to test everything again and see which bugs have been fixed and what other changes have been made.

See pgaskin/kobopatch-patches#76 for some preliminary notes.

I will probably do this in two releases: A minor release for the new installation process and list of pre-installed dictionaries later this week, and a major one within the next few weeks for the new v3 format and matching rules. Each release will consist of documentation and tool updates.

pgaskin commented 3 years ago

I've merged #13 into this issue.

pgaskin commented 3 years ago

Notes about .kobo/custom-dict handling:

The filename must match the glob dicthtml* or it will be ignored.
If the name is in the format dicthtml-XXX.zip where XXX is a valid language code, the dictionary will display as the language name plus (Custom). This will still work correctly if there is already a built-in dictionary for that language.
If the name is in the format dicthtml-XXX-YYY.zip where XXX and YYY are valid language codes, the dictionary will be displayed as the two language names separated by - plus (Custom) at the end (i.e. the same rule, but for translation dictionaries).
If it matches the glob dicthtml* and does not match either of the two rules above, the display name will be the same as the filename (case is preserved).
ExtraLocales does not affect the dictionary locales until a reboot. If a locale is listed in ExtraLocales, it is treated the same way as valid language codes (see above) and is prefixed with Extra:.
The bug with spaces in the name from Extra: causing no words to be found has been fixed.
If the display name contains spaces (other than from Extra: and (Custom)), it will trigger the bug where it can't find words in the dictionary (the same thing as what used to happen with Extra: dictionaries).
The dictionary filename must end with .zip or be an extracted dictzip; otherwise, it will trigger the same bug. Extracted dictionaries work fine with all features, but the HTML files will be imported as books. An extracted dictionary will take priority over a packed one if they have identical names.
Note: Extracted dictionaries might have been supported in older firmware versions too.
Custom dictionaries have the correct file size and can be managed from the settings. They will not be overwritten by a sync.
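
A rough sketch of the naming rules above, for illustration only. The function name, the tiny language table, the language names in it, and the exact " - " separator spacing are assumptions; the firmware's real table is much larger, and ExtraLocales handling is not modelled here.

```python
import re

# Hypothetical subset of the language table (the real one lives in the firmware).
LANG_NAMES = {"en": "English", "fr": "French", "he": "Hebrew"}

def display_name(filename, custom=True, lang_names=LANG_NAMES):
    """Return the display name for a dictionary file, or None if it is ignored."""
    if not filename.startswith("dicthtml"):
        return None  # does not match the glob dicthtml*
    # dicthtml-XXX.zip or dicthtml-XXX-YYY.zip
    m = re.fullmatch(r"dicthtml-([^-.]+)(?:-([^-.]+))?\.zip", filename)
    if m:
        codes = [c for c in m.groups() if c is not None]
        if all(c in lang_names for c in codes):
            name = " - ".join(lang_names[c] for c in codes)
            return name + " (Custom)" if custom else name
    # Anything else matching the glob is shown under its filename, case preserved.
    return filename
```

Calling display_name("dicthtml-en-fr.zip") under these assumptions gives "English - French (Custom)", while a name such as dicthtml-OxAm.zip falls through to the filename rule.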

pgaskin commented 3 years ago

Notes about .kobo/dict handling:

Dictionary naming follows the same rules, but without the (Custom) part.
Adding additional dictionaries works fine here just like before.
Extracted dictionaries work and will not be imported as books.
Extracted dictionaries will be synced as if they were zip files (i.e. .zip will be appended to the directory name when attempting to sync it).
- But the file size will be calculated from the extracted dictionary. Most of the time, the sizes won't match, so the firmware will always think the download is out of date and re-download the zip (which won't even be used, since the extracted one exists and has priority). This essentially makes the extracted dictionary support useless for now. Note that the firmware doesn't extract dictionaries by itself at the moment, so I don't think it's meant to be used.
All dictionaries in this folder will have their size compared with the dictionary download server on sync. If it doesn't match, it is overwritten with the server version. If it doesn't exist, it is left untouched.
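
The sync decision described above can be sketched roughly as follows. This is a simplified model and not the firmware's code: sync_action and server_sizes are hypothetical names, and "doesn't exist" is read here as "the server has no dictionary with that name".

```python
def sync_action(name, local_size, server_sizes):
    """Decide what a sync does with one entry in .kobo/dict.

    name         -- file or directory name found in .kobo/dict
    local_size   -- size as computed on the device (for an extracted
                    dictionary this is the size of the extracted data)
    server_sizes -- hypothetical mapping of zip name -> size on the server
    """
    # Extracted dictionaries are synced as if they were zips.
    remote_name = name if name.endswith(".zip") else name + ".zip"
    server_size = server_sizes.get(remote_name)
    if server_size is None:
        return "keep"  # nothing on the server to compare against
    return "keep" if local_size == server_size else "overwrite"
```

With an extracted dictionary, local_size almost never equals the size of the server's zip, which is why the sketch returns "overwrite" for it on every sync.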

pgaskin commented 3 years ago

Notes about word matching:

The non-Japanese matching rules appear to be identical to the previous ones, based on a quick look at the code.
I haven't looked at the Japanese rules (I didn't do this before either).
Regexp matching appears to be the same as before.

pgaskin commented 3 years ago

Notes about dictzip v3 vs v2:

There is a new optional prefix_exceptions file, presumably as a better way to handle variants with different prefixes (dictutil currently works around this by duplicating the definition). I think Kobo finally realized that their own dictionaries were affected by that bug...
- Based on a quick skim of the code, it appears to search for the word in the prefix_exceptions as a Marisa trie, then split by a tab char and take the second part as the new word to use in place of the original one (the original one won't be checked). (TODO: test this)
- TODO: check if you can have multiple target prefixes for a word by adding more separated by tabs.
- TODO: determine the behaviour if there are multiple entries in the trie for the same word
A custom credit line can be added to the bottom of all entries in the dictionary (newline then the text in italics) by setting it as the zip global archive comment.
Everything else seems to be pretty much the same as v2.
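
As an illustration of the credit line note above, here is a minimal sketch that sets the global archive comment of an existing zip using Python's standard zipfile module. The helper name set_credit_line is hypothetical, and UTF-8 is an assumption; the notes above don't say which encoding the firmware expects.

```python
import io
import zipfile

def set_credit_line(zip_bytes, credit):
    """Return a copy of the zip with its global archive comment set to credit."""
    buf = io.BytesIO(zip_bytes)
    # Append mode keeps the existing members and rewrites the end record on close.
    with zipfile.ZipFile(buf, "a") as zf:
        zf.comment = credit.encode("utf-8")  # assumed encoding
    return buf.getvalue()
```

To use it on a file, read the dictzip into memory, pass the bytes through set_credit_line, and write the result back out.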

pgaskin commented 3 years ago

Notes about built-in dictionaries:

As mentioned in the release notes, there are 4 new built-in dictionaries.
dicthtml-en-ja-pgs.zip has been discontinued and will be automatically deleted on upgrade. I will still keep this in the list for dictutil for backwards compatibility.
New v3 download URL (dictutil will need to decide which to use based on the firmware version).

pgaskin commented 3 years ago

Other notes:

The bug with embedded images accessed with dict:/// is still there, but it isn't really a big deal.
TODO: I haven't tested this or looked into it, but it's possible the prefix_exceptions handling will cause a different bug for complex dictionaries: If a word is a variant in one file and an actual entry in another, won't the prefix_exceptions handling for the variant cause it to ignore the actual entry in the original file for the word prefix?
Handling for changed dictionaries has been improved.
The sync code has been simplified and refactored, but it doesn't do anything special beyond what is mentioned above for custom/built-in dictionaries, and it isn't much different from previous firmware versions.

pgaskin commented 3 years ago

I think I'm pretty much finished with finding the changes. I'll take another look once the new dictionaries are published, but I think I've found everything.

I still need to actually test and confirm the behaviour of prefix_exceptions.

The rest of the information in the comments above comes from a combination of reading the disassembly, hooking functions, and doing actual testing.

pgaskin commented 3 years ago

From my post on MobileRead:

prefix_exceptions is somewhat of a misnomer, since it doesn't actually make exceptions for prefixes. Instead, it should be called word_redirects, since it just changes the word being looked up to another if it matches exactly. The target file must still have a variant/word matching the new one, and the original file won't be looked in at all.

This also means that there's already a bug in prefix_exceptions, albeit the inverse of the reason why prefix_exceptions was created. Previously, with v2 dictionaries, variants with a different headword prefix wouldn't be found (I worked around this in dictutil by duplicating the entries). Now, if you have a headword named after a redirected variant, it won't be found. For example, with the previous v2 behaviour, the entry for go/went would need to be duplicated into go.html and we.html, and you could also have another unique definition titled went in we.html. With the new v3 behaviour, you can just define it as go/went in go.html and add a redirect entry like "went\tgo" to redirect it. But, this is where the new bug happens. Now, if you had a second entry in we.html named "went" (remember that Kobo dictionaries support multiple entries for a word), it won't be found since the word was redirected to "go". I can work around this bug by duplicating the headwords into the redirected files...which is just the counterpart to my previous workaround.
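
The go/went example can be modelled with a toy lookup. This is only a sketch of the 15672 behaviour as described above: plain dicts stand in for the Marisa trie and the prefix files, prefix() is a simplified stand-in for the real prefix rules, and all names and sample entries are hypothetical.

```python
def prefix(word):
    # Simplified assumption: the real rules handle short words, symbols, etc.
    return word.lower()[:2]

def parse_redirects(lines):
    """Parse entries like "went\tgo": split on tab, the second part is the new word."""
    redirects = {}
    for line in lines:
        parts = line.split("\t")
        if len(parts) >= 2:
            redirects[parts[0]] = parts[1]
    return redirects

def lookup_15672(word, files, redirects):
    # The word is replaced outright; the original word's file is never checked.
    word = redirects.get(word, word)
    return files.get(prefix(word), {}).get(word, [])

# Toy dictionary: prefix file -> headword/variant -> entries.
FILES = {
    "go": {"go": ["go/went entry"]},
    "we": {"went": ["second went entry"]},
}
```

With the redirect "went\tgo" in place, lookup_15672("went", ...) only reaches the entry in go.html; the separate went entry in we.html is returned only when there is no redirect.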

pgaskin commented 3 years ago

The change made in 4.24.15676 appears to make it support multiple prefix exceptions and loop over them when looking up the definition. I will test it later today.

jackiew1 commented 3 years ago

@pgaskin Just a few additional notes:

If you have a built-in and a custom dictionary with the same name, e.g. dict/dicthtml.zip and custom-dict/dicthtml.zip, then although both work OK you cannot have the custom version "saved" as the default for a book, even if the custom version is the last one you used for a lookup. This means that every time you look up a new word, it always displays the Kobo built-in definition first. You can't cheat by naming the custom version custom-dict/dicthtml-en.zip, but you can by naming it custom-dict/dicthtml-en-en.zip.
The (Custom) suffix is easily patchable in libnickel.so.1.0.0.yaml. I changed mine to (*) for brevity but I haven't added an official patch for now.
If the language is RTL, e.g. Hebrew-English custom-dict/dicthtml-he-en.zip, you also get an interesting RTL display in the drop-down dictionary menu - (Custom) English - Hebrew (in Hebrew chars) - see attached.
If you want to use meaningful names for your custom dictionaries you have to be a bit careful. For example I tried to name my kobo-ised Kindle New Oxford American to dicthtml-NOA.zip and ended up with something unreadable (to me). Presumably looking up NOA in their language code table resulted in an accidental 'hit'. I also tried something like dicthtml-NewOxAmer.zip (I can't remember exactly) and got something equally unreadable. dicthtml-OxAm.zip works OK.

pgaskin commented 3 years ago

15676 changes:

Prefix exceptions now behave like prefix exceptions rather than word redirects. This is implemented by having DictionaryParser::htmlForWord take a second parameter which will override the prefix generation for the word. This has multiple (beneficial) implications:
- Variant lookup behaviour is now consistent with words without exceptions.
- It fixes the bugs in 15672 caused by the exception target being looked up as an entirely new word.
- It means we can now redirect to any arbitrary prefix file, and have the original word looked up in it instead (this would mainly be useful as a workaround if the prefixes for a dictionary exceed the maximum number of files or if a single prefix file gets too large).
- Since it's now consistent, we don't need to implement additional workarounds to deal with complexities introduced by the 15672 implementation of prefix_exceptions, and it's a lot more clear how they are meant to be used.
- Note that this is not compatible with 15672 (e.g. if you try and use the current v3 English dictionary on 15672, it won't work properly: "miaowing", which is a variant of "meowing", will show the entries for "me" on 15672 instead of "meowing" from the "me.html" file on 15676).
- TODO
There can now be multiple prefix exceptions for a word.
There are now 2 numbers after word trie entries. I'll have to figure out what these are for.
TODO
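
A toy model of the 15676 behaviour described above, for contrast with the 15672 word redirects. It is only a sketch: html_for_word loosely mirrors the second parameter of DictionaryParser::htmlForWord, plain dicts stand in for the trie and prefix files, prefix() is a simplified stand-in for the real prefix rules, and whether the word's own prefix file is also consulted when exceptions exist is not stated above (this sketch does not consult it).

```python
def prefix(word):
    # Simplified assumption: the real rules handle short words, symbols, etc.
    return word.lower()[:2]

def html_for_word(word, files, prefix_override=None):
    # The override replaces prefix generation; the word itself is unchanged.
    p = prefix_override if prefix_override is not None else prefix(word)
    return files.get(p, {}).get(word, [])

def lookup_15676(word, files, exceptions):
    """exceptions maps a word to a list of prefix files to look it up in."""
    targets = exceptions.get(word)
    if not targets:
        return html_for_word(word, files)
    results = []
    for target in targets:  # there can be multiple exceptions per word
        results.extend(html_for_word(word, files, prefix_override=target))
    return results

# Toy data: "miaowing" is stored as a variant inside the "me" prefix file.
FILES = {"me": {"meowing": ["meowing entry"], "miaowing": ["meowing entry"]}}
EXCEPTIONS = {"miaowing": ["me"]}
```

Here lookup_15676("miaowing", FILES, EXCEPTIONS) looks up the original word "miaowing" in the "me" file, rather than replacing it with a different word as in 15672.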

pgaskin commented 3 years ago

It appears they've crippled the old v2 dictionaries, at least the English one (they are now empty with a large file named "junk" filled with zeros). Presumably, the licensing expired for them. The file modification times show September 24 (the release date of 15676), but I don't think these were uploaded until October 1 (the release date for the new v3 dictionaries).

pgaskin / dictutil

Dictionary handling changes in 4.24.15672 #14