pgaskin / dictutil

Tools, documentation, and libraries related to Kobo dictionaries.
https://pgaskin.net/dictutil
MIT License

Possible problem with marisa-build #15

Open jackiew1 opened 3 years ago

jackiew1 commented 3 years ago

@pgaskin No rush, but when you have some time please will you have a look at this.

Back in Feb 2020 you were kind enough to create some Windows 64bit .exe versions of various Marisa functions. I'm wondering whether my current copy of marisa-build.exe is out-of-date for the purpose of updating my custom dictionaries to include the new prefix_exceptions file?

Here's the details behind my question. It was all going so well for a while ...

It was straightforward to create a prex.txt file (LF line-endings) containing variant_word tab headword_prefix for the variant words which do not share the same prefix as their headword, and then to build prefix_exceptions from it using:

marisa-build.exe -o prefix_exceptions prex.txt

All appeared to be OK and I rebuilt the custom dictionary with no problems. It even seemed to work after installing on the Kobo.

The problem arose when I decided to double-check everything by converting prefix_exceptions back to TXT format using marisa-dump.exe:

marisa-dump.exe prefix_exceptions > prex_marisa_dump.txt

I already knew to expect that prex_marisa_dump.txt would be created with CRLF line-endings and that it wouldn't be in the same sort sequence as the original prex.txt. Having allowed for those differences the 2 files ought to match.

However, the 2 files don't match because every input entry for prefix '11' comes back out of marisa-dump with the tab and the '11' missing. For all other prefixes the dumped entry matches the input entry, e.g.

AOK\t11\n gets dumped as AOK\r\n (original headword A-OK)
AOK\tao\n would get dumped correctly as AOK\tao\r\n
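The round-trip check described above can be sketched in Go. This is only an illustration, not the tool used in this thread: the function names normalize and missing are hypothetical, and the sample strings reuse the AOK entries quoted above. It drops the CR from CRLF line-endings, sorts both sides, and reports input entries that did not survive the dump.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// normalize splits text into entries, strips a trailing CR from each line,
// drops empty lines, and sorts the result so that ordering and line-ending
// differences between the two files are ignored.
func normalize(text string) []string {
	var out []string
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSuffix(line, "\r")
		if line != "" {
			out = append(out, line)
		}
	}
	sort.Strings(out)
	return out
}

// missing returns the entries of want that do not appear in got.
func missing(want, got []string) []string {
	seen := make(map[string]bool, len(got))
	for _, g := range got {
		seen[g] = true
	}
	var out []string
	for _, w := range want {
		if !seen[w] {
			out = append(out, w)
		}
	}
	return out
}

func main() {
	// Sample data modelled on the entries quoted in this comment.
	src := "AOK\t11\nAOK\tao\n"
	dump := "AOK\tao\r\nAOK\r\n"
	for _, e := range missing(normalize(src), normalize(dump)) {
		fmt.Printf("lost in round trip: %q\n", e)
	}
}
```

In practice the two strings would be the contents of prex.txt and prex_marisa_dump.txt read from disk.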

At first I thought marisa-dump.exe might be the problem but now I don't think so because when I used it to dump the copy of prefix_exceptions from the official new dicthtml.zip all the prefix '11' entries showed correctly in the dumped TXT. So it must be:

For completeness I did a similar marisa-build / marisa-dump 'round-trip' with the words file. I had no problem matching the dumped TXT back to the original input index TXT for (headwords + variants).

If I need to provide any extra info just ask.

pgaskin commented 3 years ago

This won't be an issue with the Marisa binaries themselves, but it's possible the tools don't support building tries with tabs. In any case, my in-progress version of dictutil works fine with those files without any updates to Marisa. I'll send you a build of it once I finish with the new numbers in the words trie. Alternatively, since you're doing everything manually, if you're fine writing a small amount of Go code, you could use the github.com/pgaskin/dictutil/marisa package directly to build the trie from a list of strings (either hard-coded or read with os.Open and bufio.Scanner), then use ioutil.WriteFile to write it (I might have time to do that myself later today).
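A minimal sketch of the input-reading half of that suggestion, assuming one entry per line with embedded tabs kept as part of the key. The names readKeys and keysFromString are hypothetical; keysFromString only stands in for a file opened with os.Open. The step of passing the resulting slice to the github.com/pgaskin/dictutil/marisa package is deliberately left out, because that package's function names are not shown in this thread.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readKeys reads one key per line. bufio.Scanner's default line splitting
// removes the trailing LF (and a CR before it), but leaves tabs untouched,
// so "AOK\t11" stays a single key.
func readKeys(r io.Reader) ([]string, error) {
	var keys []string
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if line := sc.Text(); line != "" {
			keys = append(keys, line)
		}
	}
	return keys, sc.Err()
}

// keysFromString is a convenience wrapper for demonstration; a real tool
// would pass the *os.File returned by os.Open to readKeys instead.
func keysFromString(s string) ([]string, error) {
	return readKeys(strings.NewReader(s))
}

func main() {
	keys, err := keysFromString("AOK\t11\nAOK\tao\n")
	if err != nil {
		panic(err)
	}
	for _, k := range keys {
		fmt.Printf("%q\n", k) // each key still contains its tab
	}
}
```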

jackiew1 commented 3 years ago

I'm afraid I'm not OK writing Go code but I've found a grubby workaround that seems to work - still testing.

If I stick an extra tab at the end of the line for var_words which redirect to prefix '11', i.e. write the prefix exceptions to TXT as AOK\t11\t\n rather than AOK\t11\n then I now get a correct lookup for word 'AOK'

I suppose it's Sod's Law that I didn't have this "bright idea" (possibly) before going to the trouble of writing it all down.

jackiew1 commented 3 years ago

My version of marisa-build definitely has a problem if the input file has 2 fields per line, separated by a tab, when the 2nd field is all digits. The problem does not occur if the input file has only one field and it's all digits.

So in Kobo terms there is no problem creating the words file. The prefix_exceptions file will only be problematic if the 2nd field (redirect to prefix) is all digits, i.e. '11'
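The workaround can be applied mechanically before running marisa-build. The following is a rough sketch, assuming the behaviour is exactly as described above (only two-field lines whose 2nd field is all digits are affected); patchLine and allDigits are hypothetical names, not part of any existing tool.

```go
package main

import (
	"fmt"
	"strings"
)

// allDigits reports whether s is non-empty and consists only of ASCII digits.
func allDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// patchLine appends a trailing tab to a "word<TAB>prefix" line when the
// prefix is all digits, and returns every other line unchanged. Lines that
// already carry the extra tab split into three fields and are left alone.
func patchLine(line string) string {
	parts := strings.Split(line, "\t")
	if len(parts) == 2 && allDigits(parts[1]) {
		return line + "\t"
	}
	return line
}

func main() {
	for _, l := range []string{"AOK\t11", "AOK\tao", "11"} {
		fmt.Printf("%q -> %q\n", l, patchLine(l))
	}
}
```

Single-field lines such as those in the words file pass through untouched, which matches the observation that the words file builds correctly.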

I don't know how many existing MobileRead custom dictionaries use variants at all - probably very few.

Anyway, my workaround seems to work OK for me, so you can close this if you don't think marisa-build has an issue. I'm not sure I agree with you but maybe it's different on Linux.

pgaskin commented 3 years ago

> I don't know how many existing MobileRead custom dictionaries use variants at all - probably very few.

None of the Penelope ones do, as it just discards variants of any kind. All of my personal dictionaries do, and so do about half the dictfiles I've received in support emails.

> Anyway, my workaround seems to work OK for me, so you can close this if you don't think marisa-build has an issue. I'm not sure I agree with you but maybe it's different on Linux.

What I meant was that marisa-build is a frontend to the actual Marisa library. There wouldn't be an issue with the Marisa library itself, so the problem is probably in the marisa-build frontend. Thus, you can use my Go bindings to the Marisa library to create a custom frontend for it which doesn't have the parsing issues.