Open Moonbase59 opened 2 years ago
Not sure what you mean by bilingual-reverse. The way the dictionaries work is that they have 1 or 2 indices (or more, but that has no use currently and I don't think it works). The 2 indices ones are general translation/bilingual ones, and they work for both directions. And it is indeed only possible to generate translation dictionaries from text file input. It probably would not be that hard to change, but I've not really paid much attention to the single language dictionaries since I find them rather inconvenient to use UI-wise anyway (often need to click through to the HTML page instead of having the relevant information right there).
Interesting. Not having single-language dictionaries is sad—I, for example, almost never use "translation" dictionaries, but am instead interested in correct spelling, pronunciation, hyphenation, inflection and probably etymology of a word and thus regularly use Duden, Webster’s and Oxford (paper editions) or the dictionaries in my ebook reader (a Tolino Vision 5). Translation dictionaries usually don’t have that much information because they serve a different purpose. The Tolino e-readers even have different context menu entries: "Look up" ("Nachschlagen") and "Translate" ("Übersetzen").
Now your dictionary format(s) (and the Android app) have proved to be long-lived, rather small and performant, even on low-end hardware like an e-reader. Even to the point that all Tolinos include yours (albeit using v6), and on support requests tell users to go to your GitHub releases and download "more dictionaries" from here.
Unfortunately (and I can understand that!), you seem not to have time to adapt to the ever-changing Wiktionary styles, and thus, your dictionaries have sadly degraded to the point of being (almost) unusable, because they nowadays contain pages over pages of badly-formatted Wiktionary code fragments (see screenshots in https://github.com/rdoeffinger/Dictionary/issues/147).
Well, we users of your code could help out and find new ways of filtering, or create our own dictionaries. I thought. But I, personally, don’t speak Java, so I can’t change the code, and the dictionary formats are complicated ("I like it", the programmer of pyglossary once said, "but the dictionary format is too complicated to include in pyglossary."). Thus, no third-party tools either, and we users must rely on what you give us. Which, in turn, is great and performant, but lacks documentation for the "average Joe".
So, we kindly ask you for help. What can we do to assist?
For instance, the tab-separated dictionary of which I showed a part above was created using a (rather speedy) Rexx script by user "tscho" on the E-Reader Forum. It converts the Wiktionary XML dump and does a great job of reducing the output to the necessary. You only need to install "regina" on Linux (`sudo apt install regina` on Debian derivatives), run the script, then replace all TAB characters with nothing, and finally replace every `   <` (3 blanks and less-than) with `\t<`, to make it a working input for, say, pyglossary, to create a StarDict dictionary, for instance.
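The two replacement steps described here could be sketched in Python roughly as follows. This is only an illustration of the described post-processing, not part of the Rexx script; the function name `normalize_line` and the sample line are made up.

```python
def normalize_line(line: str) -> str:
    """Turn one line of the script's output into 'headword<TAB>content'."""
    # Step 1: drop every existing TAB so that only the separator added
    # in step 2 remains in the line.
    without_tabs = line.replace("\t", "")
    # Step 2: three blanks followed by '<' mark where the content starts;
    # replace them with a single TAB followed by '<'.
    return without_tabs.replace("   <", "\t<")


if __name__ == "__main__":
    sample = "Aal   <p>Fisch\t</p>"  # made-up sample line
    print(repr(normalize_line(sample)))
```

The order matters: removing the stray TABs first ensures that the only TAB left in each line is the headword separator.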
Maybe that can help to make your dictionary content better again?
I’m sure many, many users of your Android app, and especially Tolino users, would be really happy to get great new Wiktionary-based dictionaries, plus a documented way to create their own!
If there is a kind of "community" or just some people who are willing to invest some effort, I will give this higher priority and try to look into it. I guess I am wondering if this tab-delimited format is the most sensible one, though. I mean, if you could choose any formatting that is simple enough to describe and parse, would you choose that? Because implementing a different format might not be more work than adjusting this one, really (I would need to double-check that though).
Also, for example, the dictionary format to my understanding (which I would need to check against the specification) can do:
- Heading (with optional link to HTML page)
- text across the whole page
- text with left side and right side
Possibly that could be used to turn the single-language dictionaries into something even I would agree is useful. For example, instead of just the heading with the link to the HTML page, have also 1-2 lines of description of the word visible right there on the main page. Of course this doesn't all HAVE to be designed up-front, but especially if not, this might need a bit of longer-term engagement to get it nice and polished.
I admit it is far too technical at the moment, but maybe a look at https://github.com/rdoeffinger/Dictionary/blob/master/dictionary-format.txt would be useful as well. For example this part gives the different types of entries the dictionary consists of:
That’d be great, thanks for considering! I do think there are actually two main types of usage:
This may well be the reason for the Tolinos to have separate "Look up" and "Translate" functions—and I love it.
For creating "personal" dictionaries like this (maybe a technical dictionary), users might have different preferences on what they want to be included in the final output. I have long been thinking about some kind of templates for this, since Wiktionary is quite well-structured. One could think about creating "Wiktionary output templates" (per language, because Wiktionary structure differs so much), and thus, for instance, allow creators to include or exclude things like inflection, etymology, and sample sentences. This would of course be some extra work, but might pay off. (Assuming you would want to document/open it more for creators.)
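To make the template idea a little more tangible, here is a minimal Python sketch. Everything in it is hypothetical: the section names, the `LEARNER_TEMPLATE` list and the `render_entry` function are invented for illustration and do not correspond to anything in the existing Java code.

```python
from html import escape

# Hypothetical template: which sections to show, and in which order.
LEARNER_TEMPLATE = ["pronunciation", "hyphenation", "meanings"]


def render_entry(headword: str, sections: dict[str, str], template: list[str]) -> str:
    """Render only the sections named in the template as one HTML line."""
    parts = [f"<b>{escape(headword)}</b>"]
    for name in template:
        text = sections.get(name)
        if text:  # skip sections this entry does not have
            parts.append(f"<p><i>{escape(name)}:</i> {escape(text)}</p>")
    return "".join(parts)


if __name__ == "__main__":
    entry = {"meanings": "a fish", "etymology": "made-up placeholder"}
    print(render_entry("Aal", entry, LEARNER_TEMPLATE))
```

A creator who wants etymology included would only have to add that name to the list, which is the kind of include/exclude switch meant above.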
I should think both E-Reader Forum (German) and MobileRead (mainly English) might be good communities to find people willing to help. E-Reader forum seems the more active one, especially for Tolino.
And yes, you’re completely right—the "one line per entry, tab-separated" format was just a measure of last resort in my case, since I so much wanted "readable" EN and DE single-language dictionaries (and, well, also an EN-DE and DE-EN "translation" version, for the rare cases…).
For the input, I’d actually prefer Wiktionary XML dumps, plus kinda templates for output (predefined "standard" templates included that’d only show the important content, like maybe "Advanced Learner’s" level). Fortunately, most devices can render HTML (at least partially), so this could be easy. The "standard" output should be able to fit on a rather small device (smartphone) as well as larger devices (like 7" or 10" e-readers).
This would be for the "main" dictionaries (and I much like the way you currently work on the "complete sets"—it takes ages but produces rather nice lang1-lang2 dictionaries, and Wiktionary will evolve…).
For user-created content (like technical or neologism dictionaries), I’d prefer a simpler input format, maybe even down to "tab-separated", since these will most often be single-language anyway. I think.
I think that's already a bit too many options there. The "Wiktionary output templates" idea needs to be far more concrete and something doable with a few hours of effort from my side. For example, I could probably add that the "HTML" content gets piped to some external program for formatting. But even if that means it can be a Python or shell script, it still needs someone to write it. I am not sure whether anyone who can do that could not almost as well go to src/com/hughes/android/dictionary/parser/wiktionary/AbstractWiktionaryParser.java and modify the HTML text there like the replaceSuperscript function does, so does that win anything?
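For scale, such an external formatter could be as small as the Python sketch below. The interface is purely an assumption (one entry's HTML in, formatted HTML out), since the piping feature does not exist yet, and the clean-up in `format_html` is only a placeholder example.

```python
import io
import re
from typing import TextIO


def format_html(html: str) -> str:
    """Placeholder clean-up: drop empty paragraphs, collapse whitespace runs."""
    html = re.sub(r"<p>\s*</p>", "", html)
    return re.sub(r"\s+", " ", html).strip()


def run_filter(source: TextIO, sink: TextIO) -> None:
    """Read one entry's HTML from source and write the formatted HTML to sink."""
    sink.write(format_html(source.read()))


if __name__ == "__main__":
    # Demo with in-memory streams and a made-up entry.
    demo_in = io.StringIO("<p> </p><p>Aal   ist ein Fisch</p>")
    demo_out = io.StringIO()
    run_filter(demo_in, demo_out)
    print(demo_out.getvalue())
```

Used as a real pipe target, the script would call `run_filter(sys.stdin, sys.stdout)` instead of the in-memory demo.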
And I forgot, the feature of using anything but purely the HTML entries is even missing from the wiktionary parser. So that part is not even really solved. I mean it's nice to have a long-term vision (but maybe it should be a bit more concrete), but there's also the question what a realistic first step would be, just making sure it's one that leads in the right direction.
Yeah, but it takes a vision to get inspired and motivated, and first feasible steps to get going. Maybe we’re both right. ;-)
The work you’ve done already is greatly appreciated, and your willingness to possibly invest some more time to it is even more appreciated. Possibly even by many, many Tolino users that don’t even know it. ;-)
My thoughts on "steps":
`src/com/hughes/android/dictionary/parser/wiktionary/AbstractWiktionaryParser.java` might be a good first-first step, thanks for the pointer!
Filtering out the leftover Wiktionary markup (`{K|`, `{Ü` etc. groups) should maybe be step 1. At least this would allow creating "clean" versions of what you already have, a big step forward! You might get some inspiration from the quoted Rexx script here: it produces rather clean results (and is fast like hell).
I actually just created a post in E-Reader Forum, asking people for assistance. Let’s see what comes out of it.
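As a rough idea of what removing such `{K|` and `{Ü` groups could look like, here is a small Python sketch. It is not how the Java parser or the Rexx script does it; the regular expression and the sample text are assumptions, and it only handles closed, non-nested brace groups.

```python
import re

# Matches one closed brace group such as {{K|...}} or {Ü|...}.
# Nested groups and unclosed fragments are deliberately not handled.
TEMPLATE_GROUP = re.compile(r"\{+[^{}]*\}+")


def strip_template_groups(text: str) -> str:
    """Remove closed brace groups and tidy up the blanks they leave behind."""
    cleaned = TEMPLATE_GROUP.sub("", text)
    return re.sub(r" {2,}", " ", cleaned).strip()


if __name__ == "__main__":
    sample = "{{K|Zoologie}} ein Fisch {\u00dc|en|eel}"  # made-up sample
    print(strip_template_groups(sample))
```

Fragments that were already cut off before their closing brace stay untouched and would need a separate rule.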
And of course I’m willing to help, too, although I don’t speak Java. Maybe I could test, or suggest stuff, or whatever.
Not sure if this workaround works on the Tolino 5, but on the Shine you can create a translation dictionary with the same language in lang1 and lang2 and, when long-tapping a word, select "translate" instead of "lookup".
FYI: I just added support for the quickdic format (version 6) in pyglossary: https://github.com/ilius/pyglossary/pull/509
Thanks, I am glad someone made use of the specification I wrote! I really disliked the old format for its use of Java serialization, which made it very hard to use the format in languages other than Java. Luckily it was possible to reduce it to a small set of "magic bytes". You say your implementation differs from the specification to be compatible with Tolino. How? The specification is meant to document the format needed for Tolino, and the dictionaries I generated using it worked on mine (I think)? Also, how did you solve the sorting issue? Even using a different version of libicu results in dictionaries that do not work properly, which is very problematic. If you only tested ASCII, you might not have noticed.
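Why ASCII-only tests can hide a sorting problem can be illustrated with plain Python. The key function below is a crude stand-in and NOT the ICU collation the dictionaries actually rely on; it only shows that the index order changes as soon as non-ASCII headwords are involved.

```python
import unicodedata


def folded_key(word: str) -> str:
    """Crude sort key: strip combining marks, then case-fold (not ICU!)."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


# "\u00c4rger" is "Ärger" with a precomposed A-umlaut.
words = ["Zug", "\u00c4rger", "Apfel"]
by_code_point = sorted(words)                  # plain code point order
by_folded_key = sorted(words, key=folded_key)  # umlaut sorted next to "A"

if __name__ == "__main__":
    print(by_code_point)
    print(by_folded_key)
```

With ASCII-only headwords both orders can coincide, so a reader doing binary search on the index would only start missing entries once words like these appear.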
I have a Tolino Shine 3. In the settings, I can download dictionaries from the "Tolino Cloud". This adds `*.quickdic` files to my file system in the folder `.tolino/dictionaries/`. I designed the pyglossary plugin to be compatible with these files, both reading and writing.
After looking at your specification again, and at the Java code, I have to reformulate my original statement: "The implementation agrees with the format specification, but it might or might not differ from the Java reference implementation in some aspects. My priority was to be compatible with Tolino ebook readers. While tweaking my implementation, I didn't always check that everything I did agreed with the Java code."
These are the things that caused most headaches for me when implementing the plugin:
`&z<ȝ` to reproduce the sorting of the Tolino dictionaries.
I’m at a loss. I try to generate a single DE dictionary from a cleaned-up, tab-separated DE Wiktionary download.
The source files (I created a one-entry test) look like this:
(There is a TAB after the initial "Aal". All other tab characters inside the content were removed.)
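A quick way to confirm that every line really has this shape (headword, exactly one TAB, then the content) is a small check like the Python sketch below. The function name `find_bad_lines` and the sample lines are made up for illustration.

```python
def find_bad_lines(lines: list[str]) -> list[int]:
    """Return 1-based numbers of lines that are not 'headword<TAB>content'."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # Exactly one TAB, with non-empty text on both sides of it.
        if len(fields) != 2 or not all(field.strip() for field in fields):
            bad.append(number)
    return bad


if __name__ == "__main__":
    sample = ["Aal\t<p>Fisch</p>\n", "Aal <p>no tab</p>\n"]  # made-up lines
    print(find_bad_lines(sample))
```

Running it over the whole source file before generating the dictionary would rule out stray TABs as the cause of a failure.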
Since there is not much documentation, I tried the following command (on Linux):
(taking over the spelling error "dewikitionary")
But this gives an error:
So I also tried adding `lang2=DE`:
This does produce a dictionary (that can even be converted to v006 format using `./genv6.sh`):
Only when I try to use this dictionary on my Tolino Vision 5, it makes the Tolino app crash. (I’m assuming there are differences between a) single-language, b) bilingual, and c) bilingual-reverse dictionaries?)
Is it even possible to create functioning single-language dictionaries (with HTML content, one line per entry) from TAB-separated text files?
Maybe I just made a simple mistake?