rdoeffinger / DictionaryPC

Java code to generate dictionaries for QuickDic Android app (see Dictionary repo). Fork of project that used to be hosted at code.google.com/p/quickdic-dictionary
Apache License 2.0

How to manually generate a single dict from a tab-separated file? #8

Open Moonbase59 opened 2 years ago

Moonbase59 commented 2 years ago

I’m at a loss. I am trying to generate a single DE dictionary from a cleaned-up, tab-separated DE Wiktionary download.

The source files (I created a one-entry test) look like this:

Aal <i>Substantiv, m</i><i>, Aal, Pl. Aa·le</i><br><i>IPA:</i> [aːl]<br><b>Bedeutungen:</b><br>1. Zoologie: schlangenförmiger Süßwasser- und Meerwasserfisch aus der Ordnung der Aalartigen (Anguilliformes)<br>2. umgangssprachlich, und in Zusammensetzungen: Fisch oder Wassertier von länglicher Gestalt<br>3. Seefahrt, von U-Boot-Fahrern verwendete Bezeichnung für: Torpedo<br>4. Soldatensprache: neu eingezogener, unvereidigter Soldat, Soldat ohne Dienstgrad<br>5. junge Kalmuspflanze<br><b>Synonyme:</b><br>3. Torpedo<br>4. Brenner, Rotarsch, Sprutz, Zecke<br><b>Beispiele:</b><br>1. Er hatte einen sehr großen Aal gefangen.<br>1. Auf der Silberplatte lagen geräucherter Aal und Graved Lachs.<br>2. Meinst du, Freund der Tiere, man könnte Aal dazu sagen?<br>3. „Als mir der Torpedo endlich klar gemeldet wurde, gingen wir auf Gefechtskurs und nach Erreichen der Schussposition gab ich den Befehl ‚Torpedo los!‘ Der Aal verließ vorbildlich das Rohr, wurde dabei illegal vom I. Wachoffizier fotografiert und tauchte elegant ins Wasser.“ <br>4. Da zeigten es aber die Altgedienten den Aalen.<br>5. Zu jungen Kalmuspflanzen sagte Ehrenfried immer Aal und Herta ärgerte das.<br><br><i>Substantiv, mf, Nachname</i><i>, Aal, Pl. Aals</i><br><i>IPA:</i> [aːl]<br><b>Bedeutungen:</b><br>1. unterdurchschnittlich häufig auftretender, deutscher und niederländischer Nachname, Familienname häufigstes Vorkommen in Deutschland im Kreis Kleve in Nordrhein-Westfalen); auch mit Namenzusatz de<br><b>Beispiele:</b><br>1. Herr Aal heiratete Frau Müller im Mai. Nun heißt er Aal und sie Müller-Aal.<br>1. Heute sind wir bei Aals zu Besuch.<br>1. „Aal?! Vortreten!“<br>1. Der Aal ist's gewesen und die Aal hat's verpetzt.<br>1. Frau Aal ist ein Genie im Verkauf.<br>1. Herr Aal wollte uns kein Interview geben.<br>1. Die Aals kommen heute von Wangerooge.<br>1. Der Aal trägt nie die Schals, die die Aal ihm strickt.<br>1. Das kann ich dir aber sagen: „Wenn die Frau Aal kommt, geht der Herr Aal.“<br>1. 
Aal kommt und geht.<br>1. Aals kamen, sahen und siegten.<br>

(There is a TAB after the initial "Aal". All other tab characters inside the content were removed.)
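A quick sanity check for such an input file might look like the sketch below. It assumes the layout described in this post (one headword, exactly one TAB, then the HTML content, all on one line); the function name `check_tab_separated` is made up for illustration and is not part of DictionaryPC.

```python
def check_tab_separated(lines):
    """Return (line_number, problem) tuples for lines that are not 'headword<TAB>content'."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text:
            continue  # blank lines are simply skipped in this sketch
        parts = text.split("\t")
        if len(parts) != 2:
            # Either the tab was lost or a stray tab is left inside the content.
            problems.append((number, f"expected 1 tab, found {len(parts) - 1}"))
        elif not parts[0].strip() or not parts[1].strip():
            problems.append((number, "empty headword or definition"))
    return problems


if __name__ == "__main__":
    sample = [
        "Aal\t<i>Substantiv, m</i><br>1. Zoologie: ...\n",
        "Aal <i>Substantiv, m</i>\n",      # tab replaced by a space
        "Aal\t<i>a</i>\t<i>b</i>\n",       # stray tab inside the content
    ]
    for number, problem in check_tab_separated(sample):
        print(f"line {number}: {problem}")
```

This only checks the shape of the file; it says nothing about whether the dictionary builder accepts the content.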

Since there is not much documentation, I tried the following command (on Linux):

./run.sh --lang1=DE --lang1Stoplist=data/inputs/stoplists/de.txt --dictOut=data/outputs/DE.quickdic --dictInfo="Wiktionary-based DE dictionary" --input1=data/inputs/MCH-DE-DE-test.txt --input1Name=dewikitionary --input1Charset=UTF8 --input1Format=tab_separated

(taking over the spelling error "dewikitionary")

But this gives an error:

Running with arguments:
--lang1=DE
--lang1Stoplist=data/inputs/stoplists/de.txt
--dictOut=data/outputs/DE.quickdic
--dictInfo=Wiktionary-based DE dictionary
--input1=data/inputs/MCH-DE-DE-test.txt
--input1Name=dewikitionary
--input1Charset=UTF8
--input1Format=tab_separated
lang1=de
lang2=null
normalizerRules1=:: Lower; 'ae' > 'ä'; 'oe' > 'ö'; 'ue' > 'ü'; 'ß' > 'ss'; 
normalizerRules2=null
dictInfo=Wiktionary-based DE dictionary
dictOut=data/outputs/DE.quickdic
Processing: data/inputs/MCH-DE-DE-test.txt

Dec 11, 2021 1:06:15 AM com.hughes.android.dictionary.parser.DictFileParser parse
INFO: count=0, line=Aal <i>Substantiv, m</i><i>, Aal, Pl. Aa·le</i><br><i>IPA:</i> [aːl]<br><b>Bedeutungen:</b><br>1. Zoologie: schlangenförmiger Süßwasser- und Meerwasserfisch aus der Ordnung der Aalartigen (Anguilliformes)<br>2. umgangssprachlich, und in Zusammensetzungen: Fisch oder Wassertier von länglicher Gestalt<br>3. Seefahrt, von U-Boot-Fahrern verwendete Bezeichnung für: Torpedo<br>4. Soldatensprache: neu eingezogener, unvereidigter Soldat, Soldat ohne Dienstgrad<br>5. junge Kalmuspflanze<br><b>Synonyme:</b><br>3. Torpedo<br>4. Brenner, Rotarsch, Sprutz, Zecke<br><b>Beispiele:</b><br>1. Er hatte einen sehr großen Aal gefangen.<br>1. Auf der Silberplatte lagen geräucherter Aal und Graved Lachs.<br>2. Meinst du, Freund der Tiere, man könnte Aal dazu sagen?<br>3. „Als mir der Torpedo endlich klar gemeldet wurde, gingen wir auf Gefechtskurs und nach Erreichen der Schussposition gab ich den Befehl ‚Torpedo los!‘ Der Aal verließ vorbildlich das Rohr, wurde dabei illegal vom I. Wachoffizier fotografiert und tauchte elegant ins Wasser.“ <br>4. Da zeigten es aber die Altgedienten den Aalen.<br>5. Zu jungen Kalmuspflanzen sagte Ehrenfried immer Aal und Herta ärgerte das.<br><br><i>Substantiv, mf, Nachname</i><i>, Aal, Pl. Aals</i><br><i>IPA:</i> [aːl]<br><b>Bedeutungen:</b><br>1. unterdurchschnittlich häufig auftretender, deutscher und niederländischer Nachname, Familienname häufigstes Vorkommen in Deutschland im Kreis Kleve in Nordrhein-Westfalen); auch mit Namenzusatz de<br><b>Beispiele:</b><br>1. Herr Aal heiratete Frau Müller im Mai. Nun heißt er Aal und sie Müller-Aal.<br>1. Heute sind wir bei Aals zu Besuch.<br>1. „Aal?! Vortreten!“<br>1. Der Aal ist's gewesen und die Aal hat's verpetzt.<br>1. Frau Aal ist ein Genie im Verkauf.<br>1. Herr Aal wollte uns kein Interview geben.<br>1. Die Aals kommen heute von Wangerooge.<br>1. Der Aal trägt nie die Schals, die die Aal ihm strickt.<br>1. 
Das kann ich dir aber sagen: „Wenn die Frau Aal kommt, geht der Herr Aal.“<br>1. Aal kommt und geht.<br>1. Aals kamen, sahen und siegten.<br>
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
    at com.hughes.android.dictionary.parser.DictFileParser.parseLine(DictFileParser.java:164)
    at com.hughes.android.dictionary.parser.DictFileParser.parse(DictFileParser.java:102)
    at com.hughes.android.dictionary.engine.DictionaryBuilder.main(DictionaryBuilder.java:153)
    at com.hughes.android.dictionary.engine.Runner.main(Runner.java:29)

So I also tried adding lang2=DE:

./run.sh --lang1=DE --lang2=DE --lang1Stoplist=data/inputs/stoplists/de.txt --dictOut=data/outputs/DE.quickdic --dictInfo="Wiktionary-based DE dictionary" --input1=data/inputs/MCH-DE-DE-test.txt --input1Name=dewikitionary --input1Charset=UTF8 --input1Format=tab_separated

This does produce a dictionary (that can even be converted to v006 format using ./genv6.sh):

Running with arguments:
--lang1=DE
--lang2=DE
--lang1Stoplist=data/inputs/stoplists/de.txt
--dictOut=data/outputs/DE.quickdic
--dictInfo=Wiktionary-based DE dictionary
--input1=data/inputs/MCH-DE-DE-test.txt
--input1Name=dewikitionary
--input1Charset=UTF8
--input1Format=tab_separated
lang1=de
lang2=de
normalizerRules1=:: Lower; 'ae' > 'ä'; 'oe' > 'ö'; 'ue' > 'ü'; 'ß' > 'ss'; 
normalizerRules2=:: Lower; 'ae' > 'ä'; 'oe' > 'ö'; 'ue' > 'ü'; 'ß' > 'ss'; 
dictInfo=Wiktionary-based DE dictionary
dictOut=data/outputs/DE.quickdic
Processing: data/inputs/MCH-DE-DE-test.txt

Dec 11, 2021 1:09:04 AM com.hughes.android.dictionary.parser.DictFileParser parse
INFO: count=0, line=Aal <i>Substantiv, m</i><i>, Aal, Pl. Aa·le</i><br><i>IPA:</i> [aːl]<br><b>Bedeutungen:</b><br>1. Zoologie: schlangenförmiger Süßwasser- und Meerwasserfisch aus der Ordnung der Aalartigen (Anguilliformes)<br>2. umgangssprachlich, und in Zusammensetzungen: Fisch oder Wassertier von länglicher Gestalt<br>3. Seefahrt, von U-Boot-Fahrern verwendete Bezeichnung für: Torpedo<br>4. Soldatensprache: neu eingezogener, unvereidigter Soldat, Soldat ohne Dienstgrad<br>5. junge Kalmuspflanze<br><b>Synonyme:</b><br>3. Torpedo<br>4. Brenner, Rotarsch, Sprutz, Zecke<br><b>Beispiele:</b><br>1. Er hatte einen sehr großen Aal gefangen.<br>1. Auf der Silberplatte lagen geräucherter Aal und Graved Lachs.<br>2. Meinst du, Freund der Tiere, man könnte Aal dazu sagen?<br>3. „Als mir der Torpedo endlich klar gemeldet wurde, gingen wir auf Gefechtskurs und nach Erreichen der Schussposition gab ich den Befehl ‚Torpedo los!‘ Der Aal verließ vorbildlich das Rohr, wurde dabei illegal vom I. Wachoffizier fotografiert und tauchte elegant ins Wasser.“ <br>4. Da zeigten es aber die Altgedienten den Aalen.<br>5. Zu jungen Kalmuspflanzen sagte Ehrenfried immer Aal und Herta ärgerte das.<br><br><i>Substantiv, mf, Nachname</i><i>, Aal, Pl. Aals</i><br><i>IPA:</i> [aːl]<br><b>Bedeutungen:</b><br>1. unterdurchschnittlich häufig auftretender, deutscher und niederländischer Nachname, Familienname häufigstes Vorkommen in Deutschland im Kreis Kleve in Nordrhein-Westfalen); auch mit Namenzusatz de<br><b>Beispiele:</b><br>1. Herr Aal heiratete Frau Müller im Mai. Nun heißt er Aal und sie Müller-Aal.<br>1. Heute sind wir bei Aals zu Besuch.<br>1. „Aal?! Vortreten!“<br>1. Der Aal ist's gewesen und die Aal hat's verpetzt.<br>1. Frau Aal ist ein Genie im Verkauf.<br>1. Herr Aal wollte uns kein Interview geben.<br>1. Die Aals kommen heute von Wangerooge.<br>1. Der Aal trägt nie die Schals, die die Aal ihm strickt.<br>1. 
Das kann ich dir aber sagen: „Wenn die Frau Aal kommt, geht der Herr Aal.“<br>1. Aal kommt und geht.<br>1. Aals kamen, sahen und siegten.<br>
Done: data/inputs/MCH-DE-DE-test.txt

Most common tokens:
  Aal@0(1)
Most common tokens:
  1@0(1)
  2@2(1)
  3@4(1)
  4@6(1)
  5@8(1)
  aːl@10(1)
  Aa@12(1)
  Aal@14(1)
  Aalartigen@16(1)
  Aalen@18(1)
  Aals@20(1)
  aber@22(1)
  Als@24(1)
  Altgedienten@26(1)
  Anguilliformes@28(1)
  ärgerte@30(1)
  auch@32(1)
  auf@34(1)
  Auf@36(1)
  auftretender@38(1)
  aus@40(1)
  b@42(1)
  Bedeutungen@44(1)
  Befehl@46(1)
  bei@48(1)
  Beispiele@50(1)
  Besuch@52(1)
  Bezeichnung@54(1)
  Boot@56(1)
  br@58(1)
  Brenner@60(1)
  Da@62(1)
  dabei@64(1)
  das@66(1)
  Das@68(1)
  dazu@70(1)
  de@72(1)
  den@74(1)
  der@76(1)
  Der@78(1)
  deutscher@80(1)
  Deutschland@82(1)
  die@84(1)
  Die@86(1)
  Dienstgrad@88(1)
  dir@90(1)
  du@92(1)
  Ehrenfried@94(1)
  ein@96(1)
  einen@98(1)
Writing dictionary to: data/outputs/DE.quickdic
sources start: 44
RAFList stats: 0x1 entries
pair start: 74
RAFList stats: 1x64 entries
uncompressed min 2085, max 2085, sum 2085, average 2085.0
compressed min 1120, max 1120, sum 1120, average 1120.0
text start: 1205
RAFList stats: 0x1 entries
html index start: 1212
RAFList stats: 0x64 entries
html data start: 1219
RAFList stats: 0x128 entries
indices start: 1227
RAFList stats: 1x32 entries
uncompressed min 14, max 14, sum 14, average 14.0
compressed min 20, max 20, sum 20, average 20.0
RAFList stats: 7x32 entries
uncompressed min 104, max 647, sum 3385, average 483.57144
compressed min 86, max 410, sum 2153, average 307.57144
RAFList stats: 0x1 entries
end: 5133

However, when I try to use this dictionary on my Tolino Vision 5, it crashes the Tolino app. (I’m assuming there are differences between a) single-language, b) bilingual, and c) bilingual-reverse dictionaries?)

Is it even possible to create functioning single-language dictionaries (with HTML content, one line per entry) from TAB-separated text files?

Maybe I just made a simple mistake?

rdoeffinger commented 2 years ago

Not sure what you mean by bilingual-reverse. The way the dictionaries work is that they have 1 or 2 indices (or more, but that has no use currently and I don't think it works). The ones with 2 indices are the general translation/bilingual dictionaries, and they work in both directions. And it is indeed only possible to generate translation dictionaries from text file input. It probably would not be that hard to change, but I've not really paid much attention to the single-language dictionaries since I find them rather inconvenient to use UI-wise anyway (you often need to click through to the HTML page instead of having the relevant information right there).

Moonbase59 commented 2 years ago

Interesting. Not having single-language dictionaries is sad—I, for example, almost never use "translation" dictionaries, but am instead interested in correct spelling, pronunciation, hyphenation, inflection and probably etymology of a word and thus regularly use Duden, Webster’s and Oxford (paper editions) or the dictionaries in my ebook reader (a Tolino Vision 5). Translation dictionaries usually don’t have that much information because they serve a different purpose. The Tolino e-readers even have different context menu entries: "Look up" ("Nachschlagen") and "Translate" ("Übersetzen").

Now your dictionary format(s) (and the Android app) have proved to be long-lived, rather small and performant, even on low-end hardware like an e-reader. So much so that all Tolinos include yours (albeit using v6), and Tolino support tells users to go to your GitHub releases and download "more dictionaries" from there.

Unfortunately (and I can understand that!), you seem not to have the time to adapt to the ever-changing Wiktionary styles, and thus your dictionaries have sadly degraded to the point of being (almost) unusable, because they nowadays contain pages upon pages of badly formatted Wiktionary code fragments (see screenshots in https://github.com/rdoeffinger/Dictionary/issues/147).

Well, we users of your code could help out and find new ways of filtering, or create our own dictionaries, I thought. But I, personally, don’t speak Java, so I can’t change the code, and the dictionary formats are complicated ("I like it", the programmer of pyglossary once said, "but the dictionary format is too complicated to include in pyglossary."). Thus, there are no third-party tools either, and we users must rely on what you give us. Which, in turn, is great and performant, but lacks documentation for the "average Joe".

So, we kindly ask you for help. What can we do to assist?

For instance, the tab-separated dictionary of which I showed a part above was created using a (rather speedy) Rexx script by user "tscho" on the E-Reader Forum. It converts the Wiktionary XML dump and does a great job of reducing the output to the necessary. You only need to install "regina" on Linux (sudo apt install regina on Debian derivatives), run the script, then replace all TAB characters with nothing, and finally replace all "   <" (3 blanks and less-than) with "\t<", to make it a working input for, say, pyglossary, to create a StarDict dictionary.
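The two replacement steps could be scripted roughly as follows. This is only a sketch of the clean-up described in this comment, applied to the output of the Rexx script; the function names and the file handling are assumptions, and it follows the description literally (every "   <" becomes a TAB plus "<", so a line containing that sequence more than once would end up with several tabs).

```python
import sys


def to_tab_separated(line):
    """Drop all existing tabs, then turn each '   <' (3 blanks + '<') into TAB + '<'."""
    without_tabs = line.replace("\t", "")
    return without_tabs.replace("   <", "\t<")


def convert_file(source_path, target_path):
    # UTF-8 is assumed here because the command above passes --input1Charset=UTF8.
    with open(source_path, encoding="utf-8") as source, \
            open(target_path, "w", encoding="utf-8") as target:
        for line in source:
            target.write(to_tab_separated(line))


if __name__ == "__main__" and len(sys.argv) == 3:
    convert_file(sys.argv[1], sys.argv[2])
```

Saved under a name of your choice, it would be called with the raw file as first argument and the cleaned file as second argument.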

Maybe that can help to make your dictionary content better again?

I’m sure many, many users of your Android app, and especially Tolino users, would be really happy about great new Wiktionary-based dictionaries, plus a documented way to create their own!

rdoeffinger commented 2 years ago

If there is a kind of "community" or just some people who are willing to invest some effort, I will give this higher priority and try to look into it. I guess I am wondering if this tab delimited format is the most sensible one though. I mean, if you could choose any formatting that is simple enough to describe and parse, would you choose that? Because implementing a different format might not be more work than adjusting this one really (I would need to double-check that though). Also for example the dictionary format to my understanding (which I would need to check against the specification) can do

  - Heading (with optional link to HTML page)
  - text across the whole page
  - text with left side and right side

Possibly that could be used to turn the single-language dictionaries into something even I would agree is useful. For example instead of just the heading with the link to the HTML page have also 1-2 lines of description of the word visible right there on the main page. Of course this doesn't all HAVE to be designed up-front, but especially if not this might need a bit of longer-term engagement to get it nice and polished.

I admit it is far too technical at the moment, but maybe a look at https://github.com/rdoeffinger/Dictionary/blob/master/dictionary-format.txt would be useful as well. For example this part gives the different types of entries the dictionary consists of:

means:
  1: index into list_of([pair_entry])
  2: index into list_of([index_entry]) (mark as "main word header" entry)
  3: index into list_of([text_entry])
  4: index into list_of([index_entry]) (mark as "extra info/translation" entry)
  5: index into list_of([html_entry])/list_of([html_entry_data])

Your current idea would only allow you to create that last type of entry, which is the only one currently used by the single-language dictionaries. That to me doesn't feel like it's enough to get the most out of the app...

Btw I had a quick look into how single-language dictionaries are generated: you only specify a lang1 and no lang2. I don't think that will change that it's not possible with the input format you have.
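For readers who do not want to dig through the Java code, the five entry types quoted above can be written down as a small lookup. This is just an illustrative sketch: the numbers come from the quoted part of dictionary-format.txt, while the enum and its member names are invented here and do not exist in the Dictionary code.

```python
from enum import IntEnum


class RowType(IntEnum):
    """Entry types as numbered in the quoted excerpt (member names are hypothetical)."""
    PAIR = 1         # index into list_of([pair_entry])
    MAIN_WORD = 2    # index into list_of([index_entry]), "main word header" entry
    TEXT = 3         # index into list_of([text_entry])
    EXTRA_INFO = 4   # index into list_of([index_entry]), "extra info/translation" entry
    HTML = 5         # index into list_of([html_entry]) / list_of([html_entry_data])


def describe(type_code):
    """Translate a numeric type code into its (hypothetical) readable name."""
    return RowType(type_code).name


if __name__ == "__main__":
    for code in range(1, 6):
        print(code, describe(code))
```

In these terms, a dictionary built only from tab-separated HTML lines would consist of `HTML` entries and nothing else.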
Moonbase59 commented 2 years ago

That’d be great, thanks for considering! I do think there are actually two main types of usage:

  1. Users that only want to translate a few words in a foreign-language text (i.e., from current ebook language to their native tongue).
  2. Users that are native speakers (or rather fluent, or intense learners) that want "to know more" about a word/phrase (like etymology, longer explanation, pronunciation, hyphenation, probably sample sentences). These would use the more elaborate "single-language" dictionaries (as opposed to a shorter "translation-only" dictionary).

This may well be the reason for the Tolinos to have separate "Look up" and "Translate" functions—and I love it.

For creating "personal" dictionaries like this (maybe a technical dictionary), users might have different preferences on what they want to be included in the final output. I have long been thinking about some kind of templates for this, since Wiktionary is quite well-structured. One could think about creating "Wiktionary output templates" (per language, because Wiktionary structure differs so much), and thus, for instance, allow creators to include/exclude things like inflection, etymology, sample sentences. This would of course be some extra work, but might pay off. (Assuming you would want to document/open it more for creators.)

I should think both E-Reader Forum (German) and MobileRead (mainly English) might be good communities to find people willing to help. E-Reader forum seems the more active one, especially for Tolino.

And yes, you’re completely right—the "one line per entry, tab-separated" format was just a measure of last resort in my case, since I so much wanted "readable" EN and DE single-language dictionaries (and, well, also an EN-DE and DE-EN "translation" version, for the rare cases…).

For the input, I’d actually prefer Wiktionary XML dumps, plus kinda templates for output (predefined "standard" templates included that’d only show the important content, like maybe "Advanced Learner’s" level). Fortunately, most devices can render HTML (at least partially), so this could be easy. The "standard" output should be able to fit on a rather small device (smartphone) as well as larger devices (like 7" or 10" e-readers).

This would be for the "main" dictionaries (and I much like the way you currently work on the "complete sets"—it takes ages but produces rather nice lang1-lang2 dictionaries, and Wiktionary will evolve…).

For user-created content (like technical or neologism dictionaries), I’d prefer a simpler input format, maybe even down to "tab-separated", since these will most often be single-language anyway. I think.

rdoeffinger commented 2 years ago

I think that's already a bit too many options there. The "Wiktionary output templates" idea needs to be far more concrete and something doable with a few hours of effort from my side. For example I could probably add that the "HTML" content gets piped to some external program for formatting. But even if that means it can be a Python or shell script, it still needs someone to write that. I'm not sure whether anyone who can do that could not almost as well go to src/com/hughes/android/dictionary/parser/wiktionary/AbstractWiktionaryParser.java and modify the HTML text there like the replaceSuperscript function does, so does that win anything?

rdoeffinger commented 2 years ago

And I forgot, the feature of using anything but purely the HTML entries is even missing from the wiktionary parser. So that part is not even really solved. I mean it's nice to have a long-term vision (but maybe it should be a bit more concrete), but there's also the question what a realistic first step would be, just making sure it's one that leads in the right direction.

Moonbase59 commented 2 years ago

Yeah, but it takes a vision to get inspired and motivated, and first feasible steps to get going. Maybe we’re both right. ;-)

The work you’ve done already is greatly appreciated, and your willingness to possibly invest some more time to it is even more appreciated. Possibly even by many, many Tolino users that don’t even know it. ;-)

My thoughts on "steps":

I actually just created a post in E-Reader Forum, asking people for assistance. Let’s see what comes out of it.

And of course I’m willing to help, too, although I don’t speak Java. Maybe I could test, or suggest stuff, or whatever.

fizban99 commented 2 years ago

Not sure if the workaround works on the Tolino Vision 5, but on the Shine you can create a translation dictionary with the same language in lang1 and lang2 and, when long-tapping a word, select "Translate" instead of "Look up"...

tuxor1337 commented 1 year ago

FYI: I just added support for the quickdic format (version 6) in pyglossary: https://github.com/ilius/pyglossary/pull/509

rdoeffinger commented 1 year ago

Thanks, I am glad someone made use of the specification I wrote! I really disliked the old format for its use of Java serialization that made it very hard to use the format in other languages than Java. Luckily it was possible to reduce it to a small set of "magic bytes". You say your implementation differs from the specification to be compatible with tolino, how? The specification is meant to document the format needed for tolino, and the dictionaries I generated using that worked on mine (I think)? Also how did you solve the sorting issue? Using even a different version of libicu results in dictionaries that do not work properly, which is very problematic. If you only tested ASCII you might not have noticed.

tuxor1337 commented 1 year ago

I have a Tolino Shine 3. In the settings, I can download dictionaries from the "Tolino Cloud". This adds *.quickdic files to my file system in the folder .tolino/dictionaries/. I designed the pyglossary plugin to be compatible with these files both reading and writing.

After looking at your specification again, and at the Java code, I have to reformulate my original statement: "The implementation agrees with the format specification, but it might or might not differ from the Java reference implementation in some aspects. My priority was to be compatible with Tolino ebook readers. While tweaking my implementation, I didn't always check that everything I did agreed with the Java code."

These are the things that caused most headaches for me when implementing the plugin:

  1. Each index entry has a "start_index" and a "count" field that refer to blocks in the "rows" list. But the "count" is actually not the block size, but it is the block size minus one. Furthermore, the first "row" in the corresponding block always contains a reference back to the index entry. And the "type" of that first "row" is "1" or "3" depending on whether the index entry points to a "html" or "pair" type entry. I assume that this has to do with the "prunedRowIdx" in your implementation, but I didn't double-check, and I think it is not mentioned in the format specification.
  2. The entry "types" in the "rows" list range from 0 to 4. Your format specification mentions 1 to 5. Is it possible that this actually needs to be shifted by 1?
  3. The sort order of the tokens in the index: Only the normalized version of "tokens" is compared, the non-normalized version is ignored during sorting. After I noticed this, everything worked well for German, Italian, French, and Spanish. But for English, I needed to enforce the collation rule &z<ȝ to reproduce the sorting of the Tolino dictionaries.
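
The three observations above can be illustrated with a short sketch. It is not taken from the pyglossary plugin or from the Java code: all function names are made up, the normalizer is a rough stand-in for the German rules printed in the build log earlier in this thread, and the sort uses plain code-point comparison instead of the ICU collation that the real tools depend on (so words with umlauts will not land where ICU would put them).

```python
def normalize_de(token):
    """Rough stand-in for ':: Lower; ae > ä; oe > ö; ue > ü; ß > ss;' from the build log."""
    result = token.lower()
    for source, target in (("ae", "ä"), ("oe", "ö"), ("ue", "ü"), ("ß", "ss")):
        result = result.replace(source, target)
    return result


def sort_tokens(tokens):
    # Observation 3: only the normalized form is compared; the original spelling
    # is ignored. Python's sort is stable, so ties keep their input order.
    # Assumption: code-point order, NOT the ICU collation used by the real tools.
    return sorted(tokens, key=normalize_de)


def rows_for_index_entry(rows, start_index, count):
    # Observation 1: "count" is the block size minus one, so the block
    # belonging to an index entry spans count + 1 rows.
    return rows[start_index:start_index + count + 1]


def spec_type_from_file_type(file_type):
    # Observation 2 (unconfirmed in this thread): values 0..4 found in files
    # may correspond to 1..5 in the format specification.
    if not 0 <= file_type <= 4:
        raise ValueError(f"unexpected row type: {file_type}")
    return file_type + 1


if __name__ == "__main__":
    print(sort_tokens(["das", "aber", "Das", "auf", "Auf"]))
    print(rows_for_index_entry(list("abcdef"), 2, 2))
```

The English-specific collation rule mentioned in item 3 is deliberately left out, since reproducing it would need a real ICU binding.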