rdoeffinger / Dictionary

"QuickDic" offline Dictionary App for Android. Provided downloadable dictionaries are based on Wiktionaries but can also be created from other sources (see DictionaryPC). Remember to use --recursive when cloning! Fork of project that used to be hosted at code.google.com/p/quickdic-dictionary.
Apache License 2.0

Dictionaries contain too much Wiktionary code stuff, possible to filter better? #147

Open Moonbase59 opened 2 years ago

Moonbase59 commented 2 years ago

From https://github.com/rdoeffinger/Dictionary/releases/tag/v0.3-oldformat, I downloaded the dictionaries EN.quickdic, DE.quickdic and EN-DE.quickdic and installed these on my Tolino Vision 5 (firmware 15.2.0).

As you can see from the attached screenshots, looking up the word "character" in EN.quickdic produces 14 pages, badly formatted and with lots of Wiktionary code stuff; trying to translate "character" into German using EN-DE.quickdic produces 31+ pages, partly more useful, but still containing lots of Wiktionary code.

This bloats the output so much that your otherwise really nice dictionaries become nearly unusable, which is a shame.

Can I kindly suggest that you revise the Wiktionary code building process a little, so that it filters out all this (unwanted/unneeded) extra information, in order to arrive at a more usable output again?

Let me know if you need more information, or how I could possibly help – thanks!

EN.quickdic-character-screenshots.zip

EN-DE.quickdic-character-screenshots.zip

P.S.: I also believe that "Hyphenation" (page 3/14 of "character" lookup) should be a separate subsection (like "Pronunciation"), and the parts of the word would be better displayed like "char·ac·ter".
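To illustrate the hyphenation display, here is a minimal Python sketch. It assumes the raw entry carries Wiktionary's hyphenation template in the form `{{hyphenation|en|char|ac|ter}}` (or its short alias `{{hyph|...}}`). The function name `render_hyphenation` is made up for this example and is not part of the dictionary builder. Alternative hyphenations and other template variants are not handled.

```python
import re

# Matches {{hyphenation|en|char|ac|ter}} or the short alias {{hyph|...}}.
# Assumption: the template contains no nested templates.
HYPH_RE = re.compile(r"\{\{(?:hyphenation|hyph)\|([^{}]*)\}\}")


def render_hyphenation(wikitext: str) -> str:
    """Replace hyphenation templates with syllables joined by a middle dot."""

    def repl(match: re.Match) -> str:
        parts = match.group(1).split("|")
        # parts[0] is the language code; skip it, along with empty
        # parameters and named parameters such as "caption=...".
        syllables = [p for p in parts[1:] if p and "=" not in p]
        return "·".join(syllables)

    return HYPH_RE.sub(repl, wikitext)


print(render_hyphenation("Hyphenation: {{hyphenation|en|char|ac|ter}}"))
# prints "Hyphenation: char·ac·ter"
```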

P.P.S.: I think the output should look more like the Wiktionary page, leaving out the ToC and the non-English parts, probably even the audio player links for pronunciation. (Most e-readers either don’t have audio, might have no Internet connection, or simply won’t be able to switch to the browser and play audio, then return to the same page in the ebook.)

Efreak commented 1 year ago

Wikimedia's Enterprise HTML Dumps are generated monthly and contain rendered HTML (vs wikicode) of all pages on Wiktionary, separated by language. This would make parsing and filtering wikicode unnecessary; all you'd need to do is preprocess the HTML to remove things you don't want, and injecting a little CSS would probably be enough even for that.
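A rough sketch of what such preprocessing could look like, using only Python's standard-library `html.parser`. The `id`/class names used to spot the table of contents and audio widgets (`toc`, `audiotable`) are assumptions and would need checking against the actual dump markup. The names `TextFilter` and `visible_text` are made up for this example. The depth counting also assumes the HTML has balanced tags.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "dd", "dt", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}
DROP_TAGS = {"audio", "script", "style"}
DROP_CLASSES = {"toc", "audiotable"}  # assumed class names, not verified


class TextFilter(HTMLParser):
    """Collects visible text, skipping the subtrees of unwanted elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0  # > 0 while inside an unwanted subtree

    def _unwanted(self, tag, attrs):
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        return (tag in DROP_TAGS or attr.get("id") == "toc"
                or bool(classes & DROP_CLASSES))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:  # void elements never get an end tag
            if tag == "br" and not self.skip_depth:
                self.chunks.append("\n")
            return
        if self.skip_depth or self._unwanted(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")  # keep block elements on separate lines

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    parser = TextFilter()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


sample = ('<div id="toc"><ul><li>Contents</li></ul></div>'
          '<h2>English</h2><p>A <b>symbol</b>.<br>'
          '<audio src="x.ogg">play</audio></p>')
print(visible_text(sample))
```

For the sample above, this prints "English" and "A symbol." on separate lines; the table of contents and the audio element are dropped.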

woj-tek commented 1 year ago

Wikimedia's Enterprise HTML Dumps

This looks interesting but it would/could probably take more space than wikicode...