natowi / quickdic-dictionary.dictionarypc

Automatically exported from code.google.com/p/quickdic-dictionary.dictionarypc
Apache License 2.0

WARNING: Malformed line #1

Open christophlingg opened 3 years ago

christophlingg commented 3 years ago

Hello!

I have a Tolino and wanted to use dict.cc's Spanish dictionary on my device. I followed the instructions at https://github.com/Gitsaibot/Toligen and ran:

-agentlib:hprof=heap=sites,depth=20
ICU4J=/usr/share/java/icu4j-49.1.jar
test -r "$ICU4J" || ICU4J=/usr/share/icu4j-55/lib/icu4j.jar
XERCES=/usr/share/java/xercesImpl.jar
test -r "$XERCES" || XERCES=/usr/share/xerces-2/lib/xercesImpl.jar
COMMONS=/usr/share/java/commons-lang3.jar
test -r "$COMMONS" || COMMONS=/usr/share/commons-lang-3.3/lib/commons-lang.jar
COMMONS_COMPRESS=/usr/share/java/commons-compress-1.13.jar
JAVA=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
test -x "$JAVA" || JAVA=java
"$JAVA" -jar DictionaryBuilder.jar \
  --lang1="DE" \
  --lang2="ES" \
  --dictInfo="dictcc-based DE-ES" \
  --lang1Stoplist=data/inputs/stoplists/es.txt \
  --input1=data/inputs/dictcc.txt \
  --input1Name=dictcc \
  --dictOut=data/outputs/DE-ES_dictcc.quickdic \
  --input1Charset=UTF8 \
  --input1Format=tab_separated

Many valuable entries are skipped with this error message:

WARNING: Malformed line: Atomphysik {f} física {f} atómica noun [phys.]

This may be related to this line of code: https://github.com/natowi/quickdic-dictionary.dictionarypc/blob/master/src/com/hughes/android/dictionary/parser/DictFileParser.java#L117
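A possible workaround, sketched here in Python, is to trim the dict.cc export before handing it to DictionaryBuilder. This assumes that the parser rejects the line because of the extra tab-separated columns (word class and subject tag) and that the first two columns hold the source and target entries; neither assumption has been verified against DictFileParser.java. The function name `keep_two_columns` is made up for illustration.

```python
def keep_two_columns(lines):
    """Yield 'source<TAB>target' for each usable line of a dict.cc export.

    Assumption: columns after the second one (e.g. 'noun', '[phys.]')
    are what the tab_separated parser trips over, so they are dropped.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and '#' comment lines (assumed export header).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A line without at least two columns cannot be an entry pair.
        if len(fields) < 2:
            continue
        yield fields[0] + "\t" + fields[1]


# In-memory sample shaped like the rejected line from the warning.
sample = [
    "# hypothetical header line\n",
    "Atomphysik {f}\tf\u00edsica {f} at\u00f3mica\tnoun\t[phys.]\n",
]
for entry in keep_two_columns(sample):
    print(entry)
```

To apply it to a real file, open the export with `encoding="utf-8"`, pass the file object to `keep_two_columns`, and write each yielded string plus a newline to the file named in `--input1`.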

Is the project still active, and are there plans to fix bugs like this?

I was considering writing a Python version of it, but unfortunately I could not find any specification of the quickdic format. Is it documented somewhere, @natowi?

rdoeffinger commented 3 years ago

Btw, I have documented the old Tolino format here: https://github.com/rdoeffinger/Dictionary/blob/master/dictionary-format-v6.txt But since DictionaryBuilder can now be compiled into proper Windows and Linux binaries thanks to native-image, it might be overkill to write a Python version. Especially as it is all very sensitive to word ordering, collation rules, etc., so you would need to make sure to use the proper version of ICU for all that, which sounds rather painful to do in Python.
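To illustrate why the ordering matters, here is a small Python sketch comparing Python's default code-point sort with a crude accent- and case-insensitive key. The `rough_key` helper is hypothetical and is not what ICU does; it only shows that two plausible orderings already disagree on a handful of German words, so a reimplementation would have to reproduce the exact collation the builder uses.

```python
import unicodedata


def rough_key(word):
    """Crude stand-in for a collation key (NOT ICU-compatible)."""
    # Decompose accented letters, drop the combining marks, ignore case.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


# Escapes keep the umlauts precomposed regardless of editor settings.
words = ["Zebra", "\u00c4rger", "Apfel", "\u00e4rgern"]

by_code_point = sorted(words)                # plain str comparison
by_rough_key = sorted(words, key=rough_key)  # accent/case-insensitive

print(by_code_point)
print(by_rough_key)
```

With the code-point sort the umlaut words land after "Zebra", while the rough key places them next to "Apfel".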