ilius / pyglossary

A tool for converting dictionary files aka glossaries. Mainly to help use our offline glossaries in any Open Source dictionary we like on any modern operating system / device.
GNU General Public License v3.0

Error reading MDict file zhwiki*.mdx: Error 1 #52

Closed jacksonsz closed 10 months ago

jacksonsz commented 8 years ago
- Adding body data.
- Preparing index data.
*** Error: Parse failure [҉    188845875   0   ҉          ].
normalize_key_text aborted.
Error.
make: *** [all] Error 1

Chinese Wikipedia 20160501 & Chinese Wikisource 20160501 http://www.pdawiki.com/forum/forum.php?mod=viewthread&tid=13197&highlight=%CE%AC%BB%F9

Baidu network disk http://pan.baidu.com/s/1boZRhPt Password: kkxd

ilius commented 4 years ago

Please try again with the latest code. Also ensure python-lzo is installed before conversion: sudo pip3 install python-lzo

If it still does not work, please paste PyGlossary's output.
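A quick way to confirm that the LZO binding is visible to the interpreter that runs PyGlossary is sketched below. It assumes python-lzo installs a top-level module named `lzo`; the helper name `has_module` is made up for illustration and is not part of PyGlossary.

```python
import importlib.util
import sys


def has_module(name: str) -> bool:
    """Return True if `name` can be imported by this interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Run this with the same python3 that launches PyGlossary,
    # since pip3 may have installed the package for a different one.
    print("interpreter:", sys.executable)
    print("lzo available:", has_module("lzo"))
```

If this reports that `lzo` is not available, the package was probably installed for a different Python than the one being used for the conversion.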

jacksonsz commented 3 years ago

https://www.pdawiki.com/forum/forum.php?mod=viewthread&tid=22626&extra=page%3D1 [2020/10/11] Chinese Wikipedia 20201001, text-only official release

Baidu network disk: https://pan.baidu.com/s/17dGlCIoaFyoThUBgV9lXOQ Extraction code: rhx4

greydeMacBook-Pro:~ grey$ /Users/grey/Downloads/pyglossary4/main.py --write-format=AppleDict "/Users/grey/Downloads/dict/zhwiki20201001.mdx" zhwiki20201001
[WARNING] unknown config key 'noProgressBar', you may edit /Users/grey/Library/Preferences/PyGlossary/config.json file and remove this key
[WARNING] unknown config key 'ui_autoSetOutputFileName', you may edit /Users/grey/Library/Preferences/PyGlossary/config.json file and remove this key
LZO compression support is not available
[INFO] Found 1 mdd files with 111691 entries
[INFO] extracting links...
[INFO] extracting links done, sizeof(linksDict)=64
[INFO] wordCount = 2114739
[INFO] Invalid language code/name 'zhwiki' in match=('zhwiki', '-', '20201001')
[INFO] Failed to detect sourceLang and targetLang from glossary name 'zhwiki-20201001'

[INFO] Writing to AppleDict file '/Users/grey/zhwiki20201001'
[INFO] Using Reader class from OctopusMdict plugin for direct conversion without loading into memory
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/bs4/__init__.py:332: MarkupResemblesLocatorWarning: "///" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.
  warnings.warn(
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/bs4/__init__.py:332: MarkupResemblesLocatorWarning: "BIN" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.
  warnings.warn(
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/bs4/__init__.py:332: MarkupResemblesLocatorWarning: "/DEV" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.
  warnings.warn(
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/bs4/__init__.py:332: MarkupResemblesLocatorWarning: "MUSIC" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.
  warnings.warn(
Converting | |█████████████████████████████████████████████|%100.0 Time: 9:58:07

[INFO] Writing file '/Users/grey/zhwiki20201001' done.
[INFO] Running time of convert: 36018.9 seconds
greydeMacBook-Pro:~ grey$ pwd
/Users/grey
greydeMacBook-Pro:~ grey$ cd zhwiki20201001
greydeMacBook-Pro:zhwiki20201001 grey$ make
"""/Users/grey/Developer/Extras/Dictionary Development Kit"/bin"/build_dict.sh"  "zhwiki20201001" "zhwiki20201001.xml" "zhwiki20201001.css" "zhwiki20201001.plist"
- Building zhwiki20201001.dictionary.
- Checking source.
zhwiki20201001.xml:355: namespace error : Failed to parse QName 'text-align:'
t;瑪德琳·史旺博士<br>Dr. Madeleine Swann<td style=" text-align:
                                                                               ^
zhwiki20201001.xml:355: namespace error : Failed to parse QName 'text-align:'
haw<td><span class='no-key'>Q</span><td style=" text-align:
                                                                               ^
zhwiki20201001.xml:355: namespace error : Failed to parse QName 'text-align:'
;br>Naomie Harris<td><br>Eve Moneypenny<td style=" text-align:
                                                                               ^
zhwiki20201001.xml:355: namespace error : Failed to parse QName 'text-align:'
hristoph Waltz<td><br>Ernst Stavro Blofeld<td style=" text-align:
                                                                               ^
zhwiki20201001.xml:355: namespace error : Failed to parse QName 'text-align:'
;<br>Rory Kinnear<td><br>Bill Tanner<td style=" text-align:
                                                                               ^
zhwiki20201001.xml:355: namespace error : Failed to parse QName 'text-align:'
>瓦爾多·奧布魯切夫<br>Valdo Obruchev<td style=" text-align:
                                                                               ^
zhwiki20201001.xml:355: namespace error : Failed to parse QName 'bword:'
?</span><br/>Billy Magnussen</td><td>羅根·艾許<br/>Logan Ash</td><td bword:
                                                                               ^
zhwiki20201001.xml:19059: namespace error : Namespace prefix text-align for center on th is not defined
1&lt;/a&gt;]&lt;/sup&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:19059: namespace error : Failed to parse QName 'background:'
key">良十二世</span>(1823年至1829年)</b></td></tr><tr><th background:
                                                                               ^
zhwiki20201001.xml:19059: namespace error : Namespace prefix text-align for center on th is not defined
"" style="background:#f0f0f0;''&gt;否決權&lt;td  style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:19182: namespace error : Namespace prefix text-align for center on th is not defined
1&lt;/a&gt;]&lt;/sup&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:19182: namespace error : Failed to parse QName 'background:'
key">庇護八世</span>(1829年至1830年)</b></td></tr><tr><th background:
                                                                               ^
zhwiki20201001.xml:19182: namespace error : Namespace prefix text-align for center on th is not defined
"" style="background:#f0f0f0;''&gt;否決權&lt;td  style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:19230: namespace error : Namespace prefix text-align for center on th is not defined
1&lt;/a&gt;]&lt;/sup&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:19230: namespace error : Failed to parse QName 'background:'
??我略十六世</span>(1831年至1846年)</b></td></tr><tr><th background:
                                                                               ^
zhwiki20201001.xml:19230: namespace error : Namespace prefix text-align for center on th is not defined
"" style="background:#f0f0f0;''&gt;否決權&lt;td  style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:19602: namespace error : Namespace prefix text-align for center on th is not defined
1&lt;/a&gt;]&lt;/sup&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:19602: namespace error : Failed to parse QName 'background:'
key">庇護九世</span>(1846年至1878年)</b></td></tr><tr><th background:
                                                                               ^
zhwiki20201001.xml:19602: namespace error : Namespace prefix text-align for center on th is not defined
"" style="background:#f0f0f0;''&gt;否決權&lt;td  style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:20571: namespace error : Namespace prefix text-align for center on th is not defined
1&lt;/a&gt;]&lt;/sup&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:20571: namespace error : Failed to parse QName 'background:'
enter;"> <b>良十三世(1878年至1903年)</b></td></tr><tr><th background:
                                                                               ^
zhwiki20201001.xml:20571: namespace error : Namespace prefix text-align for center on th is not defined
"" style="background:#f0f0f0;''&gt;否決權&lt;td  style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:22845: namespace error : Namespace prefix text-align for center on th is not defined
??舉人分布&lt;/b&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:22845: namespace error : Failed to parse QName 'background:'
enter;"> <b>庇護十世(1903年至1914年)</b></td></tr><tr><th background:
                                                                               ^
zhwiki20201001.xml:22845: namespace error : Namespace prefix text-align for center on th is not defined
"" style="background:#f0f0f0;''&gt;否決權&lt;td  style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:24981: namespace error : Namespace prefix text-align for center on th is not defined
??舉人分布&lt;/b&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:26640: namespace error : Namespace prefix text-align for center on th is not defined
??舉人分布&lt;/b&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:27468: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:27510: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:29646: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:29694: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:30543: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:30645: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:31725: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:31779: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:33270: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:33333: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:34125: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:34212: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:35112: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:35163: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:38484: namespace error : Namespace prefix text-align for center on th is not defined
??舉人分布&lt;/b&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:39030: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:39114: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:40638: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:40677: namespace error : Failed to parse QName 'background:'
sup><a name="R1"></a>[<a href="bword://#1">1</a>]</sup></td></tr><tr background:
                                                                               ^
zhwiki20201001.xml:42126: namespace error : Namespace prefix border-collapse for collapse on table is not defined
e:collapse="" cellpadding="3" cellspacing="0" style="font-size:90%;" width="80%"
                                                                               ^
zhwiki20201001.xml:42126: namespace error : Namespace prefix border-collapse for collapse on table is not defined
e:collapse="" cellpadding="3" cellspacing="0" style="font-size:90%;" width="70%"
                                                                               ^
zhwiki20201001.xml:57244: namespace error : Namespace prefix text-align for center on th is not defined
??舉人分布&lt;/b&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:61588: namespace error : Namespace prefix text-align for center on th is not defined
??舉人分布&lt;/b&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:92475: namespace error : Namespace prefix text-align for center on th is not defined
??舉人分布&lt;/b&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:92484: namespace error : Namespace prefix text-align for center on th is not defined
??舉人分布&lt;/b&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style=" text-align:center=""
                                                                               ^
zhwiki20201001.xml:117432: namespace error : Namespace prefix text-align for left on tr is not defined
tyle="text-align:center; background:=FF8888&gt;&lt;td style=" text-align:left=""
                                                                               ^
zhwiki20201001.xml:117432: namespace error : Namespace prefix text-align for left on tr is not defined
tyle="text-align:center; background:=FF8888&gt;&lt;td style=" text-align:left=""
                                                                               ^
zhwiki20201001.xml:136621: parser error : error parsing attribute name
><table cellpadding="0" cellspacing="0" width="100%"><tr><td width="60%"><table 
                                                                               ^
zhwiki20201001.xml:136621: parser error : attributes construct error
><table cellpadding="0" cellspacing="0" width="100%"><tr><td width="60%"><table 
                                                                               ^
zhwiki20201001.xml:136621: parser error : Couldn't find end of Start Tag table
><table cellpadding="0" cellspacing="0" width="100%"><tr><td width="60%"><table 
                                                                               ^
zhwiki20201001.xml : failed to parse
Error.
make: *** [all] Error 1
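
The namespace errors in the log above appear to come from attribute names that contain a colon, such as `text-align:center`: an XML parser reads the part before the colon as a namespace prefix, and that prefix is never declared. A minimal sketch with Python's standard `xml.etree.ElementTree` (not the Dictionary Development Kit's own checker) reproduces this class of error on a tiny fragment. `is_well_formed` is a hypothetical helper, and the fragments are simplified stand-ins for the real entries.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if `fragment` parses as a standalone XML element."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# A normal style attribute: the colon is inside the attribute *value*.
good = '<td style="text-align:center">x</td>'
# Here the colon is inside the attribute *name*, so the parser treats
# "text-align" as an undeclared namespace prefix and rejects the element.
bad = '<td text-align:center="">x</td>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

Running a check like this over each entry before `make` could help locate the source articles whose HTML was mangled, though it would not repair them.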

realtmxi commented 10 months ago

I'm hitting the same problem: I installed python-lzo, but I still cannot read the mdx file.

ERROR:pyglossary: Traceback (most recent call last):
  File "/Users/tianmuxin/Downloads/pyglossary/pyglossary/glossary_v2.py", line 648, in _openReader
    openResult = reader.open(filename)
  File "/Users/tianmuxin/Downloads/pyglossary/pyglossary/plugins/octopus_mdict_new/__init__.py", line 101, in open
    self._mdx = MDX(filename, self._encoding, self._substyle)
  File "/Users/tianmuxin/Downloads/pyglossary/pyglossary/plugin_lib/readmdict.py", line 687, in __init__
    MDict.__init__(self, fname, encoding, passcode)
  File "/Users/tianmuxin/Downloads/pyglossary/pyglossary/plugin_lib/readmdict.py", line 129, in __init__
    self._key_list = self._read_keys()
  File "/Users/tianmuxin/Downloads/pyglossary/pyglossary/plugin_lib/readmdict.py", line 410, in _read_keys
    return self._read_keys_v1v2()
  File "/Users/tianmuxin/Downloads/pyglossary/pyglossary/plugin_lib/readmdict.py", line 490, in _read_keys_v1v2
    key_block_info_list = self._decode_key_block_info(key_block_info)
  File "/Users/tianmuxin/Downloads/pyglossary/pyglossary/plugin_lib/readmdict.py", line 233, in _decode_key_block_info
    key_block_info = zlib.decompress(key_block_info_compressed[8:])
zlib.error: Error -3 while decompressing data: incorrect header check
[CRITICAL] Reading file 'LDOCE5.mdx' failed.
CRITICAL:pyglossary:Reading file 'LDOCE5.mdx' failed.
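
The traceback ends with `zlib.decompress` rejecting the bytes it was handed. As a rough diagnostic, the sketch below checks whether a byte string starts with a plausible zlib header (the two-byte check from RFC 1950) before attempting to decompress it. `looks_like_zlib` and `try_decompress` are illustrative helpers, not part of PyGlossary or readmdict, and the sample bytes are invented. Why the block in this particular file is not valid zlib data is not established here.

```python
import zlib


def looks_like_zlib(data: bytes) -> bool:
    """Check the two-byte zlib header described in RFC 1950."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    # The low nibble of CMF must be 8 (deflate), and CMF*256 + FLG
    # must be a multiple of 31.
    return (cmf & 0x0F) == 8 and (cmf * 256 + flg) % 31 == 0


def try_decompress(data: bytes):
    """Return the decompressed bytes, or None if zlib rejects the input."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return None


sample = zlib.compress(b"key block info")  # valid zlib stream
garbage = b"\x00\x01garbage"               # invented non-zlib bytes

print(looks_like_zlib(sample), try_decompress(sample))
print(looks_like_zlib(garbage), try_decompress(garbage))
```

A header check like this only tells you that the input is not a zlib stream; it cannot say whether the data is encrypted, uses another compression scheme, or is simply corrupted.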

ilius commented 10 months ago

@jacksonsz For Wikipedia, you should use the wiktextract format: https://kaikki.org/dictionary/index.html

@realtmxi This doesn't look like the same problem. Please open a new issue and attach your mdx file to it.