librehat / kdictionary-lingoes

A Lingoes dictionary file (LD2/LDX) reader/extractor. Written in C++ with Qt
GNU General Public License v3.0

Missing xml tags for some library #2

Closed: yangchenyun closed this issue 9 years ago

yangchenyun commented 10 years ago

With this dictionary, the converted file is missing all the XML tags (used by Lingoes and other clients to render styles).


"File: /private/tmp/Oxford Advanced Learner's English-Chinese Dictionary.ld2"
"Type: LD2"
"Version: 2.4"
"ID: 0x4E443A5BC0DB0E12"
"Summary Addr: 4170"
"Summary Type: 3"
"Dictionary Type: 0x3"
"Index Numbers: 20585"
"Index Address/Size: 0x418C / 82340B"
"Compressed Data Address/Size: 0x199A0 / 9529815B"
"Phrases Index Address/Size(Decompressed): 0x0 / 214710B"
"Phrases Address/Size(Decompressed): 0x346B6 / 158601B"
"XML Address/Size(Decompressed): 0x5B23F / 23100088"
"File Size(Decompressed): 22923KB"
"Decompressing 1433 data streams."
"Phrases Encoding: UTF-8"
"XML Encoding: UTF-8"
"Extracted 21470 entries."

For example, the "asset" definition is converted to plain text with no markup:

#line 1167
asset=/ ˈæset; ˋæsɛt/ n  ~ (to sb/sth) (a) valuable or useful quality or skill 有价值的或有用的特性或技能: Good health is a great asset. 健康就是莫大的财富. (b) valuable or useful person 有价值的或有用的人: He's an enormous asset to the team. 他是队里的骨干.  (usu pl 通常作复数) thing, esp property, owned by a person, company, etc that has value and can be used or sold to pay debts (属於个人或公司所有, 可用以抵偿债务或变卖後支付债务的)财产, 资产: His assets included shares in the company and a house in London. 他的财产包括公司的股票和位於伦敦的房子. Cf 参看 liability. # `asset-stripping n [U] (commerce 商) practice of buying at a cheap price a company with financial difficulties and then selling its assets individually to make a profit 资产倒卖(廉价收买经济上有困难的公司, 然後将其资产逐一变卖获利的做法).

But in Lingoes and Eudic, it is rendered with styles.

[Screenshot, 2014-09-25 6:22:17 pm]

Another dictionary that has this problem is Longman.

This time, some XML tags are kept, but part of the definition is missing for some words.

#line 2282
asset=as<b>·</b>set<br/><b>W2S2</b> <font face="Lingoes Unicode" color="#009900">/ˈæset/</font> <i>n</i>  <font color="#009999">[C]</font> <br/>[<font color="#F14B35">Date:</font> 1800-1900; <font color="#F14B35">Origin:</font> assets (singular) <i>'enough money to pay debts'</i>  (16-19 centuries), from <i>Anglo-French</i>  asetz, from <i>Old French</i>  assez <i>'enough'</i>]<br/>

[Screenshot, 2014-09-25 6:25:54 pm]

librehat commented 10 years ago

This is because the application tries to strip those tags.

I'll add a command-line argument to control this.
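
For anyone who wants to try this before it lands, a switch of that kind could look roughly like the sketch below. It is written in plain standard C++ rather than Qt, and the flag name `--keep-tags`, the `Options` struct, and `parseOptions` are all made up for illustration; none of them come from the project.

```cpp
#include <string>
#include <vector>

// Hypothetical options for the extractor; not the project's real types.
struct Options {
    bool keepTags = false;   // when true, leave XML tags in the output
    std::string inputPath;   // first non-flag argument
};

// Scan the arguments once: "--keep-tags" turns stripping off,
// and the first remaining argument is taken as the input file.
Options parseOptions(const std::vector<std::string>& args) {
    Options opts;
    for (const std::string& arg : args) {
        if (arg == "--keep-tags") {
            opts.keepTags = true;
        } else if (opts.inputPath.empty()) {
            opts.inputPath = arg;
        }
    }
    return opts;
}
```

In `main`, the arguments would be passed as `std::vector<std::string>(argv + 1, argv + argc)`. In a Qt 5.2+ build, `QCommandLineParser` would probably be the more natural way to do the same thing.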

yangchenyun commented 10 years ago

Is this switch already built in? If not, could you point me to the code that trims those tags, so I can recompile and use it without trimming?

librehat commented 10 years ago

Line 251. The strip() function does the trimming.
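
The body of strip() is not quoted in this thread, so the following is only a guess at the general shape of such a function, with a bypass added. It is plain standard C++, and `stripTags` and its `keepTags` parameter are hypothetical names, not the project's code.

```cpp
#include <string>

// Hypothetical stand-in for a tag-stripping routine.
// Removes every "<...>" span; with keepTags == true the input is returned untouched.
std::string stripTags(const std::string& xml, bool keepTags = false) {
    if (keepTags) {
        return xml;
    }
    std::string out;
    out.reserve(xml.size());
    std::size_t pos = 0;
    while (pos < xml.size()) {
        std::size_t open = xml.find('<', pos);
        if (open == std::string::npos) {
            out.append(xml, pos, std::string::npos);  // no more tags
            break;
        }
        out.append(xml, pos, open - pos);             // text before the tag
        std::size_t close = xml.find('>', open + 1);
        if (close == std::string::npos) {
            out.append(xml, open, std::string::npos); // unmatched '<': keep as text
            break;
        }
        pos = close + 1;                              // skip the tag itself
    }
    return out;
}
```

Recompiling "in a non-trimming way" then amounts to calling the routine with `keepTags` set to true, or skipping the call altogether.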

Sorry I'm too busy these days to do the modification.

yangchenyun commented 9 years ago

Wow, I did some hacking on the trimming as well but forgot to send the pull request. Thanks for that.