schierlm / BibleMultiConverter

Converter written in Java to convert between different Bible program formats

Improve Raw HTML support in MyBibleZone importer #25

Closed ickc closed 5 years ago

ickc commented 5 years ago

Example (ESVGSB.SQLite3, downloaded from the link you provided):

$ java -jar BibleMultiConverter-AllInOneEdition.jar MyBibleZone "ESVGSB.SQLite3" LogosHTML "ESVGSB.html"
WARNING: Skipping malformed metadata property html_style
WARNING: Unsupported HTML entity in  
WARNING: Unsupported HTML entity in The English Standard Version (ESV) stands in the classic mainstream of English Bible translations over the past half-millennium. The fountainhead of that stream was William Tyndale's New Testament of 1526; marking its course were the King James Version of 1611 (KJV), the English Revised Version of 1885 (RV), the American Standard Version of 1901 (ASV), and the Revised Standard Version of 1952 and 1971 (RSV). In that stream, faithfulness to the text and vigorous pursuit of accuracy were combined with simplicity, beauty, and dignity of expression. Our goal has been to carry forward this legacy for a new century.
WARNING: Unsupported HTML entity in ⇐
...

The resulting HTML is malformed. I found that a number of closing HTML comment markers are wrong (--&gt; instead of -->), which can be fixed with sed 's/--&gt;/-->/g'.

Also, there are a bunch of warnings about unsupported HTML entities, such as ⇐ and other similar variants.

ickc commented 5 years ago

I also found that for an unsupported HTML entity, the converter escapes its leading & to a literal &amp;. But since the output file is HTML (at least in the case of LogosHTML), such entities should be left untouched. A temporary fix is to undo this replacement: sed -i 's/&amp;/\&/g'

schierlm commented 5 years ago

I guess you are seeing now what problems you face if you have embedded HTML in modules (like MyBibleZone ones). Either you allow for Raw HTML (and then you may get malformed HTML in the output) or you don't (and then when there is HTML that cannot be parsed, you lose information in case the destination format also allows for Raw HTML).

The current decision I took for MyBibleZone modules is: Inside of footnotes and introduction texts, raw HTML is allowed, while inside of verses all raw HTML gets stripped/replaced.

But I agree that the handling of entities can be improved (and unsupported entities should probably become Raw HTML even if they are in verses).
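As a rough illustration of that idea (not the converter's actual code), the sketch below decodes the entities it knows and leaves every other entity verbatim so that it can be passed on as Raw HTML instead of having its & escaped. The class name EntitySketch, the method decodeKnownEntities, and the tiny entity table are all hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of BibleMultiConverter.
final class EntitySketch {
    // Deliberately tiny table; a real importer would know many more names.
    private static final Map<String, String> KNOWN = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "nbsp", "\u00A0");

    // Matches decimal (&#123;), hexadecimal (&#x1F;) and named (&name;) entities.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    /** Decodes known entities; unknown ones are kept exactly as written. */
    static String decodeKnownEntities(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement = m.group(); // default: keep the entity verbatim
            if (body.charAt(0) == '#') {
                boolean hex = body.length() > 1
                        && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
                int codePoint = hex
                        ? Integer.parseInt(body.substring(2), 16)
                        : Integer.parseInt(body.substring(1));
                if (Character.isValidCodePoint(codePoint)) {
                    replacement = new String(Character.toChars(codePoint));
                }
            } else if (KNOWN.containsKey(body)) {
                replacement = KNOWN.get(body);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this approach an input such as &lArr; survives unchanged, which is what an HTML-based export format needs, while numeric references and the few known names become plain characters.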

I will also have a look if I can sanitize the Raw HTML better so that no unbalanced tags can sneak through. And probably convert more raw HTML to formatting tags (e.g. <strong> to <b>) to reduce the need for Raw HTML.
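A minimal sketch of both ideas, under the assumption that a simple regex scan is good enough for the snippets involved: normalizeTags rewrites <strong> to <b>, and isBalanced reports whether every opening tag has a matching close. The names RawHtmlSketch, normalizeTags, and isBalanced are made up for this example, and the scan does not understand comments, CDATA, or a > inside attribute values.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of BibleMultiConverter.
final class RawHtmlSketch {
    // Group 1: "/" for closing tags, group 2: tag name, group 3: "/" if self-closing.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)\\b[^>]*?(/?)>");

    // Elements that never take a closing tag (short illustrative list).
    private static final Set<String> VOID = Set.of("br", "hr", "img");

    /** Rewrites <strong>/</strong> to <b>/</b> so less markup needs to stay raw. */
    static String normalizeTags(String html) {
        return html.replaceAll("(?i)<(/?)strong>", "<$1b>");
    }

    /** Returns true if opening and closing tags are properly nested. */
    static boolean isBalanced(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty() || VOID.contains(name);
            if (closing) {
                // A close must match the most recently opened tag.
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;
                }
            } else if (!selfClosing) {
                open.push(name);
            }
        }
        return open.isEmpty();
    }
}
```

A snippet that fails isBalanced could then be stripped or escaped rather than being passed through as Raw HTML.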

For the record, the StrippedDiffable export format has an option to strip Raw HTML. That way, you will be guaranteed to not get any malformed HTML tags in your export, while losing some formatting in your footnotes/introductions.

schierlm commented 5 years ago

The malformed HTML is caused by incorrectly parsing embedded HTML like

<!--<img src="../Images/map_01_01.jpg" alt="The Near East at the Time of Genesis"/>-->

where the end of the HTML is found at the first > instead of the second one. The extra > is then inserted as part of the text (not as Raw HTML), and the LogosHTML exporter then replaces it with &gt;.
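To make the difference concrete, here is a small sketch (an illustration only, not the importer's real parsing code) of a scan that treats a comment as ending at --> rather than at the first >. The class TagEndSketch and the method findMarkupEnd are hypothetical names, and the non-comment branch still assumes that attribute values contain no >.

```java
// Hypothetical helper, not part of BibleMultiConverter.
final class TagEndSketch {
    /**
     * Returns the index just past the markup construct that starts at
     * {@code start} (expected to point at a '<'), or -1 if it never ends.
     */
    static int findMarkupEnd(String html, int start) {
        if (html.startsWith("<!--", start)) {
            // A comment ends at the first "-->" after the opener,
            // even if it contains complete tags with their own '>'.
            int close = html.indexOf("-->", start + 4);
            return close < 0 ? -1 : close + 3;
        }
        // Ordinary tag: ends at the next '>'.
        int close = html.indexOf('>', start);
        return close < 0 ? -1 : close + 1;
    }
}
```

For the commented-out <img> example above, this returns the position after the final >, so no stray > is left over to end up in the text.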