ickc closed this issue 5 years ago
I also found that an unsupported HTML entity gets its ampersand changed to a literal `&amp;`. But since the output file is HTML (at least in the case of LogosHTML), such entities should be left untouched. A temporary fix is to undo this `&amp;` replacement: `sed -i 's/&amp;/\&/g'`
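The `sed` workaround above rewrites every `&amp;` in the file, including ones that stand for a real ampersand in the text. A narrower variant is sketched below in Python, assuming that the unwanted escapes always look like `&amp;` followed directly by an entity name or numeric reference and a semicolon. The function name `restore_escaped_entities` is made up for illustration and is not part of the converter.

```python
import re

# Matches an escaped entity reference such as "&amp;lArr;", "&amp;#8656;"
# or "&amp;#x21D0;". A bare "&amp;" (a real ampersand) does not match.
_ESCAPED_ENTITY = re.compile(
    r"&amp;(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);"
)


def restore_escaped_entities(html_text: str) -> str:
    """Turn '&amp;name;' back into '&name;' and leave other '&amp;' alone."""
    return _ESCAPED_ENTITY.sub(r"&\1;", html_text)


if __name__ == "__main__":
    sample = "Ruth &amp; Boaz &amp;lArr; see map"
    print(restore_escaped_entities(sample))
```

Text that deliberately shows an escaped entity (for example `&amp;lt;` meant to display as `&lt;`) would also be rewritten, so this is only safe when the module contains no such text.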
I guess you are seeing now what problems you face if you have embedded HTML in modules (like MyBibleZone ones). Either you allow for Raw HTML (and then you may get malformed HTML in the output) or you don't (and then when there is HTML that cannot be parsed, you lose information in case the destination format also allows for Raw HTML).
My current decision for MyBibleZone modules is: inside footnotes and introduction texts, raw HTML is allowed, while inside verses all raw HTML gets stripped or replaced.
But I agree that the handling of entities can be improved (and unsupported entities should probably become Raw HTML even if they are in verses).
I will also have a look at whether I can sanitize the Raw HTML better so that no unbalanced tags can sneak through. And probably convert more raw HTML to formatting tags (e.g. `<strong>` to `<b>`) to reduce the need for Raw HTML.
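The two ideas in this comment (rejecting unbalanced tags, and mapping tags such as `<strong>` to `<b>`) can be sketched with Python's standard `html.parser`. This is only an illustration of the idea, not the converter's actual code: `BalanceChecker`, `is_balanced`, `normalize_tags`, the void-tag list and the alias table are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, assumed for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "wbr"}

# Hypothetical alias table; the thread only mentions <strong> -> <b>.
TAG_ALIASES = {"strong": "b"}

_ALIAS_TAG = re.compile(r"<(/?)(" + "|".join(TAG_ALIASES) + r")\b", re.IGNORECASE)


class BalanceChecker(HTMLParser):
    """Tracks open tags on a stack and flags any mismatched closing tag."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if not self.stack or self.stack.pop() != tag:
            self.balanced = False


def is_balanced(fragment: str) -> bool:
    """Return True if every non-void tag in the fragment is properly closed."""
    checker = BalanceChecker()
    checker.feed(fragment)
    checker.close()
    return checker.balanced and not checker.stack


def normalize_tags(fragment: str) -> str:
    """Rename aliased tags (e.g. <strong> to <b>), keeping attributes as-is."""
    return _ALIAS_TAG.sub(
        lambda m: "<" + m.group(1) + TAG_ALIASES[m.group(2).lower()], fragment
    )
```

A fragment that fails `is_balanced` could then be stripped or escaped instead of being passed through as Raw HTML. Which of the two is preferable is a policy choice this sketch does not make.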
For the record, the `StrippedDiffable` export format has an option to strip Raw HTML. That way, you are guaranteed not to get any malformed HTML tags in your export, at the cost of losing some formatting in your footnotes/introductions.
The malformed HTML is caused by incorrectly parsing embedded HTML like
<!--<img src="../Images/map_01_01.jpg" alt="The Near East at the Time of Genesis"/>-->
where the end of the HTML is found at the first `>` instead of the second one. The extra `>` is then inserted as part of the text (not as Raw HTML), and the LogosHTML exporter then replaces it with `&gt;`.
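To illustrate the difference between stopping at the first `>` and stopping at the end of the comment, here is a minimal Python sketch. It assumes the scanner already knows where a piece of markup starts. The function name `find_markup_end` is hypothetical and does not come from the converter's source.

```python
def find_markup_end(text: str, start: int) -> int:
    """Return the index just past the markup that begins at text[start].

    Comments end at '-->', everything else at the first '>'.
    Returns -1 if the markup is never terminated.
    """
    if text.startswith("<!--", start):
        end = text.find("-->", start + 4)
        return -1 if end == -1 else end + 3
    end = text.find(">", start)
    return -1 if end == -1 else end + 1


if __name__ == "__main__":
    sample = (
        '<!--<img src="../Images/map_01_01.jpg" '
        'alt="The Near East at the Time of Genesis"/>--> In the beginning'
    )
    naive_end = sample.find(">") + 1      # stops inside the comment
    aware_end = find_markup_end(sample, 0)
    print(repr(sample[naive_end:]))       # leftover starts with '-->'
    print(repr(sample[aware_end:]))       # leftover is only the verse text
```

The sketch does not handle a `>` inside a quoted attribute value of an ordinary tag; that would need a real tokenizer.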
Example (`ESVGSB.SQLite3`, downloaded from the link you provided): the resulting HTML is malformed. I found that a number of closing HTML comment markers are wrong, which can be fixed by `sed 's/--&gt;/-->/g'`.
Also, there are a number of warnings about unsupported HTML entities such as `&lArr;` (⇐), and about other similar variants as well.
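To see which named entities (such as `&lArr;`, which renders as ⇐) a module actually uses, one option is to count them before conversion. The sketch below is a standalone Python helper and not a feature of the converter; the function names are invented here. It uses the standard library's `html.entities.html5` table to tell valid HTML5 entity names from typos.

```python
import re
from collections import Counter
from html.entities import html5

# A named entity reference: '&', a name, then ';'.
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def count_named_entities(text: str) -> Counter:
    """Count how often each named entity (without '&' and ';') occurs."""
    return Counter(_NAMED_ENTITY.findall(text))


def is_known_html5_entity(name: str) -> bool:
    """True if '&name;' is a named character reference defined by HTML5."""
    return name + ";" in html5


if __name__ == "__main__":
    sample = "north &lArr; south &rArr; east &lArr; west &bogus;"
    for name, count in count_named_entities(sample).most_common():
        status = "known" if is_known_html5_entity(name) else "unknown"
        print(f"&{name}; x{count} ({status})")
```

Entities reported as known here but rejected by the converter are candidates for being passed through unchanged when the output format is HTML.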