schierlm / BibleMultiConverter

Converter written in Java to convert between different Bible program formats

Disabling use of XML entities for utf-8 characters #62

Closed (shadow-light closed this issue 1 year ago)

shadow-light commented 1 year ago

Hi, thanks for this great converter. I'm converting USFM -> USX and noticed that it produces XML entities (numeric character references) instead of literal UTF-8 characters, even though the output encoding is UTF-8.

Example:

\v 1 Iwamɨ́ó xwɨ́árí tɨ́nɨ aŋɨ́na tɨ́nɨ imɨxɨnɨŋíná eŋo nánɨ —Omɨ arɨ́á wirane negɨ́ sɨŋwɨ́ tɨ́ tɨ́nɨ wɨnɨrane sɨŋwɨ́ wɨnaxɨ́dɨrane wé tɨ́nɨ ɨ́á xɨrɨrane eŋwáorɨnɨ. Xwɨyɨ́á dɨŋɨ́ nɨyɨmɨŋɨ́ imónɨŋɨ́pɨ nánɨ neaíwapɨyiŋorɨnɨ.

<verse number="1" style="v" sid="1JN 1:1"/>Iwam&#616;&#769;&#243; xw&#616;&#769;&#225;r&#237; t&#616;&#769;n&#616; a&#331;&#616;&#769;na t&#616;&#769;n&#616; im&#616;x&#616;n&#616;&#331;&#237;n&#225; e&#331;o n&#225;n&#616; &#8212;Om&#616; ar&#616;&#769;&#225; wirane neg&#616;&#769; s&#616;&#331;w&#616;&#769; t&#616;&#769; t&#616;&#769;n&#616; w&#616;n&#616;rane s&#616;&#331;w&#616;&#769; w&#616;nax&#616;&#769;d&#616;rane w&#233; t&#616;&#769;n&#616; &#616;&#769;&#225; x&#616;r&#616;rane e&#331;w&#225;or&#616;n&#616;. Xw&#616;y&#616;&#769;&#225; d&#616;&#331;&#616;&#769; n&#616;y&#616;m&#616;&#331;&#616;&#769; im&#243;n&#616;&#331;&#616;&#769;p&#616; n&#225;n&#616; nea&#237;wap&#616;yi&#331;or&#616;n&#616;.<verse eid="1JN 1:1"/>


This is fine for parsing, but it significantly increases the file size, and I'm planning to serve the files over the network. Is there an easy way to disable this?
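To put a rough number on the size concern, here is a minimal sketch that compares the UTF-8 byte length of a word from the verse above with its escaped form. The class and method names are hypothetical, and the rule "escape every code point above U+007F" is an assumption inferred from the sample output (where á, U+00E1, appears as &#225;), not taken from the converter's code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of BibleMultiConverter.
class EntitySizeDemo {

    /**
     * Replaces every code point above U+007F with a decimal numeric
     * character reference (assumed threshold, inferred from the sample).
     */
    static String toNumericReferences(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp > 0x7F) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    /** Number of bytes the text occupies when encoded as UTF-8. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The word "tɨ́nɨ" from the verse: t, U+0268, U+0301, n, U+0268.
        String sample = "t\u0268\u0301n\u0268";
        String escaped = toNumericReferences(sample);
        System.out.println(escaped);
        System.out.println("raw UTF-8 bytes: " + utf8Length(sample));
        System.out.println("escaped bytes:   " + utf8Length(escaped));
    }
}
```

For this one word the escaped form takes 20 bytes against 8 bytes of raw UTF-8. The ratio for a whole file depends on how much of the text is non-ASCII.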

schierlm commented 1 year ago

This is interesting. We use a custom XMLWriter to override the significant-whitespace rules (which are somewhat odd in USX). I was not aware that this automatically switches the character escape handler to DumbEscapeHandler, which results in everything above U+0100 being escaped.

It should be possible to get rid of this annoying behaviour, but I will have to take a closer look at how exactly.

@Rolf-Smit: I assume you did not notice that behaviour when you did the USX revamp in #39?
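For readers wondering what a less aggressive escape handler could look like, here is a minimal standalone sketch. It is not the fix that landed in the project and it is not wired into any XML library. The class name is hypothetical, and the method signature is only loosely modeled on the general shape of a character escape callback (character buffer, range, attribute flag, output writer), which is an assumption. It escapes only the characters that XML markup requires and writes everything else, including non-ASCII text, unchanged.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical sketch, not the converter's actual escape handler.
class MinimalEscaper {

    /**
     * Writes ch[start .. start+length) to out, escaping only XML markup
     * characters. Non-ASCII characters are passed through as-is, so the
     * writer's encoding (e.g. UTF-8) must be able to represent them.
     */
    static void escape(char[] ch, int start, int length,
                       boolean isAttVal, Writer out) throws IOException {
        int limit = start + length;
        for (int i = start; i < limit; i++) {
            char c = ch[i];
            switch (c) {
                case '&':
                    out.write("&amp;");
                    break;
                case '<':
                    out.write("&lt;");
                    break;
                case '>':
                    out.write("&gt;");
                    break;
                case '"':
                    // Quotes only need escaping inside attribute values.
                    if (isAttVal) {
                        out.write("&quot;");
                    } else {
                        out.write(c);
                    }
                    break;
                default:
                    out.write(c);
            }
        }
    }

    /** Convenience wrapper for element text (not attribute values). */
    static String escapeText(String text) {
        StringWriter out = new StringWriter();
        try {
            escape(text.toCharArray(), 0, text.length(), false, out);
        } catch (IOException e) {
            // StringWriter does not fail, but the signature requires handling.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Calling MinimalEscaper.escapeText on text from the verse leaves letters such as ɨ and ŋ untouched, while a literal & or < in the text is still escaped. This only works safely when the output encoding can represent every character, which is the case for UTF-8.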

schierlm commented 1 year ago

Nightly build that includes this fix as well as #64: https://nightly.link/schierlm/BibleMultiConverter/workflows/main.yaml/master/BibleMultiConverter-AllInOneEdition-Release.zip