morbac / xmltools

XML Tools plugin for Notepad++
GNU General Public License v3.0

Non-Roman digits in XML attribute strings - plugin crash report #44

Open DavidHaslam opened 4 years ago

DavidHaslam commented 4 years ago

We programmers are so centered on 0-9 style digits that we often don’t properly handle other writing systems, yet we live in a world that has many. There's a systemic bias towards English, albeit not one that's consciously intended.

In a moment of sheer whimsy, earlier today, I made a TextPipe filter that processed an OSIS XML file to replace every digit within the sID, eID, osisID and osisRef attributes with the corresponding Devanagari digit.

That is, this changed all the numerals in Bible references to Devanagari. Digits elsewhere were left untouched. So, at the very least, all chapter and verse numbers were changed.

NB. This also changed several standard book name abbreviations, such as "1Sam" to "१Sam".
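For anyone who wants to reproduce the transformation without TextPipe, here is a minimal Python sketch of the same idea. It is not the original filter: the function name `devanagari_refs` and the regular expression are assumptions made for illustration, and it presumes attribute values are always double-quoted.

```python
import re

# Assumption: ASCII 0-9 map one-to-one onto the Devanagari digits U+0966..U+096F.
TO_DEVANAGARI = str.maketrans("0123456789", "०१२३४५६७८९")

# Only these four OSIS attributes are touched; all other text is left alone.
ATTR_PATTERN = re.compile(r'\b(sID|eID|osisID|osisRef)="([^"]*)"')


def devanagari_refs(xml_text: str) -> str:
    """Replace ASCII digits inside the selected attribute values only."""
    return ATTR_PATTERN.sub(
        lambda m: f'{m.group(1)}="{m.group(2).translate(TO_DEVANAGARI)}"',
        xml_text,
    )


sample = '<verse sID="Ps.1.2" osisID="Ps.1.2"/><w lemma="strong:H02656">But</w>'
print(devanagari_refs(sample))
```

Because the substitution is applied to the whole attribute value, a book abbreviation such as "1Sam" is converted as well, which matches the side effect noted above.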

To illustrate the output, here's a sample that was Psalm 1 in the KJV Bible module source text.

<chapter osisID="Ps.१">
<title type="chapter"><abbr expansion="Psalm"><hi type="spaced-letters">PSAL</hi>.</abbr> I.</title>
<verse sID="Ps.१.१" osisID="Ps.१.१"/><w lemma="strong:H0835">Blessed</w> <transChange type="added">is</transChange> <w lemma="strong:H0376">the man</w> <w lemma="strong:H01980" morph="strongMorph:TH8804">that walketh</w> <w lemma="strong:H06098">not in the counsel</w> <w lemma="strong:H07563">of the ungodly</w>, <w lemma="strong:H05975" morph="strongMorph:TH8804">nor standeth</w> <w lemma="strong:H01870">in the way</w> <w lemma="strong:H02400">of sinners</w>, <w lemma="strong:H03427" morph="strongMorph:TH8804">nor sitteth</w> <w lemma="strong:H04186">in the seat</w> <w lemma="strong:H03887" morph="strongMorph:TH8801">of the scornful</w>.<note type="study" osisRef="Ps.१.१" osisID="Ps.१.१!note.a" n="a"><catchWord osisRef="Ps.१.१@s[ungodly]">ungodly</catchWord>: or, <rdg type="alternate">wicked</rdg>.</note><verse eID="Ps.१.१"/>
<verse sID="Ps.१.२" osisID="Ps.१.२"/><w lemma="strong:H02656">But his delight</w> <transChange type="added">is</transChange> <w lemma="strong:H08451">in the law</w> <w lemma="strong:H03068">of the <divineName>Lord</divineName></w>; <w lemma="strong:H08451">and in his law</w> <w lemma="strong:H01897" morph="strongMorph:TH8799">doth he meditate</w> <w lemma="strong:H03119">day</w> <w lemma="strong:H03915">and night</w>.<verse eID="Ps.१.२"/>
<verse sID="Ps.१.३" osisID="Ps.१.३"/><w lemma="strong:H06086">And he shall be like a tree</w> <w lemma="strong:H08362" morph="strongMorph:TH8803">planted</w> <w lemma="strong:H06388">by the rivers</w> <w lemma="strong:H04325">of water</w>, <w lemma="strong:H05414" morph="strongMorph:TH8799">that bringeth forth</w> <w lemma="strong:H06529">his fruit</w> <w lemma="strong:H06256">in his season</w>; <w lemma="strong:H05929">his leaf</w> <w lemma="strong:H05034" morph="strongMorph:TH8799">also shall not wither</w>; <w lemma="strong:H06213" morph="strongMorph:TH8799">and whatsoever he doeth</w> <w lemma="strong:H06743" morph="strongMorph:TH8686">shall prosper</w>.<note type="study" osisRef="Ps.१.३" osisID="Ps.१.३!note.a" n="a"><catchWord osisRef="Ps.१.३@s[wither]">wither</catchWord>: <abbr expansion="Hebrew">Heb.</abbr> <rdg type="x-literal">fade</rdg>.</note><verse eID="Ps.१.३"/>
<verse sID="Ps.१.४" osisID="Ps.१.४"/><w lemma="strong:H07563">The ungodly</w> <transChange type="added">are</transChange> not so: but <transChange type="added">are</transChange> <w lemma="strong:H04671">like the chaff</w> <w lemma="strong:H07307">which the wind</w> <w lemma="strong:H05086" morph="strongMorph:TH8799">driveth away</w>.<verse eID="Ps.१.४"/>
<verse sID="Ps.१.५" osisID="Ps.१.५"/><w lemma="strong:H07563">Therefore the ungodly</w> <w lemma="strong:H06965" morph="strongMorph:TH8799">shall not stand</w> <w lemma="strong:H04941">in the judgment</w>, <w lemma="strong:H02400">nor sinners</w> <w lemma="strong:H05712">in the congregation</w> <w lemma="strong:H06662">of the righteous</w>.<verse eID="Ps.१.५"/>
<verse sID="Ps.१.६" osisID="Ps.१.६"/><w lemma="strong:H03068">For the <divineName>Lord</divineName></w> <w lemma="strong:H03045" morph="strongMorph:TH8802">knoweth</w> <w lemma="strong:H01870">the way</w> <w lemma="strong:H06662">of the righteous</w>: <w lemma="strong:H01870">but the way</w> <w lemma="strong:H07563">of the ungodly</w> <w lemma="strong:H06" morph="strongMorph:TH8799">shall perish</w>.<verse eID="Ps.१.६"/>
</chapter>

When I used XMLTools to check XML Syntax, the plugin crashed and gave the following error message popup.

https://www.dropbox.com/s/3bk4z4ew6escrtl/Screenshot%202020-05-04%2019.06.00.png?dl=0

I wonder whether this might be a significant upstream problem in a library that is used by many other open source programs. That's a sound reason for reporting it here to begin with.

Aside: my apparently strange action did have a real-world context, viz.

I am in discussion with a colleague about some of the regular expressions in the .xsd file used for OSIS validation. The regexp pattern \p{N} is used repeatedly therein. This matches any Unicode character in the Number general category, which, besides decimal digits in any script, also covers fractions, superscript and subscript numbers, numbers in circles, numbers like 10,000, Roman numerals, and other letter-like numbers!
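To see how broad \p{N} is, the following sketch uses Python's standard `unicodedata` module. Python's built-in `re` module does not support \p{N}, so the check is approximated here by testing whether the general category starts with "N" (Nd, Nl or No). The helper name `is_number` is hypothetical.

```python
import unicodedata


def is_number(ch: str) -> bool:
    # \p{N} covers the general categories Nd (decimal digit),
    # Nl (letter-like number) and No (other number).
    return unicodedata.category(ch).startswith("N")


# ASCII digit, Devanagari digit, fraction, superscript,
# circled number, Roman numeral, and a plain letter for contrast.
for ch in ["7", "१", "½", "²", "①", "Ⅳ", "A"]:
    print(ch, unicodedata.category(ch), is_number(ch))
```

Every character in the list except the letter "A" falls in a Number category, so a schema pattern built on \p{N} would accept a Devanagari digit wherever it accepts an ASCII one.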

My curious experiment never got as far as an attempt to validate the OSIS, as it hit the plugin crash even for XML Syntax Check. In case anyone was already wondering, I should put on record that the input XML file sent to my filter passes both syntax check and OSIS validation.

DavidHaslam commented 4 years ago

Q. Do we know what XML per se permits as characters within an attribute string, irrespective of any particular schema?

Yes - we do - see Attribute Value

/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

So my whimsical experiment was in fact quite reasonable.
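As a quick cross-check, a conforming parser should accept such attribute values. The sketch below uses Python's standard `xml.etree.ElementTree`, which is an assumption chosen for convenience and is not the library the plugin relies on. It only shows that a shortened version of the sample is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Shortened from the Psalm 1 sample: Devanagari digits in attribute values.
snippet = '<chapter osisID="Ps.१"><verse sID="Ps.१.१" osisID="Ps.१.१"/></chapter>'

# fromstring raises ET.ParseError if the input is not well-formed.
root = ET.fromstring(snippet)
print(root.get("osisID"))
print(root[0].get("sID"))
```

If the parse succeeds here, any crash on the same input would point to the checking tool or its library rather than to the document.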

DavidHaslam commented 4 years ago

By the way, the same error message appears even if only one digit in one attribute is changed to Devanagari.

LetMeSleepAlready commented 4 years ago

I tried this out, but cannot reproduce it with the latest version. See the picture below.

[image]

I run Windows 10, and tried Notepad++ in both 32-bit and 64-bit mode. As far as I know, the validation is done by a standard library provided with Windows (MSXML). However, this was changed "recently"; before that it was a different library (libxml?). What version of XMLTools are you using?