Open rossabaker opened 2 years ago
I'm sure it's just XML 1.0 and not 1.1. I'm also not surprised it's inconsistent with the spec. The implementation for isNameChar in scala-xml hasn't fundamentally changed in 20 years. There's probably not someone around to explain the rationale for the differences.
I did some homework. tl;dr:
I am willing to update docs or synchronize the predicates with a particular XML standard.
--
What's defined in scalacheck-xml is fully consistent with the JDK's XMLChar. This is XML 1.0, Fourth Edition. I found an ancient rant about Fifth Edition, which is the status quo in Xerces.
The scaladoc on TokenParserTests (which Utility extends) refers to XML 1.0's Appendix B, which also contains the Fourth Edition rules, now "orphaned" in Fifth Edition. That spec is based on Unicode 2.0 (JDK 1.1 era!), with some complicated exceptions:
- ª (0xaa) is excluded by the spec because it has "a font or compatibility decomposition".
- ʻ (0x2bb) is included by the spec "because the property file classifies them as Alphabetic".

Furthermore, since scala-xml just delegates to Unicode character types, its predicates are a function of the JVM version. XML 1.0 Fifth Edition's "intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names," but it's still a fixed set.
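To make the "fixed set" point concrete, here is a minimal sketch of the Fifth Edition NameStartChar and NameChar productions as plain range checks, next to the JVM's own classification of the two characters above. The ranges are transcribed from the XML 1.0 Fifth Edition grammar; the object and method names (`XmlNameSketch`, `isNameStartChar5`, `isNameChar5`, `jvmCategory`) are hypothetical and belong to neither scala-xml nor the JDK.

```scala
object XmlNameSketch {
  // Inclusive code point ranges for NameStartChar (XML 1.0 Fifth Edition).
  private val nameStartRanges: Seq[(Int, Int)] = Seq(
    0x3A -> 0x3A,       // ':'
    0x41 -> 0x5A,       // 'A'..'Z'
    0x5F -> 0x5F,       // '_'
    0x61 -> 0x7A,       // 'a'..'z'
    0xC0 -> 0xD6, 0xD8 -> 0xF6, 0xF8 -> 0x2FF,
    0x370 -> 0x37D, 0x37F -> 0x1FFF, 0x200C -> 0x200D,
    0x2070 -> 0x218F, 0x2C00 -> 0x2FEF, 0x3001 -> 0xD7FF,
    0xF900 -> 0xFDCF, 0xFDF0 -> 0xFFFD, 0x10000 -> 0xEFFFF
  )

  // Additional ranges that NameChar allows on top of NameStartChar.
  private val nameCharExtraRanges: Seq[(Int, Int)] = Seq(
    0x2D -> 0x2D,       // '-'
    0x2E -> 0x2E,       // '.'
    0x30 -> 0x39,       // '0'..'9'
    0xB7 -> 0xB7,       // middle dot
    0x300 -> 0x36F, 0x203F -> 0x2040
  )

  private def inRanges(cp: Int, ranges: Seq[(Int, Int)]): Boolean =
    ranges.exists { case (lo, hi) => lo <= cp && cp <= hi }

  // Fixed by the spec text: the answer does not depend on the JVM version.
  def isNameStartChar5(cp: Int): Boolean = inRanges(cp, nameStartRanges)

  def isNameChar5(cp: Int): Boolean =
    isNameStartChar5(cp) || inRanges(cp, nameCharExtraRanges)

  // Depends on the Unicode tables shipped with the running JVM.
  def jvmCategory(cp: Int): Int = Character.getType(cp)
}
```

With these definitions, `XmlNameSketch.isNameStartChar5(0xaa)` is false and `XmlNameSketch.isNameStartChar5(0x2bb)` is true, regardless of the JVM, whereas `jvmCategory` reports whatever general category the running JVM's Unicode data assigns.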
I am trying to implement a Scalacheck XML generator that round trips through writing and parsing. I've run into a discrepancy between the character sets in scala-xml and the JVM internals. Is it expected that scala-xml's alphabet targets a specific version of the XML spec? I'm finding that the scala-xml alphabet does not match the JVM's idea of XML 1.0 nor XML 1.1.
I tried to make this a scala-cli script, but I can't get it to accept the com.sun.org imports. I have to run this on Java 8 (specifically, I used 1.8.0_292) to avoid trouble with the module system.
I think I can limit my generators to the characters that pass both the JVM's and scala-xml's predicates, but I'm curious whether this difference is known and intentional. Thanks!
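The workaround of restricting generators to the shared alphabet could be sketched as below. This is only an illustration under stated assumptions: `CommonAlphabet`, `common`, `jvmLetter`, and `fixedRanges` are hypothetical names, and the two predicates are stand-ins so the snippet runs without scala-xml or the JDK-internal XMLChar on the classpath. In a real run they would be replaced by the scala-xml and JVM predicates being compared.

```scala
object CommonAlphabet {
  // Every BMP char accepted by both predicates.
  def common(p: Char => Boolean, q: Char => Boolean): Vector[Char] =
    (0 to 0xFFFF).iterator
      .map(_.toChar)
      .filter(c => p(c) && q(c))
      .toVector

  // Stand-in for a predicate that delegates to the JVM's Unicode tables.
  val jvmLetter: Char => Boolean = c => Character.isLetter(c)

  // Stand-in for a predicate backed by a fixed list of ranges
  // (deliberately tiny; not any real spec's full table).
  val fixedRanges: Char => Boolean = c =>
    (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= 0xC0 && c <= 0xD6)

  // Characters that are safe under both stand-in predicates.
  val alphabet: Vector[Char] = common(jvmLetter, fixedRanges)
}
```

The resulting `alphabet` could then be handed to something like ScalaCheck's `Gen.oneOf`, so that generated names only use characters both sides agree on. Computing the set once up front also makes it easy to diff the two predicates and list exactly which characters disagree.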