scala / scala-xml

The standard Scala XML library
Apache License 2.0

Name predicates: which XML version? #607

Open rossabaker opened 2 years ago

rossabaker commented 2 years ago

I am trying to implement a ScalaCheck XML generator whose output round-trips through writing and parsing. I've run into a discrepancy between the character sets in scala-xml and the JVM internals. Is scala-xml's alphabet expected to target a specific version of the XML spec? I'm finding that it matches neither the JVM's idea of XML 1.0 nor its idea of XML 1.1.

I tried to make this a scala-cli script, but I can't get it to accept the com.sun.org imports. I have to run this on Java 8 (specifically, I used 1.8.0_292) to avoid trouble with the module system.

import com.sun.org.apache.xml.internal.utils.XMLChar
import com.sun.org.apache.xml.internal.utils.XML11Char
import scala.xml.Utility

object Chars extends App {
  val allChars = (Char.MinValue to Char.MaxValue)

  val charSets = Map(
    "scala-xml-start"  -> ((c: Char) => Utility.isNameStart(c)),
    "xml-1.0-start"    -> ((c: Char) => XMLChar.isNameStart(c)),
    "xml-1.1-start"    -> ((c: Char) => XML11Char.isXML11NameStart(c)),

    "scala-xml"  -> ((c: Char) => Utility.isNameChar(c)),
    "xml-1.0"    -> ((c: Char) => XMLChar.isName(c)),
    "xml-1.1"    -> ((c: Char) => XML11Char.isXML11Name(c)),
  )

  def compare(a: String, b: String) = {
    val diff = allChars.filter(charSets(a)).filterNot(charSets(b))
    println(s"In ${a}, not ${b}: ${diff.size}")
    println(diff.take(10))
    println()
  }

  compare("scala-xml-start", "xml-1.0-start")
  compare("xml-1.0-start", "scala-xml-start")

  compare("scala-xml-start", "xml-1.1-start")
  compare("xml-1.1-start", "scala-xml-start")

  compare("scala-xml", "xml-1.0")
  compare("xml-1.0", "scala-xml")

  compare("scala-xml", "xml-1.1")
  compare("xml-1.1", "scala-xml")
}

scala-xml

In scala-xml-start, not xml-1.0-start: 13800
Vector(ª, µ, º, Ĳ, ĳ, Ŀ, ŀ, ŉ, ſ, Ǆ)

In xml-1.0-start, not scala-xml-start: 11
Vector(ʻ, ʼ, ʽ, ʾ, ʿ, ˀ, ˁ, ՙ, ۥ, ۦ)

In scala-xml-start, not xml-1.1-start: 3
Vector(ª, µ, º)

In xml-1.1-start, not scala-xml-start: 5700
Vector(ʰ, ʱ, ʲ, ʳ, ʴ, ʵ, ʶ, ʷ, ʸ, ʹ)

In scala-xml, not xml-1.0: 14993
Vector(ª, µ, º, Ĳ, ĳ, Ŀ, ŀ, ŉ, ſ, Ǆ)

In xml-1.0, not scala-xml: 4
Vector(·, ۝, ۞, ℮)

In scala-xml, not xml-1.1: 3
Vector(ª, µ, º)

In xml-1.1, not scala-xml: 4021
Vector(˂, ˃, ˄, ˅, ˒, ˓, ˔, ˕, ˖, ˗)

I think I can limit my generators to characters that pass both the JVM's and scala-xml's predicates, but I'm curious whether this difference is known and intentional. Thanks!
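The workaround described above (keep only characters that every predicate accepts) could be sketched roughly as follows. This is a minimal illustration, not code from the thread: `SafeNameChars`, `acceptedByAll`, and the two stand-in predicates are hypothetical names, and the stand-ins only exist so that the block runs without scala-xml or the `com.sun.org` internals on the classpath.

```scala
object SafeNameChars {
  // Every 16-bit Char value, as in the comparison script above.
  val allChars: Vector[Char] = (Char.MinValue to Char.MaxValue).toVector

  /** Characters accepted by every predicate in `preds`. */
  def acceptedByAll(preds: Seq[Char => Boolean]): Vector[Char] =
    allChars.filter(c => preds.forall(p => p(c)))

  // Hypothetical stand-in predicates. In the real setting these would be
  // wrappers such as (c: Char) => Utility.isNameStart(c) and
  // (c: Char) => XMLChar.isNameStart(c), as in the script above.
  val asciiLetter: Char => Boolean =
    c => (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
  val javaLetter: Char => Boolean =
    c => Character.isLetter(c)

  def main(args: Array[String]): Unit = {
    val safe = acceptedByAll(Seq(asciiLetter, javaLetter))
    println(s"${safe.size} characters pass every predicate")
  }
}
```

A generator could then draw from the resulting collection (for example with ScalaCheck's `Gen.oneOf`, assuming ScalaCheck is available), so that generated names only use characters both parsers agree on.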

ashawley commented 2 years ago

I'm sure it's just XML 1.0 and not 1.1. I'm also not surprised it's inconsistent with the spec. The implementation of isNameChar in scala-xml hasn't fundamentally changed in 20 years. There's probably no one around to explain the rationale for the differences.

rossabaker commented 2 years ago

I did some homework. tl;dr:

I am willing to update docs or synchronize the predicates with a particular XML standard.

--

What's defined in scalacheck-xml is fully consistent with the JDK's XMLChar. This is XML 1.0, Fourth Edition. I found an ancient rant about Fifth Edition, which is the status quo in Xerces.

The scaladoc on TokenTests (which Utility extends) refers to XML 1.0's Appendix B, which holds the Fourth Edition rules, now "orphaned" in Fifth Edition. Those rules are based on Unicode 2.0 (JDK 1.1 era!), with some complicated exceptions.

Furthermore, since scala-xml just delegates to Unicode character types, its predicates are a function of the JVM version. XML 1.0 Fifth Edition's "intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names," but it's still a fixed set.
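To make the contrast concrete, here is a rough sketch of the two styles of predicate. `categoryNameStart` only approximates the category-based approach described above (it is not scala-xml's actual source, and the exact set of categories is an assumption), while `fifthEditionNameStart` transcribes the fixed NameStartChar ranges from XML 1.0 Fifth Edition, restricted to the BMP because a `Char` is 16 bits wide. The object and method names are hypothetical.

```scala
object NamePredicates {
  // Unicode general categories treated as "letters" (assumed set: Ll, Lu, Lo, Lt, Nl).
  private val startTypes: Set[Int] = Set(
    Character.LOWERCASE_LETTER.toInt,
    Character.UPPERCASE_LETTER.toInt,
    Character.OTHER_LETTER.toInt,
    Character.TITLECASE_LETTER.toInt,
    Character.LETTER_NUMBER.toInt
  )

  /** Category-based: the answer follows the Unicode tables of the running JVM. */
  def categoryNameStart(c: Char): Boolean =
    startTypes(Character.getType(c)) || c == '_' || c == ':'

  // XML 1.0 Fifth Edition NameStartChar ranges (BMP only; the
  // #x10000-#xEFFFF range cannot be expressed with a single Char).
  private val fifthEdStartRanges: Seq[(Int, Int)] = Seq(
    0xC0 -> 0xD6, 0xD8 -> 0xF6, 0xF8 -> 0x2FF, 0x370 -> 0x37D,
    0x37F -> 0x1FFF, 0x200C -> 0x200D, 0x2070 -> 0x218F,
    0x2C00 -> 0x2FEF, 0x3001 -> 0xD7FF, 0xF900 -> 0xFDCF,
    0xFDF0 -> 0xFFFD
  )

  /** Range-based: a fixed set, independent of the JVM's Unicode version. */
  def fifthEditionNameStart(c: Char): Boolean =
    c == ':' || c == '_' ||
      (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
      fifthEdStartRanges.exists { case (lo, hi) => c >= lo && c <= hi }

  def main(args: Array[String]): Unit = {
    val all = (Char.MinValue to Char.MaxValue).toVector
    println(s"java.version = ${System.getProperty("java.version")}")
    println(s"category-based start chars: ${all.count(categoryNameStart)}")
    println(s"fifth-edition start chars:  ${all.count(fifthEditionNameStart)}")
  }
}
```

Run on different JDKs, the first count may change as the bundled Unicode tables grow, while the second should stay the same. The sketch also agrees with two samples from the output above: µ (U+00B5) is a lowercase letter but falls outside the fixed ranges, and ʰ (U+02B0) is a modifier letter inside the range #xF8-#x2FF.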