ibmruntimes / Semeru-Runtimes

Issue repo for all things IBM Semeru Runtimes

GB18030 character 龦 can't be parsed by SAXParser #54

Open zhxiaoliibm opened 1 year ago

zhxiaoliibm commented 1 year ago

This is the XML file: GB1803_002.zip

This is the demo code:

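         // Needs: javax.xml.XMLConstants, javax.xml.parsers.SAXParser,
         // javax.xml.parsers.SAXParserFactory, org.xml.sax.SAXException, org.xml.sax.XMLReader.
         // DummySAXEventHandler is the reporter's own handler class (not shown in this issue).
         SAXParserFactory spf = SAXParserFactory.newInstance();
         DummySAXEventHandler saxParserHandler = new DummySAXEventHandler();

         try {
             SAXParser saxParser = spf.newSAXParser();
             // Disallow access to external DTDs and schemas.
             saxParser.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
             saxParser.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
             XMLReader xmlReader = saxParser.getXMLReader();
             xmlReader.setContentHandler(saxParserHandler);
             xmlReader.setEntityResolver(saxParserHandler);
             xmlReader.parse(xmlFileName);
         }
         catch (SAXException e) {
             // Report the parse error instead of silently swallowing it.
             e.printStackTrace();
         }
         catch (Exception e) {
             e.printStackTrace();
         }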

When I try to parse this XML file, it throws an org.xml.sax.SAXParseException with this error message:

lineNumber: 8; columnNumber: 11; Element type "aBB〢_郎嗀秊_" must be followed by either attribute specifications, ">" or "/>".

pshipton commented 1 year ago

Does this need GB18030-2022 support? Support for GB18030-2022 is added in the next release. You didn't mention which version/platform you are using. There are some preliminary builds for the next release available as announced in Slack at https://openj9.slack.com/archives/C01C8PL6319/p1689274282032669, copying the details here.

Semeru Open Edition Milestone 2 build for the July release has been published on a subset of platforms:

~https://github.com/ibmruntimes/semeru8-binaries/releases/tag/jdk8u382-b04_openj9-0.40.0-m2~
https://github.com/ibmruntimes/semeru11-binaries/releases/tag/jdk-11.0.20%2B7_openj9-0.40.0-m2
https://github.com/ibmruntimes/semeru17-binaries/releases/tag/jdk-17.0.8%2B6_openj9-0.40.0-m2
~https://github.com/ibmruntimes/semeru20-binaries/releases/tag/jdk-20.0.1%2B9_openj9-0.40.0-m2~

pshipton commented 1 year ago

The support for GB18030-2022 is not in the preliminary jdk8 or jdk20 builds, but will be in the final builds.

knn-k commented 1 year ago

I reproduced the failure using the Semeru 11.0.20+7 m2 build above. Interestingly, the SAX parser reads the XML file successfully when I replace every occurrence of the character '龦' (U+9FA6) with '龥' (U+9FA5).

Semeru 17.0.8+6 m2 gives the same result.
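
The comparison above can be tried without the attached file. The following is a minimal sketch, assuming that the character's position in an element name is what matters, as the error message suggests. The class name NameCharRepro and the one-element documents are hypothetical and are not taken from GB1803_002.xml.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NameCharRepro {

    // Returns true if the JDK's default SAX parser accepts the document,
    // false if it reports a SAXException (including SAXParseException).
    static boolean parses(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            System.out.println("  rejected: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical documents: only the last character of the element name differs.
        System.out.println("U+9FA5 in element name parsed: " + parses("<a\u9FA5/>"));
        System.out.println("U+9FA6 in element name parsed: " + parses("<a\u9FA6/>"));
    }
}
```

If the behaviour matches what is described in this thread, the first document should be accepted and the second rejected, but the result for U+9FA6 may differ between JDK versions and XML parser implementations.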

knn-k commented 1 year ago

The following program gives the same result with 11.0.19, 11.0.20, 17.0.7, and 17.0.8. Both U+9FA5 and U+9FA6 are defined, and their type is 5 (Character.OTHER_LETTER).

public class CharType {

    public static void showProperties(char c) {
        System.out.println("U+" + Integer.toHexString(c) + ": Type=" + Character.getType(c) + ", isDefined=" + Character.isDefined(c));
    }

    public static void main(String[] args) {
        showProperties('\u9FA5');
        showProperties('\u9FA6');
    }

}
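
One possible explanation, not confirmed in this thread, is that the parser checks element names against the character classes of XML 1.0 (editions 1 to 4) rather than against Java's Unicode properties. In those editions the Ideographic production is [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029], which stops exactly at U+9FA5. The sketch below only compares that range with Character.isLetter; the class and method names are hypothetical, and whether the JDK's parser applies exactly this rule is an assumption.

```java
class XmlIdeographicCheck {

    // Ideographic production from XML 1.0 (editions 1 to 4), Appendix B:
    // [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]
    static boolean isXml10Ideographic(int cp) {
        return (cp >= 0x4E00 && cp <= 0x9FA5)
                || cp == 0x3007
                || (cp >= 0x3021 && cp <= 0x3029);
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x9FA5, 0x9FA6}) {
            System.out.printf("U+%04X: Character.isLetter=%b, XML 1.0 Ideographic=%b%n",
                    cp, Character.isLetter(cp), isXml10Ideographic(cp));
        }
    }
}
```

This would be consistent with U+9FA5 being accepted in an element name while U+9FA6 is not, even though Java reports both as letters.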

pshipton commented 1 year ago

Does it work on a Temurin build? Until the next release is completed, the most recent builds are nightly builds. https://adoptium.net/temurin/nightly/

knn-k commented 1 year ago

Temurin 11.0.20-beta fails in the same way.

[Fatal Error] GB1803_002.xml:8:11: Element type "aBB〢_郎嗀秊_" must be followed by either attribute specifications, ">" or "/>".

$ jdk-11.0.20+7/bin/java -version
openjdk version "11.0.20-beta" 2023-07-18
OpenJDK Runtime Environment Temurin-11.0.20+7-202307151707 (build 11.0.20-beta+7-202307151707)
OpenJDK 64-Bit Server VM Temurin-11.0.20+7-202307151707 (build 11.0.20-beta+7-202307151707, mixed mode)

pshipton commented 1 year ago

It seems the problem, if it is a problem, should be reported to OpenJDK.