hunterhacker / jdom

Java manipulation of XML made easy

getValue() does not return the complete CDATA #117

Closed Karthik9479 closed 11 years ago

Karthik9479 commented 11 years ago

My input xml is of this schema as available here http://www.asam.net/nc/de/home/standards/standard-detail.html?tx_rbwbmasamstandards_pi1%5BshowUid%5D=894&start= (Site needs registration to download the XSD)

In one of the tags I have a CDATA section that contains the XML export from a JIRA system.

When I try to get the CDATA contents using getValue() or getText(), I get the following

getValue() - Returns all the textual content from the CDATA without the XML tags
getText() - Returns "<!CDATA[]>"

With the same XML, I am able to get the complete CDATA section in C# using System.XML without any issues.

rolfl commented 11 years ago

Hi Karthik. I have scoured the code, and, frankly, what you are reporting does not make sense. The only way it makes sense is if the Element you call getText() on and the one you call getValue() on are different Elements, with different Text content.

I cannot do much more with the information I have at the moment, other than to recommend that you put the getText() and getValue() calls right next to each other so that I can be convinced that they are being operated on the same Element instances.

I have a feeling that you really do have an Element in your document that really has the text content &lt;![CDATA[ ]]&gt; or something, so that getText() returns <![CDATA[ ]]>.

Karthik9479 commented 11 years ago

Thanks a lot for your feedback. Actually, the input XML I am using is the output of an XSL transformation. I shall try to post the code, with all the sensitive info removed, within a day or two.

Thanks again for taking time for looking into this

rolfl commented 11 years ago

No feedback, so closing. Comment if there's a change

Karthik9479 commented 11 years ago

Hi Rolf,

I am attaching a sample source code to reproduce the behavior.

Basically, the problem happens when processing the in-memory Document object, and does not happen when I open the XML file and process it.

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;

//import javax.xml.transform.Source;
//import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;
import javax.xml.transform.stream.StreamSource;

//import org.jdom2.filter.Filters;
import org.jdom2.input.SAXBuilder;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;
import org.jdom2.transform.JDOMResult;
import org.jdom2.transform.JDOMSource;
import org.jdom2.util.IteratorIterable;
//import org.jdom2.xpath.XPathExpression;
//import org.jdom2.xpath.XPathFactory;
import org.jdom2.Content;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.JDOMException;
//import org.jdom2.Namespace;

public class DemoParser {

    public static void main(String[] args) {
        final String SOURCE_XML = "<source><outer><inner>This has to be <b>escaped</b></inner></outer></source>";
        final String TRANSFORM = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
                             + "<xsl:stylesheet version=\"1.0\""
                             + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
                             + " xmlns=\"http://www.example.org/schemas/foo/bar\""
                             + " xmlns:java=\"http://xml.apache.org/xslt/java\""
                             + " exclude-result-prefixes=\"java\"  >"
                             + "<xsl:template match=\"/\">"
                             + "<target>"
                             + "<xsl:apply-templates/>"
                             + "</target>"
                             + "</xsl:template>"
                             + "<xsl:template match=\"outer\">"
                             + "<core>"
                             + "<xsl:text disable-output-escaping=\"yes\"><![CDATA[<![CDATA[]]></xsl:text><xsl:copy-of select=\".\"/><xsl:text disable-output-escaping=\"yes\"><![CDATA[]]]]><![CDATA[>]]></xsl:text>"
                             + "</core>"
                             + "</xsl:template>"
                             + "</xsl:stylesheet>";

        final String OUT_FILE = "C:\\temp\\foo.xml";
        Document doc = null;
        try {
            doc = new SAXBuilder().build(new StringReader(SOURCE_XML));
            JDOMSource in = new JDOMSource(doc);
            JDOMResult out = new JDOMResult();
            TransformerFactory.newInstance().newTransformer(new StreamSource(new StringReader(TRANSFORM))).transform(in, out);
            System.out.println("JDOMResult document:");
            listDescendants(out.getDocument().getRootElement());
            XMLOutputter xmlOutput = new XMLOutputter();            
            xmlOutput.setFormat(Format.getCompactFormat());
            // Close the stream before the file is read back in.
            try (FileOutputStream fos = new FileOutputStream(OUT_FILE)) {
                xmlOutput.output(out.getDocument(), fos);
            }
            doc = new SAXBuilder().build(new FileInputStream(OUT_FILE));
            System.out.println("=============================================================");
            System.out.println("Parsed output file document:");
            listDescendants(doc.getRootElement());

        } catch (JDOMException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (TransformerConfigurationException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (TransformerException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (TransformerFactoryConfigurationError e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        /*
        System.out.println("Now let's try XPath...");
        XPathExpression<Element> xElements = XPathFactory.instance().compile("//cmns:ISSUE-ANNOTATIONS/cmns:ISSUE-ANNOTATION/cmns:ANNOTATION-TEXT/cmns:P", Filters.element(),null,Namespace.getNamespace("cmns", "http://www.asam.net/schemas/issue/issue310"));
        Element foundElement = xElements.evaluateFirst(doc);
        listDescendants(foundElement);
        */
    }

    public static void listDescendants(Element e) {
        if (e != null) {
            IteratorIterable<Content> iter = e.getDescendants();
            System.out.println("Listing content of element \"" + e.getName() + '"');
            while (iter.hasNext()) {
                Content content = iter.next();
                switch (content.getCType()) {
                case CDATA:
                    System.out.println("This is a CDATA node");
                    System.out.println("\tThe value is: \"" + content.getValue() + "\"");
                    break;
                case Element:
                    System.out.println("This is an element node");
                    System.out.println("\tThe name is: " + ((Element) content).getName());
                    System.out.println("\tThe text is: " + ((Element) content).getText());
                    System.out.println("\tThe value is: \"" + content.getValue() + "\"");
                    break;
                case Comment:
                    System.out.println("This is a comment");
                    System.out.println("\tThe value is: \"" + content.getValue() + "\"");
                    break;
                case DocType:
                    System.out.println("This is a DocType");
                    System.out.println("\tThe value is: \"" + content.getValue() + "\"");
                    break;
                case EntityRef:
                    System.out.println("This is an EntityRef");
                    System.out.println("\tThe value is: \"" + content.getValue() + "\"");
                    break;
                case ProcessingInstruction:
                    System.out.println("This is a ProcessingInstruction");
                    System.out.println("\tThe value is: \"" + content.getValue() + "\"");
                    break;
                case Text:
                    System.out.println("This is Text");
                    System.out.println("\tThe value is: \"" + content.getValue() + "\"");
                    break;
                default:
                    System.out.println("This is something completely different");
                    break;
                }
            }
        } else {
            System.out.println("No element");
        }
    }
    public static void showChildrenRecursive(Element myElement, int level) {
        if (myElement == null) {
            return;
        }
        String myName = myElement.getName();
        for (int i = 0; i < level; i++) {
            System.out.print("\t");
        }
        System.out.println("My name is " + myName + " and my  CType is " +  myElement.getCType());
        List<Element> myChildren = myElement.getChildren();
        for (Element iElem : myChildren) {
            showChildrenRecursive(iElem, level + 1);
        }
    }
}

Environment:

rolfl commented 11 years ago

Hi again. Thanks for the use-case. It explains the issue really well.

Unfortunately (for you), JDOM is doing the 'right thing'. In this case, I think you have a misunderstanding about what the output of the transformation is doing. It is adding 'TRAX Escaping' ProcessingInstructions to the output from the transformation. JDOM is 'handling' those instructions, and producing the expected results. Here is how it works:

The actual XML output from the transformation is:

<?xml version="1.0" encoding="UTF-8"?>
<target xmlns="http://www.example.org/schemas/foo/bar"><core><?javax.xml.transform.disable-output-escaping?>&lt;![CDATA[<?javax.xml.transform.enable-output-escaping?><outer><inner>This has to be<b>escaped</b></inner></outer><?javax.xml.transform.disable-output-escaping?>]]&gt;<?javax.xml.transform.enable-output-escaping?></core></target>

Notice all the PIs: <?javax.xml.transform.enable-output-escaping?> and <?javax.xml.transform.disable-output-escaping?>

These PIs tell 'supporting' XML processors to change the way they output the data. JDOM, by default, supports these PIs. You can turn this support off by adding the following line to your code before you output the XML document:

xmlOutput.getFormat().setIgnoreTrAXEscapingPIs(true);

If you do that, you will get the 'raw' (un-handled) XML in your C:\temp\foo.xml file.

Have a look at it, and you will see that, in the raw (unprocessed) XML, the only text content that is a direct child of the core Element is the two parts: <![CDATA[ and ]]>.

The getValue() method collects text recursively from all descendant content, so it draws on the whole of the following (returning the text only, without the element tags):

<![CDATA[<outer><inner>This has to be<b>escaped</b></inner></outer>]]>  

When you output the file, the outputter used the PIs to modify its output algorithm, so the output is structured differently and has the intended content. Thus, when you read it back, it looks 'right'.

The bottom line is that the JDOM document is being handled correctly, and your use case just happens to be complicated and confusing. This, by the way, is a big issue with the escaping PIs in general, and they have been the cause of much complaint.
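The difference between direct text and recursive text described above can be reproduced without JDOM. The following is only a minimal sketch that uses the JDK's own DOM API as a stand-in: the class name DirectVsRecursiveText, the helpers core, directText and countEscapingPIs, and the hand-written RAW string (modelled on the un-handled XML quoted above, without the namespace) are all assumptions made for illustration. directText plays the role of JDOM's getText(), and DOM's getTextContent() plays the role of getValue().

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Result;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DirectVsRecursiveText {

    // Hand-written stand-in (an assumption) for the un-handled transformation result.
    static final String RAW =
          "<target><core>"
        + "<?javax.xml.transform.disable-output-escaping?>&lt;![CDATA["
        + "<?javax.xml.transform.enable-output-escaping?>"
        + "<outer><inner>This has to be <b>escaped</b></inner></outer>"
        + "<?javax.xml.transform.disable-output-escaping?>]]&gt;"
        + "<?javax.xml.transform.enable-output-escaping?>"
        + "</core></target>";

    /** Parses the XML and returns its first core element. */
    static Element core(String xml) {
        try {
            return (Element) DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("core").item(0);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Concatenates only the text held by the element's own child nodes. */
    static String directText(Element e) {
        StringBuilder sb = new StringBuilder();
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.TEXT_NODE
                    || n.getNodeType() == Node.CDATA_SECTION_NODE) {
                sb.append(n.getNodeValue());
            }
        }
        return sb.toString();
    }

    /** Counts the TrAX escaping PIs that are direct children of the element. */
    static int countEscapingPIs(Element e) {
        int count = 0;
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE
                    && (Result.PI_DISABLE_OUTPUT_ESCAPING.equals(n.getNodeName())
                        || Result.PI_ENABLE_OUTPUT_ESCAPING.equals(n.getNodeName()))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Element core = core(RAW);
        System.out.println("direct text:    " + directText(core));
        System.out.println("recursive text: " + core.getTextContent());
        System.out.println("escaping PIs:   " + countEscapingPIs(core));
    }
}
```

The direct text is just the two marker fragments, while the recursive text also picks up the text inside outer, inner and b, minus the tags. That matches the two results reported at the top of this issue.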

Karthik9479 commented 11 years ago

Thanks for clarifying. Guess we must implement some workaround

rolfl commented 11 years ago

You understand that the issue is that the XSLT process is creating a partially processed document, that is supposed to be completed by something that handles the escape enabling/disabling PI's. Once it is fully processed then the transformation is complete.

JDOM is a system that does the remaining process, so you can use JDOM as your 'workaround'. You should be able to do the following:

import org.jdom2.input.sax.SAXHandler;
import org.jdom2.output.SAXOutputter;
.......

        // Your Code
        doc = new SAXBuilder().build(new StringReader(SOURCE_XML));
        JDOMSource in = new JDOMSource(doc);
        JDOMResult out = new JDOMResult();
        TransformerFactory.newInstance().newTransformer(
                          new StreamSource(new StringReader(TRANSFORM))
                   ).transform(in, out);

        // Output the transformed document through the SAXOutputter which will correctly process
        // any enable/disable escaping ProcessingInstructions which the transformation may create
        SAXHandler handler = new SAXHandler();
        SAXOutputter saxout = new SAXOutputter(handler);
        saxout.output(out.getDocument());
        // the following document should be 'stable', and contain the correct forms of Elements.
        Document fulltransformed = handler.getDocument();

gan-ainm commented 11 years ago

Is there a difference if an XMLOutputter is used instead of the SAXHandler and SAXOutputter? The output XML should be stored as a string later on, so I think the following should also do the trick:

// ...
XMLOutputter xmlOutput = new XMLOutputter();            
xmlOutput.setFormat(Format.getPrettyFormat());
StringWriter stringWriter = new StringWriter();
xmlOutput.output(out.getDocument(), stringWriter);
String result = stringWriter.toString();
// ...

rolfl commented 11 years ago

No difference. One of the major features of JDOM 2.x is that all the outputters use the same Format handling instead of just the XMLOutputter.
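
The serialise-then-reparse idea discussed above can also be sketched with only the JDK, by letting the transformer write straight to a character stream and parsing that text again. This is a rough sketch under stated assumptions, not the approach recommended in this thread: the class name SerializeThenReparse and its helper methods are hypothetical, the stylesheet is a simplified version of the one in the reproduction (namespace declarations dropped), and it relies on the transformer's serializer honouring disable-output-escaping, which XSLT 1.0 leaves optional.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SerializeThenReparse {

    static final String SOURCE_XML =
        "<source><outer><inner>This has to be <b>escaped</b></inner></outer></source>";

    // Simplified stylesheet (an assumption): same templates, no default namespace.
    static final String TRANSFORM =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"/\"><target><xsl:apply-templates/></target></xsl:template>"
        + "<xsl:template match=\"outer\"><core>"
        + "<xsl:text disable-output-escaping=\"yes\"><![CDATA[<![CDATA[]]></xsl:text>"
        + "<xsl:copy-of select=\".\"/>"
        + "<xsl:text disable-output-escaping=\"yes\"><![CDATA[]]]]><![CDATA[>]]></xsl:text>"
        + "</core></xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the transformation and returns the serialized result as a String. */
    static String transformToString(String xml, String xslt) {
        try {
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)))
                    .transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Re-parses the serialized XML and returns the text content of its core element. */
    static String coreText(String serialized) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(serialized)));
            return doc.getElementsByTagName("core").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String serialized = transformToString(SOURCE_XML, TRANSFORM);
        System.out.println(serialized);
        System.out.println(coreText(serialized));
    }
}
```

Because the markup inside core is written out as literal characters between the CDATA markers, the second parse sees it as character data rather than as child elements, which is the same reason the file-based path in the reproduction looked 'right'.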