stleary / JSON-java

A reference implementation of a JSON package in Java.
http://stleary.github.io/JSON-java/index.html

XML.toJSONObject - xml content trimmed #695

Open ilmagowalter opened 2 years ago

ilmagowalter commented 2 years ago

When creating a JSONObject from an XML string like

<?xml version="1.0" encoding="utf-8"?><tagA> : </tagA>

the spaces at the beginning and end of the element content are trimmed:

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><tagA>  :</tagA>";
        JSONObject jo = XML.toJSONObject(xml, true);
        System.out.println("Test 1");
        System.out.println(jo.toString());

        xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><tagA>  :  </tagA>";
        jo = XML.toJSONObject(xml, true);
        System.out.println("Test 2");
        System.out.println(jo.toString());

        xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><tagA>:  </tagA>";
        jo = XML.toJSONObject(xml, true);
        System.out.println("Test 3");
        System.out.println(jo.toString());
    }

output

Test 1
{"tagA":":"}
Test 2
{"tagA":":"}
Test 3
{"tagA":":"}

Is it possible to add a parameter to avoid trimming?

I think the method involved is nextContent() in XMLTokener.java.

iashok22 commented 2 years ago

@ilmagowalter, is there any reason why we need to retain leading/trailing spaces?

ilmagowalter commented 2 years ago

I'm facing this use case:

I receive an XML file with a field like <tagA> : </tagA>. This tag is defined in the XSD schema like this:

                              <xs:element name="tagA" minOccurs="0"> 
                                 <xs:annotation> 
                                    <xs:appinfo> 
                                    <RicSDO:exampleValues> 
                                          <RicSDO:example value="18:29"/> 
                                       </RicSDO:exampleValues> 
                                    </xs:appinfo> 
                                 </xs:annotation> 
                                 <xs:simpleType> 
                                    <xs:restriction base="xs:string"> 
                                       <xs:length value="5"/> 
                                    </xs:restriction> 
                                 </xs:simpleType> 
                              </xs:element> 

So the parse is successful. I work with this file, transform it to JSON, and store it in a NoSQL database. One of the possibilities is to export the data to XML again. If I store {"tagA": ":" } in JSON, I will export <tagA>:</tagA>, and validation of this element against the XSD fails; I have to store {"tagA": " : " }.

iashok22 commented 2 years ago

@ilmagowalter Thanks for the details.

@stleary Can I add a new method to retain spaces in the string while parsing? Something like new JSONObject().parse(String input, boolean retainSpace)

ilmagowalter commented 2 years ago

Maybe the parameter should be added to this method:

public static JSONObject toJSONObject(String string, boolean keepStrings)

stleary commented 2 years ago

The easiest fix is to wrap the " : " content in a CDATA section.
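As an illustration of the CDATA suggestion, here is a minimal sketch of a helper that wraps element text in a CDATA section before the XML is handed to the parser. The class name `CdataWrapper` and the method `wrap` are hypothetical and not part of JSON-java; the sketch only builds the string and assumes, as the comment above suggests, that content inside CDATA is not trimmed.

```java
// Hypothetical helper, not part of JSON-java: wraps raw text in a CDATA section.
final class CdataWrapper {
    private CdataWrapper() {}

    /** Wraps text in CDATA, splitting any "]]>" so it cannot end the section early. */
    static String wrap(String text) {
        // "]]>" may not appear inside CDATA: close after "]]" and reopen before ">".
        String safe = text.replace("]]>", "]]]]><![CDATA[>");
        return "<![CDATA[" + safe + "]]>";
    }

    public static void main(String[] args) {
        String xml = "<tagA>" + wrap(" : ") + "</tagA>";
        System.out.println(xml); // prints <tagA><![CDATA[ : ]]></tagA>
    }
}
```

This only helps when the producer of the XML can be changed; if the file arrives from a third party, the content is already unwrapped by the time it reaches the parser.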

If that is not possible, you can try adding a flag to XMLParserConfiguration. This class is the preferred mechanism for special cases in XML parsing. The code might then look something like this:

    XMLParserConfiguration config =
       new XMLParserConfiguration().withKeepTrimmedSpaces(true);
    JSONObject jsonObject = XML.toJSONObject(xmlStr, config);

Then you will also need to update XMLTokener.nextContent() and pass in the config object or at least the flag. JSONML uses this method too, so take care not to break that code. This approach means that none of the content data in the XML doc will have spaces trimmed, which may not be what you want.
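To make the idea of passing a flag into the content-reading loop concrete, here is a simplified stand-in written against a plain String. It is a sketch, not the real XMLTokener: the class `ContentScanner` and the `keepWhitespace` flag are hypothetical, and entity handling is omitted.

```java
// Simplified stand-in for a content-reading loop; names here are hypothetical.
final class ContentScanner {
    static final String LT = "<";

    private final String source;
    private final boolean keepWhitespace;
    private int pos = 0;

    ContentScanner(String source, boolean keepWhitespace) {
        this.source = source;
        this.keepWhitespace = keepWhitespace;
    }

    /** Returns the text before the next '<', LT when positioned at a '<', or null at end of input. */
    String nextContent() {
        if (!keepWhitespace) {
            // Default behaviour: skip leading whitespace before looking at the content.
            while (pos < source.length() && Character.isWhitespace(source.charAt(pos))) {
                pos++;
            }
        }
        if (pos >= source.length()) {
            return null;
        }
        if (source.charAt(pos) == '<') {
            pos++;
            return LT;
        }
        int start = pos;
        while (pos < source.length() && source.charAt(pos) != '<') {
            pos++; // stop before '<' so the caller can read the tag next
        }
        String text = source.substring(start, pos);
        return keepWhitespace ? text : text.trim();
    }
}
```

With the flag off, `new ContentScanner("  :  </tagA>", false).nextContent()` yields `":"`; with it on, the same input yields `"  :  "`. Note that with the flag on, whitespace-only runs between tags are also returned as content.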

ilmagowalter commented 2 years ago

To test only the nextContent() method, I made this change:

    public Object nextContent() throws JSONException {
        char         c;
        StringBuilder sb;
//        do {
            c = next();
//        } while (Character.isWhitespace(c));
        if (c == 0) {
            return null;
        }
        if (c == '<') {
            return XML.LT;
        }
        sb = new StringBuilder();
        for (;;) {
            if (c == 0) {
//                return sb.toString().trim();
                return sb.toString();
            }
            if (c == '<') {
                back();
//                return sb.toString().trim();
                return sb.toString();
            }
            if (c == '&') {
                sb.append(nextEntity(c));
            } else {
                sb.append(c);
            }
            c = next();
        }
    }

It seems to work, but in my project (not written by me), before calling XML.toJSONObject, I have a transformer (this.transformer = TransformerFactory.newInstance().newTransformer();) that transforms a Node to XML like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?><CodiceRegione>
    190</CodiceRegione>

the output is

{"CodiceRegione":"\r\n 190"}

Unfortunately, the content of the original tag (in the source XML, before the various transformations to Node and back) was 190, so the problem is probably in the transformer, but I don't know how it works. Any ideas?

stleary commented 2 years ago

Character.isWhitespace() filters more than just the space char. See https://www.geeksforgeeks.org/character-iswhitespace-method-in-java-with-examples/ You could try retaining the call to isWhitespace() while allowing space chars that are contiguous with the content. For example, this string contains 8 whitespace chars at the beginning and end: " \r\n : \r\n "
But the parsed string should contain only 2 whitespace chars: " : ".
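One possible reading of this suggestion, sketched as a standalone helper: strip all surrounding whitespace, then extend the kept range back over any plain space characters that directly touch the content. The class and method names are hypothetical; this is not code from JSON-java.

```java
// Hypothetical helper illustrating "keep only spaces contiguous with the content".
final class AdjacentSpaceTrimmer {
    private AdjacentSpaceTrimmer() {}

    static String trimKeepingAdjacentSpaces(String s) {
        int start = 0;
        int end = s.length();
        // Locate the non-whitespace core of the string.
        while (start < end && Character.isWhitespace(s.charAt(start))) {
            start++;
        }
        if (start == end) {
            return ""; // nothing but whitespace
        }
        while (end > start && Character.isWhitespace(s.charAt(end - 1))) {
            end--;
        }
        // Re-include plain spaces that directly precede or follow the core.
        while (start > 0 && s.charAt(start - 1) == ' ') {
            start--;
        }
        while (end < s.length() && s.charAt(end) == ' ') {
            end++;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String parsed = trimKeepingAdjacentSpaces(" \r\n : \r\n ");
        System.out.println("[" + parsed + "]"); // prints [ : ]
    }
}
```

A limitation of this sketch: indentation spaces that follow a line break and run up to the content, as in the CodiceRegione example, would still be retained, because they are contiguous with the content.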

stleary commented 1 year ago

Closing due to lack of activity. If you think it should be reopened, please post here.

Brian-McG commented 12 months ago

@stleary I have a similar use-case as raised in this issue where I need to retain any existing whitespace between the XML and JSON so they are as close as possible for audit purposes.

Would you accept a pull-request to XMLParserConfiguration to add a new flag if I added it and wired it through to nextContent()?

stleary commented 12 months ago

@Brian-McG Sure, this would be allowed. Please ensure in your implementation:

keatontaylor10 commented 12 months ago

I have been working on the above with @Brian-McG. We implemented a change in nextContent() which, when the configuration says not to trim whitespace, skips the trimming of the string like so:

...
       do {
            c = next();
        } while (Character.isWhitespace(c) && configuration.shouldTrimWhiteSpace());
        if (c == 0) {
            return null;
        }
        if (c == '<') {
            return XML.LT;
        }
        sb = new StringBuilder();
        for (;;) {
            if (c == 0) {
                return sb.toString().trim();
            }
            if (c == '<') {
                back();
                if (configuration.shouldTrimWhiteSpace()) {
                    return sb.toString().trim();
                } else return sb.toString();
            }
...

We ran into an issue where whitespace in between tags is no longer being trimmed and ends up inside of the returned JSON object. An example of such an input:

                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"+
                        "<addresses>\n"+
                        "   <address>\n"+
                        "      <name> Sherlock Holmes </name>\n"+
                        "   </address>\n"+
                        "</addresses>";

And the result is

{"addresses":{"address":{"name":" Sherlock Holmes ","content":["\n      ","\n   "]},"content":["\n   ","\n"]}}

To address this, a method has been added which executes before jsonObject is accumulated onto context in the parse() method. This method removes any entry whose key is the string returned by config.getcDataTagName() and whose value is only whitespace. I have forked this repo and pushed my change to a branch; the diff can be viewed here: https://github.com/keatontaylor10/JSON-java/commit/218d00ecf0e331796b2aafb22172b0243e4e1c44. All tests pass and some more have been added. I wanted to get some feedback on whether this implementation will work before adding more test cases and creating a pull request.
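The cleanup step described above can be sketched roughly as follows. This is an illustration only, not the code from the linked commit: it uses a plain `java.util.Map` as a stand-in for JSONObject and a `List` as a stand-in for JSONArray, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: Map and List stand in for JSONObject and JSONArray.
final class WhitespaceContentCleaner {
    private WhitespaceContentCleaner() {}

    /** Removes the entry under contentKey when its value is only whitespace. */
    static void removeWhitespaceOnlyContent(Map<String, Object> node, String contentKey) {
        if (node.containsKey(contentKey) && isWhitespaceOnly(node.get(contentKey))) {
            node.remove(contentKey);
        }
    }

    /** True for a blank string, or for a list whose items are all blank strings. */
    static boolean isWhitespaceOnly(Object value) {
        if (value instanceof String) {
            return ((String) value).trim().isEmpty();
        }
        if (value instanceof List) {
            for (Object item : (List<?>) value) {
                if (!isWhitespaceOnly(item)) {
                    return false;
                }
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("name", " Sherlock Holmes ");
        address.put("content", Arrays.asList("\n      ", "\n   "));
        removeWhitespaceOnlyContent(address, "content");
        System.out.println(address.keySet()); // prints [name]
    }
}
```

The trade-off in this approach is that an element whose real content is only whitespace is also dropped, so it cannot be distinguished from formatting between tags.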

stleary commented 12 months ago

@keatontaylor10 The parser code can be tough to update without introducing unintended side effects, so running into problems like this should be expected. I'm not sure of the best approach to get the behavior you want without the side effects.

keatontaylor10 commented 12 months ago

I have created a PR to add this feature https://github.com/stleary/JSON-java/pull/832