frmichel closed this 4 years ago
What's the motivation for this?
Without this change, parsing JSON documents where some strings contain double-quotes or "\n" fails.
You mean serializing not parsing, right? Could you please add a test case that would currently fail but pass after the change? Thanks
Hi Markus,
Yes, you are right, I answered without double-checking. The problem occurs when I serialize to N-Quads: if a string in the JSON document contains double-quotes or newlines, the generated RDF cannot be parsed properly.
I found this problem with the Macaulay library Web API. Try this: https://search.macaulaylibrary.org/catalog.json?action=new_search&searchField=animals&sort=upload_date_desc&mediaType=a&taxonCode=t-11034463
The result contains two strings that start like this:
NOTES: LNS notes: Filtered with Zsys z-Q2, lo shelf, -95 dB, 500Hz. \u0002Confidence interval assigned by MLNS. 06Oct2004MLM. DAT case notes: \"STAR 2003
Note the escaped double-quote (\") before the word "STAR". Then, I parse the JSON content with the simple profile below:
{
  "@context": {
    "@base": "http://sms.i3s.unice.fr/item/",
    "@vocab": "http://sms.i3s.unice.fr/terms/api/",
    "mediaUrl": { "@type": "@id" },
    "thumbnailUrl": { "@type": "@id" },
    "specimenUrl": { "@type": "@id" }
  }
}
In the serialized result I get this:
_:b34 <http://sms.i3s.unice.fr/terms/api/comments>
"NOTES: LNS notes: Filtered with Zsys z-Q2, lo shelf, -95 dB, 1kHz. ?Confidence interval assigned by MLNS. 08Oct2004MLM. DAT case notes: "STAR 2003; BOW #17; 01:08:07-01:14:02; #614 (Delphinus d.)." Julie Oswald's Spreadsheet Notes: "Species: Delphinus delphis; Tape: DSJ BOW17; Track: 4; Comments: 1:08:07 to 1:14:02 on DAT." David Starr Jordan 2003 Final Cruise Report, http://swfsc.nmfs.noaa.gov/prd/PROJECTS/star/default.htm: "Study Area: The eastern tropical Pacific Ocean (ETP)". SEAOCEAN." .
You can see that the escaped double-quote has been replaced with an unescaped double-quote, which gets confused with the double-quotes that delimit the whole string. The fix I propose re-escapes the double-quote after it has been "de-escaped". There may be a cleverer way to do it, so that the double-quote is not "de-escaped" in the first place.
The same type of issue occurs with newlines (CR), although I can no longer find the example that led me to this fix.
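To illustrate the idea, here is a minimal Python sketch of the re-escaping step. It is not the library's actual code: the helper names `escape_nquads_literal` and `to_nquads_literal` are hypothetical, and it only handles the four characters that N-Quads does not allow unescaped inside a quoted literal (backslash, double-quote, LF and CR).

```python
import json

# Hypothetical helpers for illustration only, not the library's implementation.
# Characters that must be escaped inside an N-Quads quoted literal.
_ESCAPES = {
    "\\": "\\\\",  # backslash
    '"': '\\"',    # double-quote
    "\n": "\\n",   # line feed
    "\r": "\\r",   # carriage return
}


def escape_nquads_literal(value: str) -> str:
    """Escape a literal's lexical form one character at a time.

    Mapping each character once avoids double-escaping the backslashes
    that the other replacements introduce.
    """
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def to_nquads_literal(value: str) -> str:
    """Wrap the escaped lexical form in the delimiting double-quotes."""
    return '"' + escape_nquads_literal(value) + '"'


# JSON decoding removes the escape, so the value holds a raw double-quote.
decoded = json.loads(r'"DAT case notes: \"STAR 2003"')

print(to_nquads_literal(decoded))
print(to_nquads_literal("line one\nline two"))
```

With this, the raw double-quote and the newline are written back as `\"` and `\n`, so the literal stays on one line and its delimiters remain unambiguous.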
Regards, Franck.
Added a test and merged in efc9a8b171f1e0615fa3a87cbcfa0f86b2d6965f
The second PR. Cheers, Franck.