lanthaler / JsonLD

JSON-LD processor for PHP
MIT License

Escape double-quotes and "\n" in JSON strings #86

Closed frmichel closed 4 years ago

frmichel commented 6 years ago

The second PR. Cheers, Franck.

lanthaler commented 5 years ago

What's the motivation for this?

frmichel commented 5 years ago

Without this change, parsing JSON documents where some strings contain double-quotes or "\n" fails.

lanthaler commented 5 years ago

You mean serializing, not parsing, right? Could you please add a test case that currently fails but would pass after the change? Thanks.

frmichel commented 5 years ago

Hi Markus,

Yes, you are right, I answered without double-checking. The problem occurs when I serialize to N-Quads: if a string in the JSON document contains double-quotes or newlines, the generated RDF cannot be parsed properly.

I found this problem with the Macaulay Library web API. Try this: https://search.macaulaylibrary.org/catalog.json?action=new_search&searchField=animals&sort=upload_date_desc&mediaType=a&taxonCode=t-11034463

The result contains two strings that start like this:

NOTES: LNS notes: Filtered with Zsys z-Q2, lo shelf, -95 dB, 500Hz.  \u0002Confidence interval assigned by MLNS. 06Oct2004MLM. DAT case notes: \"STAR 2003

Note the escaped double-quote (\") before the word "STAR". Then, I parse the JSON content with the simple profile below:

{
  "@context": {
    "@base": "http://sms.i3s.unice.fr/item/",
    "@vocab": "http://sms.i3s.unice.fr/terms/api/",
    "mediaUrl": { "@type": "@id" },
    "thumbnailUrl": { "@type": "@id" },
    "specimenUrl": { "@type": "@id" }
  }
}

In the serialized result I get this:

_:b34 <http://sms.i3s.unice.fr/terms/api/comments> 
  "NOTES: LNS notes: Filtered with Zsys z-Q2, lo shelf, -95 dB, 1kHz.  ?Confidence interval assigned by MLNS. 08Oct2004MLM. DAT case notes: "STAR 2003; BOW #17; 01:08:07-01:14:02; #614 (Delphinus d.)." Julie Oswald's Spreadsheet Notes: "Species: Delphinus delphis; Tape: DSJ BOW17; Track: 4; Comments: 1:08:07 to 1:14:02 on DAT." David Starr Jordan 2003 Final Cruise Report, http://swfsc.nmfs.noaa.gov/prd/PROJECTS/star/default.htm: "Study Area: The eastern tropical Pacific Ocean (ETP)".  SEAOCEAN." .

You can see that the escaped double-quote has been replaced with an unescaped double-quote, which cannot be told apart from the double-quotes delimiting the whole string. The fix I propose re-escapes the double-quote after it has been unescaped. There may be a cleverer way to do it, so that the double-quote is not unescaped in the first place.

The same type of issue occurs with newlines ("\n"), although I can no longer find the example that prompted this fix.
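To illustrate the idea, here is a minimal sketch in Python. It is not the library's PHP code and not the actual patch; the function names `escape_nquads_literal` and `to_nquads_literal` are hypothetical, and the sample string is a shortened stand-in for the API response. It shows that JSON decoding removes the backslash in front of the double-quote, and that the serializer then has to escape backslashes, double-quotes, and line breaks again before wrapping the value in quotes.

```python
import json

# Characters that must be escaped inside a quoted N-Quads string literal.
# A per-character lookup avoids escaping an already-escaped backslash twice.
_ESCAPES = {
    "\\": "\\\\",
    '"': '\\"',
    "\n": "\\n",
    "\r": "\\r",
}


def escape_nquads_literal(value: str) -> str:
    """Escape backslash, double-quote, LF and CR in a literal's lexical form."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def to_nquads_literal(value: str) -> str:
    """Wrap the escaped value in the delimiting double-quotes."""
    return '"' + escape_nquads_literal(value) + '"'


# JSON decoding turns \" into a bare double-quote and \n into a real newline.
decoded = json.loads(r'"DAT case notes: \"STAR 2003\nBOW #17"')

# Naive serialization: the inner quote ends the literal early and the newline
# splits the statement across two lines.
naive = '"' + decoded + '"'

# Escaped serialization: a single-line literal with the inner quote protected.
safe = to_nquads_literal(decoded)
print(safe)
```

Escaping at serialization time, rather than trying to keep the JSON escapes intact, seems the more robust choice, since the in-memory string should hold the real characters regardless of the input or output syntax.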

Regards, Franck.

lanthaler commented 4 years ago

Added a test and merged in efc9a8b171f1e0615fa3a87cbcfa0f86b2d6965f