piprate / json-gold

A JSON-LD processor for Go
Apache License 2.0

Should handle less-than and other disallowed characters #47

Open alexkreidler opened 3 years ago

alexkreidler commented 3 years ago

I have an entry in a JSON-LD file like this:

        {
          "@id": "ex:BOE/code/INSTRUMENTS/LDA>1Y",
          "@type": "skos:Concept",
          "skos:prefLabel": "Medium and long term deposits",
          "skos:notation": "LDA>1Y"
        },

It gets converted by this library into:

<https://example.com/BOE/code/INSTRUMENTS/LDA>1Y> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
<https://example.com/BOE/code/INSTRUMENTS/LDA>1Y> <http://www.w3.org/2004/02/skos/core#hasTopConcept> <https://example.com/BOE/code/INSTRUMENTS> .
<https://example.com/BOE/code/INSTRUMENTS/LDA>1Y> <http://www.w3.org/2004/02/skos/core#notation> "LDA>1Y" .
<https://example.com/BOE/code/INSTRUMENTS/LDA>1Y> <http://www.w3.org/2004/02/skos/core#prefLabel> "Medium and long term deposits" .

As you can see, the LDA>1Y> section is problematic because N-Triples parsers fail at that position. They view the IRI as already being closed.

I'm not sure if the JSON-LD spec has anything to say about this, i.e. whether the > should be URL encoded, or if the serializer should just return an error.

But the library should do one of those two: either serialize it properly or throw an error, rather than silently emit invalid N-Triples.
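A minimal sketch of the "throw an error" option, using only the Go standard library. It assumes the check is done against the N-Triples `IRIREF` production, which forbids characters in the range U+0000 to U+0020 and the characters `<>"{}|^` plus backtick and backslash unless escaped. `checkIRI` is a hypothetical helper for illustration, not part of json-gold's API.

```go
package main

import (
	"fmt"
	"strings"
)

// disallowedInIRIREF lists the printable ASCII characters that the N-Triples
// IRIREF production does not allow unescaped between '<' and '>'.
const disallowedInIRIREF = "<>\"{}|^`\\"

// checkIRI is a hypothetical helper (not part of json-gold). It returns an
// error describing the first character that cannot appear in an N-Triples IRI,
// or nil if no such character is found.
func checkIRI(iri string) error {
	for i, r := range iri {
		if r <= 0x20 || strings.ContainsRune(disallowedInIRIREF, r) {
			return fmt.Errorf("invalid character %q at byte offset %d in IRI %q", r, i, iri)
		}
	}
	return nil
}

func main() {
	for _, iri := range []string{
		"https://example.com/BOE/code/INSTRUMENTS/LDA>1Y",   // raw '>' from the report
		"https://example.com/BOE/code/INSTRUMENTS/LDA%3E1Y", // percent-encoded form
	} {
		if err := checkIRI(iri); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("ok:", iri)
	}
}
```

A serializer could run a check like this on each subject, predicate and object IRI before writing a line, and return the error to the caller instead of emitting the triple.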

Let me know if I can provide more info. Thanks for this awesome library!

kazarena commented 3 years ago

@alexkreidler thank you for reporting the issue. The problem is clear. At first glance, I'm not sure what the correct behaviour should be: whether the library should apply URL escaping, or leave it to the caller, so that the library always expects @id fields in escaped form. The JSON-LD Playground, which is a good reference point, behaves the same way as json-gold.

While I'm looking at a possible solution, I'll give an unhelpful suggestion, based on my experience with writing financial services software 😄: even if the library produced well-formed N-Triples, I'm afraid there would be problems with such identifiers downstream. I would highly recommend using 'safe' identifiers without characters like >, and moving the actual identifier into a separate field.

gkellogg commented 3 years ago

Looking at the IRI syntax in RFC 3987, "LDA>1Y" would be an isegment within an ipath, and ">" is not a valid ipchar, so it must be percent-encoded. The spec depends on the use of valid IRIs, and a processor may reject invalid IRIs or relative IRI references (such as this).

My own parser (available at http://rdf.greggkellogg.net/distiller) is happy to expand this, but doesn't generate N-Triples because of the invalid IRIs.

I used the following as an example:

{
  "@context": {
    "@base": "http://example.com/BOE/code/INSTRUMENTS",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "ex": "https://example.com/",
    "skos:notation": {"@type": "@id"}
  },
  "@id": "ex:BOE/code/INSTRUMENTS/LDA>1Y",
  "@type": "skos:Concept",
  "skos:prefLabel": "Medium and long term deposits",
  "skos:notation": "LDA>1Y"
}

alexkreidler commented 3 years ago

Thanks for both your responses.

For my situation I can do a check to make sure the @id is valid, and either just omit the bad records or think about URL-encoding or shortening those IDs.
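The URL-encoding route could look roughly like the following sketch, applied to @id values before the document is handed to the processor. `escapeIRI` is a hypothetical helper, not something json-gold provides, and it assumes that only the characters the N-Triples `IRIREF` production rejects need encoding; existing `%XX` escapes and non-ASCII bytes are left untouched.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeIRI is a hypothetical helper (not part of json-gold). It
// percent-encodes bytes up to 0x20 and the characters <>"{}|^`\ and copies
// every other byte through unchanged, including '%' and UTF-8 sequences.
func escapeIRI(iri string) string {
	var b strings.Builder
	for i := 0; i < len(iri); i++ {
		c := iri[i]
		if c <= 0x20 || strings.IndexByte("<>\"{}|^`\\", c) >= 0 {
			// Write the byte as an uppercase two-digit hex escape, e.g. '>' as %3E.
			fmt.Fprintf(&b, "%%%02X", c)
			continue
		}
		b.WriteByte(c)
	}
	return b.String()
}

func main() {
	fmt.Println(escapeIRI("https://example.com/BOE/code/INSTRUMENTS/LDA>1Y"))
}
```

Only the IRI needs this treatment. The skos:notation value "LDA>1Y" is a plain literal in the original record, and ">" is allowed inside an N-Triples string literal, so it can stay as it is.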

It would be interesting to see if json-gold could do a check to make sure it's not serializing invalid IRIs. RDFLib does this. Of course, we wouldn't want it to hurt performance, so maybe it could be optional?