kit-data-manager / wap-server

requesting an annotation via HTTP yields broken string escaping #7

Open ben-tinc opened 9 months ago

ben-tinc commented 9 months ago

Current behavior

Consider the following string:

"{\"text\":\"Hello \\\"world\\\"!\"}"

It is a valid JSON serialization of a JavaScript object and is therefore usable with JavaScript's JSON.parse(). Specifically, it is a serialization of the object

{ text: 'Hello "world"!' }
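This can be checked quickly in Node.js or a browser console. The snippet below is only a sketch; the variable names are illustrative and not part of wap-server.

```javascript
// The string from above, written as a JavaScript string literal.
// JavaScript and JSON share these escape rules, so the literal can be copied as-is.
const serialized = "{\"text\":\"Hello \\\"world\\\"!\"}";

// The actual characters of the string: {"text":"Hello \"world\"!"}
console.log(serialized);

// Parsing yields the object shown above.
const parsed = JSON.parse(serialized);
console.log(parsed.text); // Hello "world"!

// Serializing the object again reproduces the original string.
const roundTripped = JSON.stringify(parsed);
console.log(roundTripped === serialized); // true
```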

Being a regular string, the serialization is also a valid value for the TextualBody of a web annotation. Using the wap-server's webapp, we can easily create the following annotation:

{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "type": "Annotation",
  "body":  {
    "type": "TextualBody",
    "value": "{\"text\":\"Hello \\\"world\\\"!\"}",
    "purpose": "tagging"
  },
  "target": "http://example.com/page1"
}

Let us assume the resulting annotation has the URI "http://localhost:8889/wap/TestContainer/a218953d-192b-4074-96d2-be3f33d07ec2". Accessing it in the browser (and choosing "raw" output) yields the following:

{
  "@context" : "http://www.w3.org/ns/anno.jsonld",
  "id" : "http://localhost:8889/wap/TestContainer/a218953d-192b-4074-96d2-be3f33d07ec2",
  "type" : "Annotation",
  "created" : "2024-02-01T11:07:27Z",
  "modified" : "2024-02-01T11:07:27Z",
  "body" : {
    "type" : "TextualBody",
    "value" : "{\"text\":\"Hello \"world\"!\"}",
    "purpose" : "tagging"
  },
  "target" : "http://example.com/page1"
}

As we can see, the value of the TextualBody has changed: the inner quotes around world, originally written as \\\", are now escaped only once, as \". Incidentally, this also means that the decoded string is no longer a valid JSON serialization.
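The effect can be reproduced without a running server. The following sketch embeds the "value" member exactly as returned above in a minimal JSON document (the reduced document and all variable names are assumptions made for illustration) and shows that the outer document still parses while the decoded value does not.

```javascript
// The "value" member as returned by the server, wrapped in a minimal JSON document.
// String.raw keeps the backslashes literally, so this is the raw response text.
const rawResponse = String.raw`{"value" : "{\"text\":\"Hello \"world\"!\"}"}`;

// The outer document is still valid JSON ...
const returnedValue = JSON.parse(rawResponse).value;
console.log(returnedValue); // {"text":"Hello "world"!"}

// ... but the decoded value is not, because the inner quotes lost their backslashes.
let parseError = null;
try {
  JSON.parse(returnedValue);
} catch (err) {
  parseError = err;
}
console.log(parseError instanceof SyntaxError); // true
```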

Let's instead use SPARQL to query the same annotation:

PREFIX oa: <http://www.w3.org/ns/oa#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?value {GRAPH ?g {
        <http://localhost:8889/wap/TestContainer/a218953d-192b-4074-96d2-be3f33d07ec2> a oa:Annotation.
        <http://localhost:8889/wap/TestContainer/a218953d-192b-4074-96d2-be3f33d07ec2> oa:hasBody ?b1 .
        ?b1 oa:hasPurpose oa:tagging .
        ?b1 rdf:value ?value.      
  }
}

Store the above query as query.sparql and send it to the SPARQL endpoint with, e.g.:

curl -X POST "http://localhost:3330/wap/sparql" -H "Content-Type: application/sparql-query" -H "Accept:application/sparql-results+json"  -d "@query.sparql"

The result is:

{
  "head": {
    "vars": [ "value" ]
  } ,
  "results": {
    "bindings": [
      {
        "value": { "type": "literal" , "value": "{\"text\":\"Hello \\\"world\\\"!\"}" }
      }
    ]
  }
}

So we can see that the correct, unchanged string is still available in the triple store. However, accessing it via HTTP modifies the string, breaking the escaping in the process.

Similarly, string escaping is broken when an annotation contains escaped newlines. I would imagine that every kind of escaping is liable to be affected, but " and \n are the ones we are encountering in practice.

Expected behavior

String values should get retrieved without modification.
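As a rough illustration of that expectation, a lossless JSON round trip looks like the sketch below. The helper survivesRoundTrip is hypothetical and uses JavaScript's built-in JSON functions in place of the server's serializer, so it says nothing about how wap-server is implemented.

```javascript
// Hypothetical helper: serialize a TextualBody the way a JSON response should,
// then parse it again and compare the value with the original.
function survivesRoundTrip(value) {
  const responseText = JSON.stringify({ type: "TextualBody", value, purpose: "tagging" });
  return JSON.parse(responseText).value === value;
}

// The two kinds of escaping mentioned in this issue: quotes and newlines.
const withQuotes = JSON.stringify({ text: 'Hello "world"!' });
const withNewline = JSON.stringify({ text: "line one\nline two" });

console.log(survivesRoundTrip(withQuotes));  // true
console.log(survivesRoundTrip(withNewline)); // true
```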

Thanks for your consideration. Please let me know if you need more details.

ben-tinc commented 7 months ago

I can confirm that https://github.com/GGoetzelmann/wap-server/commit/a5be1c7bcef8979e74062e7c48bc3b482d3799bf fixes all the issues we are seeing with escaping.

Thanks a lot! :)