Blank nodes in the test suite

pjohnston-wiley commented 6 years ago

A number of tests appear to be excessively prescriptive about the use of blank nodes, which will make some implementations fail.

For example, frame-0010-out.jsonld. This specifies a result with:

"dc:creator": {
      "@id": "_:b0",
      "foaf:name": "John Doe"
    }

This implicitly mandates that compliant json-ld processors use a specific algorithm _:b# to generate blank nodes. This isn't the only way to generate blank nodes: hashing the node could be a viable alternative, as could a UUID, for example.

The reason i came on this one is that the current incarnation of pyld is failing this test, because it defaults pruneBlankNodeIdentifiers to true, which is correct, so the test is actually wrong. I also realize this is being removed entirely for 1.1, so the test needs to get amended because of that.

However, the problem will still be there for blank nodes that are referenced more than once (i am not sure if a test for this already exists). There may need to be an instruction to test suite implementers that singleton blank node identifiers should not be tested.

It also raises the issue that the suite is not testing processor options other than the default, which seems to be a gap, unless these are all going away in 1.1? One possibility I suggested in issue 641 is allowing processor options to be embedded in the frame json itself. Alternatively, this could be an optional file in the test suite.

gkellogg commented 6 years ago

The API spec is explicit about how to generate blank node identifiers. You're correct that semantically, the identifiers are not important, but because the generated JSON needs to be compared as structurally, the spec proscribes an algorithm and naming scheme for creating new blank node identifiers. This really just comes into play when flattening (and generating RDF).

Note that tests are marked with different specification levels and different version levels, so that both the 1.0 and 1.1 behaviors can be validated.

The suite does allow for different options to be specified, and does so in many cases. If there is missing coverage, please suggest a new test, or create a pull request.

pjohnston-wiley commented 6 years ago

I think i was looking in the wrong place in the spec, i hadn't thought to look at flattening. The prescriptive approach makes sense in retrospect.

I am actually trying to put together a test to cover a bug in pyld, which is how i got here.

I admit finding stuff in the tests to see what is already covered is a bit challenging, especially finding the manifest files buried among the tests (if you are using an IDE as i am this is a particular challenge). Could we consider reorganizing into subfolders? I would be happy to take a stab, unless spec implementers can't deal with subfolders? I hadn't realized the manifest files were a lot more than manifests and include processing instructions, i might have realized that more readily if i could see them more obviously.

On the issue at hand and in the spirit of devising a useful test, i was playing with the playground and this sample:

{
  "@context": {
  "@vocab": "http://example2.com/",
  "fred": "http://example1.com/fred",
  "aaa": "http://example1.com/"
  },
  "fred": {
    "@id": "_:b1"
  },
  "barney": {
  },
  "wilma": {
    "@id": "_:b1"
  },
  "betty": {
    "@id": "_:b0"
  },
  "http://example1.com/bob": {
  }
}

The intent here was to understand how blank node identifiers creep into the json-ld processors. I also figured it would be a good test to detect lazy implementations that fail to rewrite blank node identifiers. The good news is that the dev playground, the outputs there appear to conform with the algorithm descriptions, so any such tests should be benign.

The one that confused me was expansion, since all the others appear to sort after the algorithm has been applied, where expansion sorts first and then expands. I am not sure why it needs to sort at all, or why it shouldn't sort after the fact. It doesn't make a whole lot of difference since expansion is mostly a waystation, but it would be more consistent if the sort happened last.

If i am troubleshooting, it is a lot easier if the results either preserve the sort order of the input, or they sort the output once it has been produced. In terms of processing overhead, i would opt for the former, in terms of intuitiveness for the latter.

This is what the dev playground does:

[
  {
    "http://example2.com/barney": [
      {}
    ],
    "http://example2.com/betty": [
      {
        "@id": "_:b0"
      }
    ],
    "http://example1.com/fred": [
      {
        "@id": "_:b1"
      }
    ],
    "http://example1.com/bob": [
      {}
    ],
    "http://example2.com/wilma": [
      {
        "@id": "_:b1"
      }
    ]
  }
]

This is what sorting after the fact would produce:

[{
    "http://example1.com/bob": [{}],
    "http://example1.com/fred": [{
        "@id": "_:b0"
    }],
    "http://example2.com/barney": [{}],
    "http://example2.com/betty": [{
        "@id": "_:b1"
    }],
    "http://example2.com/wilma": [{
        "@id": "_:b0"
    }]
}]

gkellogg commented 5 years ago

@pjohnston-wiley I think this was resolved by an update to the Generate Blank Node Identifier algorithm to only suggest how blank node identifiers are made, rather than mandate, and by changes to the test suite README for testing the flatten algorithm.

json-ld / json-ld.org

Blank nodes in the test suite #661