w3c / json-ld-api

JSON-LD 1.1 Processing Algorithms and API Specification
https://w3c.github.io/json-ld-api/

relative iri compaction #435

Open roptat opened 4 years ago

roptat commented 4 years ago

Hi,

In compaction test 0066, some IRIs are to be compacted relative to the document base IRI. The base IRI is https://w3c.github.io/json-ld-api/tests/compact/0066-in.jsonld and an example of an IRI to compact is https://w3c.github.io/absolute. The expected result is ../../../absolute; however, /absolute seems to be valid and more compact. Why not compact IRIs more when possible? One could simply choose the shorter of the path of the IRI to compact and the relative path with '..'s.

roptat commented 4 years ago

also, test 0076 expects http://example.com/api/things/1 to be compacted as 1 when the base is itself. Wouldn't the empty string work too and be more compact?

gkellogg commented 4 years ago

> Hi,
>
> In compaction test 0066, some IRIs are to be compacted relative to the document base IRI. The base IRI is https://w3c.github.io/json-ld-api/tests/compact/0066-in.jsonld and an example of an IRI to compact is https://w3c.github.io/absolute. The expected result is ../../../absolute; however, /absolute seems to be valid and more compact. Why not compact IRIs more when possible? One could simply choose the shorter of the path of the IRI to compact and the relative path with '..'s.

IRI compaction for document-relative IRIs defaults to doing Relative IRI reference resolution as described in RFC3986/7. There is a normalization algorithm that reduces such IRIs to their minimal form, but it is not called for in these algorithms.

The Relative Resolution algorithm, which must be used, is described in RFC3986 Section 5.2. There are a number of subtleties, and the test suite has even recently had more tests introduced to probe some of the corner cases.

RDF syntaxes (such as JSON-LD) treat IRI/URIs which may have an equivalent normalized representation as different, so introducing normalization as part of the IRI compaction process would violate this.
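
As a quick check of the point under discussion, both candidate references can be resolved against the base from test 0066. This is only a sketch: it uses Ruby's standard `uri` library as a stand-in for an RFC3986 Section 5.2 resolver (an assumption; a JSON-LD processor may ship its own), and `resolve` is a hypothetical helper name.

```ruby
require 'uri'

# Hypothetical helper: resolve a relative reference against a base IRI
# using the standard library's implementation of reference resolution.
def resolve(base, ref)
  URI.join(base, ref).to_s
end

# Base IRI of compaction test 0066.
BASE_0066 = 'https://w3c.github.io/json-ld-api/tests/compact/0066-in.jsonld'

puts resolve(BASE_0066, '../../../absolute') # => https://w3c.github.io/absolute
puts resolve(BASE_0066, '/absolute')         # => https://w3c.github.io/absolute
```

Both forms resolve to the same absolute IRI, so the choice between them comes down to what the compaction algorithm and test suite expect rather than to validity.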

gkellogg commented 4 years ago

> also, test 0076 expects http://example.com/api/things/1 to be compacted as 1 when the base is itself. Wouldn't the empty string work too and be more compact?

This has been in since 1.0, and it's hard to find any explicit requirement to do this, but it can be inferred from looking at RFC3986 5.2.3 Merge Paths, for which this is the opposite operation. In that case, the portion of the path after the last "/" is discarded, which is what's going on here.

Also, intuitively, if you had a base of "http://example.com/api/things/1", you could either compact that to "" or "1", but if you wanted to compact "http://example.com/api/things/2", it could only compact to "2", which would be inconsistent.
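
The Merge Paths behaviour can be observed in the forward direction. The sketch below again assumes Ruby's standard `uri` library as the resolver; `resolve` and `BASE_0076` are illustrative names, not part of any JSON-LD implementation.

```ruby
require 'uri'

# Hypothetical helper: resolve a relative reference against a base IRI.
def resolve(base, ref)
  URI.join(base, ref).to_s
end

# Base IRI from compaction test 0076.
BASE_0076 = 'http://example.com/api/things/1'

# Merge Paths discards the last segment of the base path ("1")
# and appends the reference in its place.
puts resolve(BASE_0076, '1') # => http://example.com/api/things/1
puts resolve(BASE_0076, '2') # => http://example.com/api/things/2
```

Treating the base and its siblings the same way means each is compacted to its own last path segment.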

roptat commented 4 years ago

I don't really understand the answer. When implementing the expansion algorithm, I indeed saw that the RFC3986 Relative Resolution algorithm was used, and I implemented it. However, IIUC, this algorithm is one-way only: it takes a base and a relative reference, and gives you a new absolute IRI. However, in the IRI compaction algorithm, we want to do the reverse: get a relative IRI reference from an absolute IRI. I think the exact algorithm used to perform that operation is missing from the specification, hence my questions.

gkellogg commented 4 years ago

I agree that the spec could be more explicit in how to perform this operation, but it is quite late to introduce such an algorithm as we're ending the Candidate Recommendation period. The adherence to the test suite is what determines conformance, and sometimes this requires "reading between the lines". We can defer adding to spec text to a future version, which could come fairly soon after the release of the final recommendation.

dlongley commented 4 years ago

@roptat,

> Wouldn't the empty string work too and be more compact?

The goal of compaction is not to make the data size as small as possible without regard for its semantics; it is not "compression" like gzip. Rather, compaction enables the data to be more readily parsed and understood by humans or programs that are expecting it to conform to a certain context.

timothee-haudebourg commented 4 years ago

I think I'm suffering from the same imprecision. One particular case that I don't understand is that http://example.com/api/things/1 is compacted into 1, but http://example.com/api/things/1#foo is compacted into #foo. Following your logic, isn't it inconsistent with http://example.com/api/things/2#foo being compacted into 2#foo?

Instead I would expect http://example.com/api/things/1#foo to be compacted into 1#foo. But this is not what is expected by compaction test #t0066.

gkellogg commented 4 years ago

It's common for fragment identifiers to be appended to IRIs, including the base IRI, so that when compacting you get IRIs of the form #foo. It would certainly carry the same semantics if it were compacted to 1#foo; that's just not how 1.0 implementors interpreted step 10 of IRI Compaction.
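
That #foo and 1#foo carry the same semantics against this base can be checked by resolving both. As before, this is a sketch that assumes Ruby's standard `uri` library as the resolver, with hypothetical names `resolve` and `THING_BASE`.

```ruby
require 'uri'

# Hypothetical helper: resolve a relative reference against a base IRI.
def resolve(base, ref)
  URI.join(base, ref).to_s
end

THING_BASE = 'http://example.com/api/things/1'

# A fragment-only reference keeps the whole base path;
# "1#foo" replaces the last segment with the same segment.
puts resolve(THING_BASE, '#foo')  # => http://example.com/api/things/1#foo
puts resolve(THING_BASE, '1#foo') # => http://example.com/api/things/1#foo
```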

I don't believe any other RDF specs describe how IRIs should be compacted, and we used our best understanding to come up with a consistent interpretation, but as I said, the text in the spec doesn't make this explicit.

My own implementation uses remove_base, which is described here:

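# Returns iri relative to base where possible, otherwise iri unchanged.
# CONTEXT_BASE_FRAG_OR_QUERY is defined elsewhere in the implementation; judging
# by the description that follows, it holds the characters that start a query or fragment.
def remove_base(base, iri)
  return iri unless base
  # Memoized list of the base's directory and each of its parents, nearest first.
  @base_and_parents ||= begin
    u = base
    iri_set = u.to_s.end_with?('/') ? [u.to_s] : []
    iri_set << u.to_s while (u = u.parent)
    iri_set
  end
  b = base.to_s
  # The IRI is the base followed by a query or fragment: return only that suffix.
  return iri[b.length..-1] if iri.start_with?(b) && CONTEXT_BASE_FRAG_OR_QUERY.include?(iri[b.length, 1])

  # Otherwise use the nearest ancestor that prefixes the IRI, with one "../" per level climbed.
  @base_and_parents.each_with_index do |bb, index|
    next unless iri.start_with?(bb)
    rel = "../" * index + iri[bb.length..-1]
    # An empty result means the IRI is exactly the base's directory.
    return rel.empty? ? "./" : rel
  end
  # No common ancestor: leave the IRI absolute.
  iri
end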

Basically, it has a rule to return the fragment or query appended to the base, if there is one; otherwise, it uses '../' sequences, as necessary, added to the ipath-absolute minus any trailing isegment of the base IRI.

This could be detailed in an erratum to the API spec, in appropriate algorithmic speak.
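
For readers who want to experiment without the surrounding library, the same idea can be sketched on plain strings. This is not the implementation quoted above: `relativize` and `FRAG_OR_QUERY` are hypothetical names, the base is assumed to be an absolute IRI with an authority and a path, and corner cases such as empty path segments are not handled beyond leaving the IRI untouched.

```ruby
# Characters that start a query or fragment (hypothetical constant name).
FRAG_OR_QUERY = ['#', '?'].freeze

# Hypothetical standalone sketch: make iri relative to base where possible.
def relativize(base, iri)
  # Rule 1: the IRI is the base followed by a query or fragment.
  if iri.start_with?(base) && FRAG_OR_QUERY.include?(iri[base.length, 1])
    return iri[base.length..-1]
  end

  # Scheme and authority plus the leading "/", e.g. "http://example.com/".
  root = base[%r{\A[a-z][a-z0-9+.-]*://[^/?#]*/}i]
  return iri unless root

  # Directory of the base: its path up to and including the last "/".
  path_part = base[/\A[^?#]*/]
  dir = path_part[0..path_part.rindex('/')]
  ups = 0
  while dir.length >= root.length
    if iri.start_with?(dir)
      rel = '../' * ups + iri[dir.length..-1]
      return rel.empty? ? './' : rel
    end
    # Climb one directory; stop if nothing can be removed.
    parent = dir.sub(%r{[^/]+/\z}, '')
    break if parent == dir
    dir = parent
    ups += 1
  end
  # No shared ancestor on the same authority: keep the IRI absolute.
  iri
end

things = 'http://example.com/api/things/1'
puts relativize(things, 'http://example.com/api/things/1')     # => 1
puts relativize(things, 'http://example.com/api/things/2')     # => 2
puts relativize(things, 'http://example.com/api/things/1#foo') # => #foo
puts relativize('https://w3c.github.io/json-ld-api/tests/compact/0066-in.jsonld',
                'https://w3c.github.io/absolute')              # => ../../../absolute
```

The outputs match the expectations discussed in this thread for tests 0066 and 0076; agreement with the rest of the test suite is not implied.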