Pattern Matching

AtesComp commented 3 years ago

Modify the "Value Pattern Matching Algorithm" to match on regular expressions.

NOTE: I've split this off from #73 as I believe this is a useful framing issue on its own.

From the framing specification:

4.3 Value Pattern Matching Algorithm

The Value Pattern Matching Algorithm is used as part of the Framing and Frame Matching algorithms. A value object matches a value pattern using the match none and wildcard patterns on @value, @type, and @language, in addition to allowing a specific value to match a set of values defined using the array form for each value object property.

It seems matching is an all-or-nothing affair: match anything ({}), nothing ([]), or an exact, specific value. There is no middle ground for generic string matching. It would be useful for matching to use regex patterns. Example:

"ex:relationship-.+"

This is helpful for a wide range of use cases. For example, when there is no @type in the input, but the @id may contain information that can be used to infer type, then a partial string match within the @id can identify the default type. It can be very useful for matching on properties (2.1.1). Then, 4.3 Value Pattern Matching Algorithm becomes much more robust.
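
To make the use case concrete, here is a minimal sketch of the same inference done outside JSON-LD, as a plain preprocessing step. The rule table, the function name infer_type, and the choice of full-string matching are assumptions made for illustration; none of this is part of the JSON-LD specification or of any JSON-LD library.

```python
import re

# Hypothetical rules: (regex applied to the whole @id, default type to assign).
TYPE_RULES = [
    (re.compile(r".*/[Ll]ibrary/.*"), "Library"),
    (re.compile(r".*/[Bb]ook/.*"), "Book"),
    (re.compile(r".*/[Cc]hapter/.*"), "Chapter"),
]


def infer_type(node):
    """Return a copy of node with @type filled in from the first rule whose
    pattern fully matches @id; nodes that already have @type are unchanged."""
    if "@type" in node or "@id" not in node:
        return dict(node)
    for pattern, default_type in TYPE_RULES:
        if pattern.fullmatch(node["@id"]):
            return {**node, "@type": default_type}
    return dict(node)


typed = infer_type({"@id": "http://example.org/library/1"})
untouched = infer_type({"@id": "http://example.org/shelf/7"})
```

Rules are tried in order, so an @id that contains more than one of these path segments gets the type of the first matching rule.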

RELATED: The JSON Schema specification that uses the pattern keyword for regular expressions: https://json-schema.org/understanding-json-schema/reference/regular_expressions.html

OWL2 also reserves xsd:pattern for regex and uses it in restrictions.

PROPOSITION: Extend JSON-LD Framing with the @pattern keyword.

[FRAME]
{
  "@context": {"@vocab": "http://example.org/"},
  "@id": {"@pattern": ".*\/[Ll]ibrary\/.*"},
  "@type": {"@default": "Library"},
  "contains": {
    "@id": {"@pattern": ".*\/[Bb]ook\/.*"},
    "@type": {"@default": "Book"},
    "contains": {
      "@id": {"@pattern": ".*\/[Cc]hapter\/.*"},
      "@type": {"@default": "Chapter"}
    }
  }
}

@pattern should accept an array of patterns.

[FRAME]
{
  "@context": {"@vocab": "http://example.org/"},
  "@id": {"@pattern": [
    ".*\/[Ll]ibrary\/.*",
    ".*\/[Aa]thenaeum\/.*",
    ".*\/[Bb]ook_?[Cc]ollection\/.*"]
  },
  "@type": {"@default": "Library"},
  "contains": {
    "@id": {"@pattern": ".*\/[Bb]ook\/.*"},
    "@type": {"@default": "Book"},
    "contains": {
      "@id": {"@pattern": ".*\/[Cc]hapter\/.*"},
      "@type": {"@default": "Chapter"}
    }
  }
}

Alternatively, we could just use regex alternation (|), but it might be nice to support the array form as well.

[FRAME]
...
  "@id": {"@pattern": ".*\/([Ll]ibrary|[Aa]thenaeum|[Bb]ook_?[Cc]ollection)\/.*"}
...
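
As a quick check that the array form and the alternation form accept the same @id values, the following sketch compares them on a few sample IRIs. It assumes that a pattern must match the whole string and that an array matches when any one of its patterns matches; both are assumptions about how @pattern might behave, since the keyword is only a proposal. The sample IRIs are made up.

```python
import re

# The three patterns from the array form of the frame.
PATTERNS = [
    r".*/[Ll]ibrary/.*",
    r".*/[Aa]thenaeum/.*",
    r".*/[Bb]ook_?[Cc]ollection/.*",
]
# The single-pattern form using alternation.
COMBINED = r".*/([Ll]ibrary|[Aa]thenaeum|[Bb]ook_?[Cc]ollection)/.*"


def matches_any(value, patterns):
    """True if value fully matches at least one pattern (assumed semantics)."""
    return any(re.fullmatch(p, value) for p in patterns)


def matches_combined(value):
    """True if value fully matches the single alternation pattern."""
    return re.fullmatch(COMBINED, value) is not None


samples = [
    "http://example.org/library/1",
    "http://example.org/Athenaeum/2",
    "http://example.org/BookCollection/3",
    "http://example.org/book_collection/4",
    "http://example.org/museum/5",
]
results = [(matches_any(s, PATTERNS), matches_combined(s)) for s in samples]
```

Both forms agree on every sample here, so the array form is a readability convenience rather than extra expressive power.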

How about matching property names, not just the values? Then, 2.1 Framing becomes much more robust and we can do some interesting things like shaping based on property patterns. In the following case, typing based on property patterns and relations:

[FRAME]
{
  "@context": {"@vocab": "http://example.org/"},
  "@type": {"@default": "Library"},
  "location": {"@pattern": "[Aa]thens(, (Greece|Tennessee, USA))?"},
  "contains": [
    {
      "@type": {"@default": "Book"},
      {"@pattern": ".*([Cc]reator|[Aa]uthor).*"}: {},
      "contains": {
        "@id": {"@pattern": ".*\/[Cc]hapter\/.*"},
        "@type": {"@default": "Chapter"}
      }
    },
    {
      "@type": {"@default": "Periodical"},
      {"@pattern": ".*([Cc]reator|[Pp]ublisher).*"}: {},
      "contains": {
        "@id": {"@pattern": ".*\/[Aa]rticle\/.*"},
        "@type": {"@default": "Article"}
      }
    }
  ]
}
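
The property-name idea can be sketched in plain Python as a filter over a node's keys. The helper name matching_properties, the sample node, and the full-string matching rule are hypothetical; the key syntax used in the frame above is itself only a proposal and is not valid JSON as written.

```python
import re


def matching_properties(node, name_pattern):
    """Return the non-keyword properties of node whose names fully match."""
    regex = re.compile(name_pattern)
    return {
        name: value
        for name, value in node.items()
        if not name.startswith("@") and regex.fullmatch(name)
    }


# Made-up node used only to exercise the helper.
book = {
    "@id": "http://example.org/book/1",
    "primaryAuthor": "A. Writer",
    "creatorNote": "signed copy",
    "publisher": "Example Press",
}
found = matching_properties(book, r".*([Cc]reator|[Aa]uthor).*")
```

A node could then be shaped as a Book when found is non-empty, which is the kind of property-driven typing the frame above is aiming for.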

gkellogg commented 3 years ago

Something to consider for a future version. It would be good to collect use cases where this is important.

Note that you could achieve much better matching, in even more cases, by doing a SPARQL CONSTRUCT and then framing the resulting graph.

AtesComp commented 3 years ago

Thanks for the response and consideration.

I've seen other comments about SPARQL CONSTRUCT statements being used. This presumes the original data is accessible via a SPARQL store. However, my use case has no such origin. The data is ingested from an otherwise non-linked data source--raw JSON--and transformed into a standardized JSON-LD format for an RDF store. I could throw it into a temp store and do all the work there, but then why would anyone need JSON-LD over JSON? JSON-LD does 90% of what I'm attempting, so...

I'm attempting to use the JSON-LD standard as a transformative ingester on raw JSON by adding an @context, using a framing @context to transform the data into the expected format, and using the standard JSON-LD algorithms to effect the transformation, adding any missing @type through simple regex-based inference on structure and content. The JSON-LD specification is so very close to helping me achieve that goal...frustratingly so. I see that the code base uses regex to find the specification's internal structures in the data for framing, but it doesn't allow for general regex use. I could easily fork it and implement this as needed, but I would rather have an open discussion on the merits.

The basic issue here is general matching for framing. The specification gives us none ([]), all ({}), or exact (a list of specific strings), but no general bridge between "all" and "exact"...no subset other than a list of specific strings. Via the standard, I would need to list all specific matching strings--an unbearable thought. A regex match would solve this issue.
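
The gap can be illustrated with a simplified matcher for a single string value. The first three branches mirror the none, all, and exact cases described above; the @pattern branch is the hypothetical extension proposed in this issue, assumed here to require a full-string match. This is a sketch of the idea, not the Value Pattern Matching Algorithm from the specification.

```python
import re


def matches(value, pattern):
    """Simplified check of one string value against a frame-style pattern.

    []                  -> match none
    {}                  -> wildcard, match any value
    "a" or ["a", "b"]   -> exact match against the listed strings
    {"@pattern": ...}   -> hypothetical regex form (string or list of strings)
    """
    if pattern == []:
        return False
    if pattern == {}:
        return True
    if isinstance(pattern, dict) and "@pattern" in pattern:
        regexes = pattern["@pattern"]
        if isinstance(regexes, str):
            regexes = [regexes]
        return any(re.fullmatch(r, value) for r in regexes)
    if isinstance(pattern, str):
        return value == pattern
    if isinstance(pattern, list):
        return value in pattern
    raise ValueError("unsupported pattern: %r" % (pattern,))
```

With exact matching alone, covering every ex:relationship-* value means listing each one; the regex branch covers them with the single pattern "ex:relationship-.+".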

I find it humorous that the proposed solution would use regex to find a regex string to apply regex to a predicate or value.

gkellogg commented 3 years ago

JSON-LD Framing is not intended to be a generic query language, and attempts to make it one have been resisted. As JSON-LD is an RDF serialization format, SPARQL does, indeed, provide a way of performing generic queries. SPARQL is used for querying data, either contained in a triple store, or as represented in an RDF document (e.g., using the FROM clause in SPARQL).

Depending on the JSON-LD library you're using, there is likely a fuller suite of RDF tools available to you and the workflow would be like the following:

1) Retrieve the JSON-LD document.
2) Transform it to an RDF dataset using the toRdf API, or similar.
3) Query the dataset using an appropriate SPARQL CONSTRUCT query.
4) Serialize the resulting RDF graph as JSON-LD.
5) Frame that result using JSON-LD Framing.

Note that this can typically be done straight from the SPARQL query by referencing the JSON-LD document directly from the FROM clause. By using an appropriate ACCEPT header, or other library-specific option, you can do this entirely from the SPARQL query. IIRC, dydra has a way to specify a frame to use when returning the results of a query as JSON-LD (cc, @lisp).

There are other methods that have varying degrees of acceptance, see for example, GraphQL-LD, for which @rubensworks may have something to say.

AtesComp commented 3 years ago

Agreed. Not a query language. Just a transformation step from JSON to JSON-LD...something I can work with.

w3c / json-ld-framing

Pattern Matching #118