marklogic / entity-services

Data modeling and code scaffolding for data integration in MarkLogic
https://docs.marklogic.com/guide/entity-services
Apache License 2.0

Semantic query rewriting #294

Open jmakeig opened 7 years ago

jmakeig commented 7 years ago

As a data architect, I want to be able to model my physical source data to describe its properties and relationships. Given two representations of the same concept in two sources, I want to be able to associate them as related; for example, zip_code in source A is the same as postal in source B. Given that much information, the database should be able to rewrite a query that asks for A.zip_code or B.postal as A.zip_code OR B.postal, similar to how synonyms work for text. (What about relationships that aren’t “same as”?) When I get a new source, C, I can add another assertion that says which of its fields is the same as A.zip_code or B.postal, and the existing queries will “just work®”.
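
A minimal sketch of what recording that assertion might look like in MarkLogic server-side JavaScript, reusing owl:sameAs; the source-field IRIs are illustrative assumptions, not a real naming scheme:

```javascript
'use strict';
declareUpdate();

// Record the modeling assertion as a triple: source A's zip_code is the
// same concept as source B's postal. (Illustrative IRIs only.)
const sameAs = sem.iri('http://www.w3.org/2002/07/owl#sameAs');
sem.rdfInsert(sem.triple(
  sem.iri('http://example.org/sourceA/zip_code'),
  sameAs,
  sem.iri('http://example.org/sourceB/postal')
));
// When source C shows up, one more triple joins it to the equivalence
// set, and a rewriter driven by these triples needs no query changes.
```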

jmakeig commented 7 years ago

One advantage of materialization over query rewriting is that you can annotate the snapshot of the data with its own metadata, making it straightforward to maintain as a coherent whole and to trace its lineage as things change. With query rewriting, you need to capture the state of the instance data, the model, and the query rewrite rules together in order to query the past or move the data around. That’s probably not a big deal if we assert that query rewriting is (mostly) an interim state.

I’d also like to think about a common interface backed by a spectrum of implementations. At one end, everything is resolved dynamically at runtime out of the model; at the other end, everything queryable is a materialized instance. Is there an abstraction above this that we could provide to developers, allowing the trade-offs to be tuned dynamically underneath?
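
As a strawman, here’s a hedged sketch of that abstraction in server-side JavaScript; isMaterialized and physicalNames are hypothetical helpers standing in for whatever the model would actually expose:

```javascript
'use strict';

// Hypothetical model lookups, stubbed for illustration.
function isMaterialized(logicalName) {
  return logicalName === 'ssn'; // pretend ssn was pushed into envelopes
}
function physicalNames(logicalName) {
  return { zip: ['zip_code', 'postal'] }[logicalName] || [logicalName];
}

// One interface for callers; the implementation can slide along the
// spectrum from fully dynamic rewriting to fully materialized instances.
function propertyQuery(logicalName, value) {
  if (isMaterialized(logicalName)) {
    // materialized end: one canonical envelope field, no indirection
    return cts.jsonPropertyValueQuery(logicalName, value);
  }
  // dynamic end: expand across every physical shape the model knows
  return cts.orQuery(
    physicalNames(logicalName).map(
      name => cts.jsonPropertyValueQuery(name, value))
  );
}

cts.search(propertyQuery('zip', '94111'));
```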

jmakeig commented 7 years ago

Related to https://github.com/marklogic/marklogic-data-hub/projects/1#card-1745528.

sbuxton commented 7 years ago

One advantage of this "query indirection" is that while the model is in a state of flux (changing very often), it's easy and efficient to change the model and keep working. This can be part of a flow:

  1. query for 94111 anywhere in any document

    • no model necessary
    • can do this right away, after you've loaded as-is
    • as the model changes (more sources/document shapes come in) this query continues to work
    • downside: you'll also find documents with 94111 as part of a phone number or other field
  2. query for zip_code=94111 OR postal=94111

    • the model is encoded in every query
    • as the model changes you need to change every query
  3. model all the places you might find a zip, probably as triples, as described above, and expand each query. This is a lot like semantic search or synonym expansion, except that instead of expanding the value you're looking for, you're expanding the places where you're looking for it. (See the sketch after this list.)

    • the model is now a separate artifact
    • as the model changes, you just change that artifact and queries work with the new model/sources
    • the artifact is available for you to examine, and to use for things like step 4
  4. push the model into the documents using the envelope pattern

    • use the model "artifact" you created in 3
    • now it's more work whenever the model changes (since documents need to be changed/reindexed), BUT queries are very efficient (there's no query indirection)
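
A hedged sketch of step 3 in MarkLogic server-side JavaScript; the model vocabulary (m:represents, m:propertyName) is an illustrative assumption, not an existing ontology:

```javascript
'use strict';

// The model is a separate artifact: triples naming every physical
// property that holds a ZIP code.
const names = sem.sparql(`
  PREFIX m: <http://example.org/model#>
  SELECT ?name
  WHERE { ?field m:represents m:ZipCode ;
                 m:propertyName ?name }
`).toArray().map(binding => fn.string(binding.name));

// Expand one logical question into an OR over every physical shape.
cts.search(cts.orQuery(
  names.map(name => cts.jsonPropertyValueQuery(name, '94111'))
));
```

When a new source arrives, adding its triples to the model artifact is enough; the expansion picks it up on the next query.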

This can be a continuum as you find out more about your data.

Of course, the end state may well be a mix of 3 and 4, where most of the useful aspects of the model have been discovered and pushed into documents, but some level of change is still going on in the model artifact that hasn't (yet) been pushed down to the documents.

jmakeig commented 7 years ago

What about relationships that are more nuanced than “same as”? Presumably you’d have to tell the query rewriter the meaning of your predicates and how to rewrite them: “Location is defined by ZIP Code, except where ZIP Code is missing, in which case it’s Region; or, if it’s a foreign company, use postal_code in England, …”

I think Kurt Cagle had proposed annotating semantic queries with code “callbacks” for this type of thing.
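
For what that might look like, here’s a hedged sketch of one such callback in server-side JavaScript, with all field names as illustrative assumptions:

```javascript
'use strict';

// A rewrite "callback": a rule too nuanced for sameAs, expressed as code.
function locationQuery(value) {
  return cts.orQuery([
    // the primary definition of location
    cts.jsonPropertyValueQuery('zip_code', value),
    // fall back to region, but only in documents with no ZIP Code at all
    cts.andNotQuery(
      cts.jsonPropertyValueQuery('region', value),
      cts.jsonPropertyScopeQuery('zip_code', cts.trueQuery())
    ),
    // foreign sources use postal_code instead
    cts.jsonPropertyValueQuery('postal_code', value)
  ]);
}

cts.search(locationQuery('94111'));
```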

damonfeldman commented 7 years ago

I find that most code built on top of cts:query/search:search expects a value to be materialized on ingest; e.g., the faceting code in search:search requires one range-indexed item (it could be a field index, but it still wants some single index in the db). So I think it pays to be opinionated about harmonization and say it should ideally be done at ingest, with "schema on write." Other places we expect a single value or a single range index include the entity modeler in v2 of the Data Hub Framework, where a field is graphically modeled and marked in various ways (indexed, facet, required, primary key). I'm sure there are many libraries that assume this and can't do an implicit OR.

It is still a big advantage that you can query new data via an OR whenever you need to, but the "one way to do it" should be to materialize.

So if we wanted to be able to assert equivalence through rules, it would be ideal if that showed up transparently as a synthetic single element, usable via cts.query, XPath, and all the rest. This may be doable via a new kind of TDE (though there may be a reindexing delay before it is usable). It occurs to me that registered queries are much like a way to build this faux-element lazily, as it is queried. If the faux-element could be pushed into the list cache, and even optionally into a range index as needed, that would be pretty neat. I imagine it would be computed from term lists rather than full documents, so that it is fast enough to do at runtime without reading every document and walking all the elements. No idea if something like that could actually work, though.
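
Today's TDE can at least project the variants into a single view column (visible to SQL/Optic, though not to cts.query or XPath); a hedged sketch, with schema, view, and column names as illustrative assumptions:

```javascript
'use strict';
declareUpdate();
const tde = require('/MarkLogic/tde.xqy');

const template = xdmp.toJSON({
  template: {
    context: '/',
    rows: [{
      schemaName: 'integration',
      viewName: 'locations',
      columns: [{
        name: 'zip',
        scalarType: 'string',
        // XPath union: whichever property a given source happens to use
        val: 'zip_code|postal',
        nullable: true
      }]
    }]
  }
});

// Validates the template and writes it to the schemas database;
// reindexing has to catch up before the column is queryable.
tde.templateInsert('/tde/zip.json', template);
```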

In contrast, the RDF/SPARQL world is all about asserting equivalence via ontologies or at query time. So perhaps RDF is a path of least resistance to query multiple values based on a semantic assertion of equivalence.

sbuxton commented 7 years ago

You're thinking backwards @jmakeig, from "if I had a relationship X" to "… then what could I do with it?" We should be thinking "what would I want to do with relationships other than sameAs?" One could imagine at least sub-predicates and super-predicates. For example, I want to search for "London" in every field that's a "collectiveDwellingPlace" (hamlet, village, town, city, metropolis).
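
A hedged sketch of that query, modeling the hierarchy with rdfs:subClassOf; the m: vocabulary is the same illustrative assumption as above:

```javascript
'use strict';

// Every field whose type is a (transitive) subclass of
// CollectiveDwellingPlace: hamlet, village, town, city, metropolis.
const fields = sem.sparql(`
  PREFIX m: <http://example.org/model#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT ?name
  WHERE { ?type rdfs:subClassOf* m:CollectiveDwellingPlace .
          ?field m:represents ?type ;
                 m:propertyName ?name }
`).toArray().map(binding => fn.string(binding.name));

cts.search(cts.orQuery(
  fields.map(name => cts.jsonPropertyValueQuery(name, 'London'))
));
```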

sbuxton commented 7 years ago

@damonfeldman, I think you're missing the point of this extra step. It's a level of query indirection on purpose, so that the model can be changed quickly and easily, with minimal impact. As soon as you have a materialized field in the document, any time you make a change you have to change (and re-index) the documents. With a level of indirection, you can change the model and have query results change instantly, with no document changes or reindexing. Then at some point, once the model and document-set shape are relatively stable, you make the trade-off in the other direction: accept the additional indexing work and push the model information into the documents.

grechaw commented 7 years ago

When @sbuxton and I discussed this issue last week, it underscored the difference between discovery scenarios and production scenarios. The data hub will, I hope, encompass both.

The visions for tooling that I’ve seen so far include a kind of interface to source data that can drill into the data arbitrarily and give the expert enough feedback to plan how to build productive envelopes, identify which transforms and data negotiations are required among sources, and judge which data is good or poor.

Query rewriting is probably a great way to implement a model that can "peek into" sources without the heavy lifting of transforms, writing documents, and indexing.

I would caution against any single system trying to be both production services and a discovery UI. Discovery is an ancillary activity that can be done on a sample of data and then inform a production pipeline about how it should integrate new data or change over time. In other words, the data hub that we've seen so far (and it had to come first, yes) is how companies get data integration at scale, while a discovery use case provides the tooling to get data integration with a high velocity of change.

jmakeig commented 7 years ago

Agreed that it’s helpful to differentiate between discovery in staging and delivery in production.