jmakeig opened this issue 7 years ago
One advantage to materialization over query rewriting is that you can annotate the snapshot of the data with its own metadata, making it straightforward to maintain as a coherent whole and trace its lineage as things change. With query rewriting you need the state of the instance data, the model, and the query rewrite rules in order to query the past or move the data around. Probably not a big deal if we assert that query rewriting is (mostly) an interim state.
I’d also like to think about a common interface and then a spectrum, in terms of the implementation. At one end, everything is resolved dynamically at runtime out of the model while at the other end, everything queryable is materialized instances. Is there an abstraction above this that we could provide developers and allow tuning the tradeoffs dynamically underneath?
One advantage to this "query indirection" is that while the model is in a state of flux (changing very often), it's very easy and efficient to change the model and keep working. This can be part of a flow:

1. Query for 94111 anywhere in any document.
2. Query for `zip_code=94111 OR postal=94111`.
3. Model all the places you might find a zip, probably as triples, as described above; expand each query. This is a lot like semantic search or synonym expansion, except that instead of expanding the value you're looking for, you're expanding the places where you're looking for it.
4. Push the model into the documents using the envelope pattern.

This can be a continuum as you find out more about your data.
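Step 3 above can be sketched in a few lines. This is a hypothetical illustration in plain Python, not MarkLogic code; the `foundIn` predicate and field names are invented for the example. A small triples-like model records the physical places a logical field may appear, and the rewriter expands each query into an OR over those places:

```python
# Hypothetical sketch (plain Python, not MarkLogic code) of step 3:
# a triples-like model maps a logical field to the physical fields
# where it may be found, and each query is expanded into an OR.

# Model as (subject, predicate, object) triples: "zip" may be found
# in several source-specific fields.
model = [
    ("zip", "foundIn", "zip_code"),   # source A
    ("zip", "foundIn", "postal"),     # source B
]

def expand(logical_field, value, model):
    """Rewrite a logical-field query into an OR over physical fields."""
    fields = [o for s, p, o in model if s == logical_field and p == "foundIn"]
    return " OR ".join(f"{f}={value}" for f in fields)

print(expand("zip", "94111", model))
# prints "zip_code=94111 OR postal=94111"
```

Note this is exactly synonym expansion applied to field names rather than values: the model grows, the rewrite stays mechanical, and no document is touched.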
Of course the end state may well be a mix of 3 and 4, where most of the useful aspects of the model have been discovered and pushed into documents, but there's still some level of change going on in the model artifact that hasn't (yet) been pushed down to the documents.
What about relationships that are more nuanced than “same as”? Presumably you’d have to tell the query rewriter the meaning of your predicates and how to rewrite them. “Location is defined by ZIP Code, except for the cases where ZIP Code is missing and then it’s Region, or if it’s a foreign company, then use postal_code in England, …”
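To make the “more nuanced than same-as” case concrete, here is a hypothetical Python sketch of a rule-with-callback in the spirit of Kurt Cagle's suggestion below. The field names, the source-metadata keys, and the `when` predicate are all invented for illustration:

```python
# Hypothetical sketch: rewrite rules that are conditional on the
# source, rather than a flat "same as" mapping. Each rule carries a
# small callback ("when") that decides whether it applies.

rules = [
    # (condition on source metadata, field that defines "location")
    (lambda src: src.get("country") == "GB", "postal_code"),
    (lambda src: src.get("has_zip", False), "zip_code"),
    (lambda src: True, "region"),  # fallback when ZIP Code is missing
]

def location_field(src):
    """Pick the physical field that defines 'location' for a source."""
    for when, field in rules:
        if when(src):
            return field

print(location_field({"country": "GB"}))  # prints "postal_code"
print(location_field({"has_zip": True}))  # prints "zip_code"
print(location_field({}))                 # prints "region"
```

The point is that the rule order encodes the “except for” clauses of the prose rule, which a plain equivalence table cannot express.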
I think Kurt Cagle had proposed annotating Semantic queries with code “callbacks” for this type of thing.
I find that most code built on top of `cts:query`/`search:search` expects there to be a value materialized on ingest, e.g. the faceting code in `search:search` requires one range-indexed item (it could be a field index, but still, it wants some single index in the db). So I think it pays to be opinionated about harmonization and say it should ideally be done at ingest with "schema on write." Other places where we expect a single value or single range index may include the Entity modeler in v2 of the data hub framework, where a field is graphically modeled and marked in various ways (indexed, facet, required, primary key). I'm sure there are many libraries that assume this and can't do an implicit OR.
It is still a big advantage that you can query new data via an OR whenever you need to, but the "one way to do it" should be to materialize.
So if we wanted to be able to assert equivalence through rules, it would be ideal if that showed up transparently as a synthetic single element, usable via `cts.query`, XPath, and all the rest. This may be doable via a new kind of TDE (though there may be a reindexing delay before it is usable). It occurs to me that registered queries are much like a way to build this faux-element lazily as it is queried. If the faux-element could be pushed into the list cache and even optionally into a range index as needed, that would be pretty neat. I imagine it would be computed from termlists rather than full docs so that it is fast enough to do at runtime without reading every document and walking all the elements. No idea if something like that could actually work, though.
In contrast, the RDF/SPARQL world is all about asserting equivalence via ontologies or at query time. So perhaps RDF is a path of least resistance to query multiple values based on a semantic assertion of equivalence.
You're thinking backwards, @jmakeig: from "if I had a relationship X" to "… then what could I do with it?" We should be thinking "what would I want to do with relationships other than sameAs?" One could imagine at least sub-predicates and super-predicates. So I want to search for "London" in every field that's a "collectiveDwellingPlace" (hamlet, village, town, city, metropolis).
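The sub-predicate idea amounts to expanding a query over the transitive closure of a type hierarchy. A hypothetical Python sketch, with an invented `subtype_of` table standing in for what RDFS would express with `rdfs:subPropertyOf`:

```python
# Hypothetical sketch: searching a super-predicate such as
# "collectiveDwellingPlace" should cover every field type that is
# (transitively) a subtype of it.

subtype_of = {
    "hamlet": "collectiveDwellingPlace",
    "village": "collectiveDwellingPlace",
    "town": "collectiveDwellingPlace",
    "city": "collectiveDwellingPlace",
    "metropolis": "city",  # a metropolis is a kind of city
}

def subtypes(t):
    """All types that are (transitively) subtypes of t, plus t itself."""
    result = {t}
    changed = True
    while changed:  # fixed-point iteration over the hierarchy
        changed = False
        for sub, sup in subtype_of.items():
            if sup in result and sub not in result:
                result.add(sub)
                changed = True
    return result

print(sorted(subtypes("collectiveDwellingPlace")))
```

A query for "London" in a `collectiveDwellingPlace` field would then be rewritten as an OR over every type this returns, including `metropolis` via the intermediate `city`.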
@damonfeldman, I think you're missing the point of this extra step. It's a level of query indirection on purpose, so that the model can be changed quickly and easily (with minimal impact). As soon as you have a materialized field in the document, then any time you make a change you have to change (and re-index) the document. With a level of indirection, you can change the model and have query results change instantly, with no document changes or reindexing. Then at some point, once the model and the document-set shape are relatively stable, you accept that additional indexing work and add the model information to the documents.
When @sbuxton and I discussed this issue last week, I think that it underscored the difference between discovery scenarios and production scenarios. The data hub will, I hope, encompass both.
The visions for tooling that I‘ve seen so far include a kind of interface to source data that can drill into the data arbitrarily and give enough feedback to the expert that they can plan how to build productive envelopes, identify which transforms and data negotiations are required among sources, and see which data is poor or good.
Query rewriting probably is a great implementation of how to make a model that can "peek into" sources without having to do the heavy lifting of transforms, writing documents, and indexing.
I would caution against any single system that tries to be both production services and a discovery UI. Discovery is a kind of ancillary activity that can be done on a sample of data and then inform a production pipeline of how it should integrate new data or change over time. In other words, the data hub that we've seen so far (and it had to come first, yes) is how companies get data integration at scale, while a discovery use case provides the tooling to get data integration with a high velocity of change.
Agreed, that it’s helpful to differentiate between discovery in staging and delivery in production.
As a data architect, I want to be able to model my physical source data to describe its properties and relationships. Given two representations of the same concept in two sources, I want to be able to associate those as related, for example, `zip_code` in source A is the same as `postal` in source B. By specifying this much information, the database should be able to rewrite a query that asks for `A.zip_code` or `B.postal` as `A.zip_code OR B.postal`, similar to how synonyms work for text. (What about relationships that aren’t “same as”?) When I get a new source, C, I can add another assertion that says which field is the same as `A.zip_code` or `B.postal`, and the existing queries will “just work®”.
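The “just work” claim can be demonstrated with a hypothetical Python sketch; the equivalence set, `rewrite` helper, and field name `C.zipcode` are all invented for illustration. Adding source C is one new assertion against the model, with no change to any existing query:

```python
# Hypothetical sketch of the user story: equivalence assertions drive
# query rewriting, so onboarding a new source is one new assertion.

same_as = {"A.zip_code", "B.postal"}  # one equivalence class for "zip"

def rewrite(field, value):
    """If the queried field is in the equivalence class, OR over all
    members; otherwise leave the query untouched."""
    if field in same_as:
        return " OR ".join(f"{f}={value}" for f in sorted(same_as))
    return f"{field}={value}"

print(rewrite("A.zip_code", "94111"))
# prints "A.zip_code=94111 OR B.postal=94111"

# Source C arrives: one assertion, zero query changes.
same_as.add("C.zipcode")
print(rewrite("A.zip_code", "94111"))
# prints "A.zip_code=94111 OR B.postal=94111 OR C.zipcode=94111"
```

The same query text now reaches source C, which is the indirection benefit argued for above; materializing the field later would trade this flexibility for index-backed performance.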