marklogic / entity-services

Data modeling and code scaffolding for data integration in MarkLogic
https://docs.marklogic.com/guide/entity-services
Apache License 2.0

Add additional elements to the envelope structure #143

Open paxtonhare opened 8 years ago

paxtonhare commented 8 years ago

The Data Hub Framework would like to see es:triples and es:headers added to the es:envelope.

The current envelope in the hub looks like:

<envelope>
  <headers></headers>
  <triples></triples>
  <content></content>
</envelope>

content already maps to es:instance. The triples and headers will make our plan for world domination complete.
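As a rough sketch of the mapping being asked for, the snippet below copies each hub section into a like-named `es:` element, with `content` becoming `es:instance`. It is plain Python using only the standard library, not Entity Services code. The namespace URI bound to `es:`, the `hub_to_es` helper, and the sample data are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed URI for the es: prefix; the thread never spells it out.
ES_NS = "http://marklogic.com/entity-services"
ET.register_namespace("es", ES_NS)

# Hub section name -> proposed Entity Services element name.
SECTION_MAP = {"headers": "headers", "triples": "triples", "content": "instance"}


def hub_to_es(hub_envelope):
    """Copy each known hub section into the matching es: element (hypothetical helper)."""
    es_envelope = ET.Element(f"{{{ES_NS}}}envelope")
    for hub_name, es_name in SECTION_MAP.items():
        target = ET.SubElement(es_envelope, f"{{{ES_NS}}}{es_name}")
        source = hub_envelope.find(hub_name)
        if source is not None:
            target.text = source.text
            target.extend(list(source))
    return es_envelope


# Made-up sample document in the hub's current envelope shape.
hub = ET.fromstring(
    "<envelope><headers><source>billing-feed</source></headers>"
    "<triples/><content><Patient><name>Example</name></Patient></content></envelope>"
)
es_envelope = hub_to_es(hub)
print(ET.tostring(es_envelope, encoding="unicode"))
```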

jmakeig commented 8 years ago

@paxtonhare What is headers used for?

Just to be clear, the envelope that Entity Services prescribes is just the default. You can change it however you see fit in the transformation code. Agreed, though if we could make the defaults more applicable, users would have to do less modification of the out-of-the-box transforms.

paxtonhare commented 8 years ago

The headers are there to let users store arbitrary key/value data that doesn't fit elsewhere.

grechaw commented 8 years ago

We will add es:headers, es:triples, but not es:content.

jmakeig commented 8 years ago

Whichever envelope format we come up with, it’s only the default—a user could write her own. However, the out-of-the-box implementation is a key indicator of how to think about envelopes and why you need them. Thus we should try to provide the primary use cases out-of-the-box.

As a data steward, Stewart needs to track various dimensions of the data in order to capture everything he knows about a particular entity.

In general, these fall into three categories:

Any of these may be represented as JSON, XML, or triples.

<es:envelope>
  <es:instance>…</es:instance>
  <es:view type="http://example.com/Patient#Billing">…</es:view>
  <es:view type="http://example.com/Patient#QualityOfCare">…</es:view>
  <es:metadata>…</es:metadata>
</es:envelope>

(Note: These names are placeholders for the sake of this discussion. This issue is to decide the names of these things.)

It also might be the case where you want to scope metadata to a particular canonical instance or source.

<es:envelope>
  <es:instance>…</es:instance>
  <es:view type="http://example.com/Patient#Billing">
    <es:metadata>…</es:metadata>
    …
  </es:view>
  <es:view type="http://example.com/Patient#QualityOfCare">…</es:view>
  <es:metadata></es:metadata>
</es:envelope>

jmakeig commented 8 years ago

Now that I wrote that down, isn’t <es:instance/> just a specialized case of <es:view/>? Wouldn’t Stewart want to model the QualityOfCare view just like the canonical (and then do model-to-model) transformation?

(Copying @damonfeldman too.)

damonfeldman commented 8 years ago

The idea of multiple views is very interesting. Instead of naming sections after their likely use (source, metadata, etc.) it asserts no opinion about use and simply states that they are alternate presentations for the same data (view type=x, view type=y). The distinctions are now system distinctions - what corresponds to your entity definition (es:instance), and what does not.

I want to call out the use of different envelope sections that is perhaps the most helpful in the field though. Often the key distinction is that one section is coupled to external consumers and the other is for internal database use (indexable forms, materialized counts, structures query-able as unique elements without positional indexes, etc.)

In that case, external clients only depend on the "content" which allows the MarkLogic/database team to quickly add new fields or change things in the other ("header") section, without risking a breaking change or coordinating with other teams.
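A minimal sketch of that decoupling, assuming the hub-style envelope from the top of this issue: a client that reads only `content` gets the same result before and after the database team adds a field to `headers`. The `read_content` and `as_strings` helpers and the sample field names are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_content(envelope_xml):
    """Return the children of <content>, ignoring every other envelope section."""
    envelope = ET.fromstring(envelope_xml)
    content = envelope.find("content")
    return [] if content is None else list(content)


def as_strings(elements):
    """Serialize elements so two results can be compared by value."""
    return [ET.tostring(el, encoding="unicode").strip() for el in elements]


before = """<envelope>
  <headers><loaded-by>nightly-job</loaded-by></headers>
  <content><Patient><name>Example</name></Patient></content>
</envelope>"""

# The database team adds a materialized count to the headers only.
after = """<envelope>
  <headers><loaded-by>nightly-job</loaded-by><visit-count>3</visit-count></headers>
  <content><Patient><name>Example</name></Patient></content>
</envelope>"""

unchanged = as_strings(read_content(before)) == as_strings(read_content(after))
print(unchanged)
```

Because the client never looks inside `headers`, the extra `visit-count` element is invisible to it.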

jmakeig commented 8 years ago

Often the key distinction is that one section is coupled to external consumers

Isn’t that the whole point of views? To maintain a static interface, even if the physical data doesn’t necessarily reflect that.

damonfeldman commented 8 years ago

Yes. The externally-visible form is certainly a view. The internal section is more "the API to the indexes."

bsrikan commented 8 years ago

Copying self @bsrikan to the issue

jmakeig commented 8 years ago

damonfeldman commented 8 years ago

After our discussions today, I'm starting to feel more comfortable with the original envelope naming (headers, triples, content). The reason is that it conveys a sense that the content is the main thing, but you can add more in headers, without assuming the particular use.

Using the Data Hub Framework as an example, the Raw/Staging database is "about" raw data, so the content there is raw data representing records as they came in from a source. But the Final database is "about" something else (often servable, canonical data). To continue the example, the raw data would likely use headers for metadata and linkage to the sources; harmonized data would be very different and would likely use headers for derived data, search hints, or materialized views (averages, counts, risk scores, or similar).

Long story short, I'm leaning toward some fuzzy notion that conveys that there is one section that is primary, and another that is somehow secondary. But primary and secondary mean very different things in different contexts. (so headers="secondary" and content="primary" to be a little more concrete but not dictate use).

I could see a "metadata" section added, though that can always be nested as a child of the header if appropriate.

jmakeig commented 8 years ago

I feel strongly that the wrappers should convey the purpose of the wrapped data. headers, triples, content is all over the place, mixing format and position. Yes, you can do anything with MarkLogic. However, the point here is to be prescriptive about a way. We identified four key use cases above for why you’d want to segregate data in an envelope:

Index and serve are two sides of the same coin: Materialize some stuff so we don’t have to figure it out at query time. How do we figure out which “stuff”?

  1. With some custom code
  2. By projecting out of an entity model (and then running the upstream data through some generated code)

Thus, I propose that we combine those two under the entity (previously instance) rubric.

<es:envelope>
  <es:entity type="…">
    <es:metadata>…</es:metadata>
    <erp:BillingParty>…</erp:BillingParty>
  </es:entity>
  <es:entity type="…">
    <es:metadata>…</es:metadata>
    <qc:Patient>…</qc:Patient> 
  </es:entity>
  <es:source>…</es:source>
  <es:metadata>…</es:metadata>
</es:envelope>

What I hadn’t fully appreciated until recently is that a conceptual entity can have multiple materializations, to support either indexing or serving. In the simplest case you have one source, one type, and one instance. (The current Entity Services implementation makes some assumptions about having one es:entity element, which it currently names es:instance.) However, MarkLogic can handle multiple sources and types with aplomb.

In an ideal world, you wouldn't materialize instances at all. Rather, you’d do model-to-model translations at runtime. (This is how TDE works, in fact.) However, we know that doesn’t scale for the general case of documents-to-documents.
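To make the "multiple materializations" idea concrete, here is a small sketch that picks one `es:entity` wrapper out of an envelope by its `type` attribute. It is plain Python, not an Entity Services API. The namespace URI, the `materialization` helper, and the type URIs (borrowed from the earlier `es:view` example) are assumptions, and unprefixed element names stand in for the `erp:` and `qc:` elements because the proposal does not declare those namespaces.

```python
import xml.etree.ElementTree as ET

ES_NS = "http://marklogic.com/entity-services"  # assumed URI for the es: prefix
NS = {"es": ES_NS}

# Hypothetical envelope holding two materializations of the same conceptual entity.
ENVELOPE = f"""<es:envelope xmlns:es="{ES_NS}">
  <es:entity type="http://example.com/Patient#Billing">
    <BillingParty><name>Example</name></BillingParty>
  </es:entity>
  <es:entity type="http://example.com/Patient#QualityOfCare">
    <Patient><name>Example</name></Patient>
  </es:entity>
  <es:source><raw>...</raw></es:source>
</es:envelope>"""


def materialization(envelope, entity_type):
    """Return the es:entity wrapper with the given type, or None if absent."""
    for entity in envelope.findall("es:entity", NS):
        if entity.get("type") == entity_type:
            return entity
    return None


envelope = ET.fromstring(ENVELOPE)
billing = materialization(envelope, "http://example.com/Patient#Billing")
print([child.tag for child in billing])
```

Each wrapper is materialized ahead of time, so selecting one is a lookup rather than a runtime model-to-model translation.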

damonfeldman commented 8 years ago

I'm a little unclear about what an es:instance is. I think of it as data that conforms to a defined model, as distinct from data that is handled less formally and does not have a model defined. To avoid BMOF (big modeling up front) I think it is important to have both, so in that view many things would not be instances. Is that accurate?

I actually like <index>, <serve>, <instance>, <metadata>. Can we use that? If something is index-only, I don't think of it as an instance, and it feels like we'd be shoehorning data into an "instance" because we have the instance hammer, so want the extra data to be an instance. Please correct me if I've misunderstood the nature of an instance.

grechaw commented 8 years ago

I think you have the notion of instance right, but it's more about SMOF (small model up front). Only define the things you know you'll need. Successive iterations will expose more of the common model.

This iterative approach I realize may not be a good assumption. Is it acceptable for folks to create just a sparse model and expect to migrate it forward over time?

jmakeig commented 8 years ago

@damonfeldman said,

I actually like <index>, <serve>, <instance>, <metadata>. Can we use that?

The effect is the same though, with index, serve, and instance: materialization of some model, whether explicit from an Entity Services definition or implicit from some code. What if you want to both index and serve a particular instance of an entity? The distinction gets fuzzy and unhelpful. (Yes, I’m contradicting what I asserted above.) Couldn’t we collapse these into an es:entity?

I updated my more detailed proposal above to reflect this (es:instance becomes es:entity).

jmakeig commented 8 years ago

This iterative approach I realize may not be a good assumption. Is it acceptable for folks to create just a sparse model and expect to migrate it forward over time?

The iterative approach should be our default assumption. One of the reasons MarkLogic is better is because you can model gradually as you understand the data/requirements. There may be greenfield applications that start with a mostly formed model upfront, but I think that’s the exception not the rule.

grechaw commented 7 years ago

I don't see anything actionable here right now -- it could be that Data Hub and Entity Services have complementary ways of editing the envelope now, or that a kind of entity services empty-envelope-generator function would help.

Set milestone to backlog, we can revisit as needed.
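For reference, the empty-envelope-generator idea could look roughly like the sketch below. It is a hypothetical helper in plain Python, not an existing Entity Services function. The `empty_envelope` name, its `sections` parameter, and the namespace URI are assumptions.

```python
import xml.etree.ElementTree as ET

ES_NS = "http://marklogic.com/entity-services"  # assumed URI for the es: prefix
ET.register_namespace("es", ES_NS)


def empty_envelope(sections=("headers", "triples", "instance")):
    """Build an es:envelope with one empty child element per section name."""
    envelope = ET.Element(f"{{{ES_NS}}}envelope")
    for name in sections:
        ET.SubElement(envelope, f"{{{ES_NS}}}{name}")
    return envelope


default = empty_envelope()
# The section names stay configurable, since the naming question is still open.
custom = empty_envelope(("entity", "source", "metadata"))
print(ET.tostring(default, encoding="unicode"))
```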