mediachain / mediachain-indexer

search, dedupe, and media ingestion for mediachain

Schema Discussion. #19

Closed. autoencoder closed this issue 8 years ago

autoencoder commented 8 years ago

Update: Schema shown here is outdated. Updated docs in progress.

This represents the inner data section of the artefact schema that will be output to the front-end via the Indexer API. This may slightly differ from the artefact schema stored in the blockchain, in that it may have some additional fields added to it, e.g. sightings_list (?)

Questions from previous discussions:

IDs:

Other questions:

Decisions so far:

parkan commented 8 years ago

Below

This represents the inner data section of the artefact schema that will be output to the front-end via the Indexer API. This may slightly differ from the artefact schema stored in the blockchain, in that it may have some additional fields added to it, e.g. sightings_list (?)

I think we should move the focus one level up, to the meta element -- this will make some of the following concerns much clearer

Questions from previous discussions:

Having both basic string versions + more complex versions for some fields, to make it easier on apps trying to use this metadata, e.g. having both 'author_name':'Picasso and Bob' and 'author':[{'name':'Picasso', 'role':'author'},{'name':'Bob','role':'author'}]

This feels like a "display name" issue and probably out of scope for us to worry about. The footprint of the data in the distributed system should be minimal (that being said, we should discuss which representation is appropriate here -- I'm leaning towards the latter one, because the former can be constructed from it w/o loss of information)
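To illustrate the "can be constructed w/o loss of information" point, here is a minimal sketch of deriving the flat string from the structured list. The helper name `display_name` and the " and " separator are assumptions for illustration, not part of any agreed schema.

```python
def display_name(authors):
    """Join structured author entries into one display string.

    `authors` is a list of dicts shaped like the 'author' field in the
    example above. Entries without a usable 'name' are skipped.
    """
    names = [a["name"] for a in authors if a.get("name")]
    return " and ".join(names)


authors = [
    {"name": "Picasso", "role": "author"},
    {"name": "Bob", "role": "author"},
]
print(display_name(authors))
```

Going the other way (splitting 'Picasso and Bob' back into entries with roles) is not reliably possible, which is the asymmetry behind preferring the structured form.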

Having all string-like fields always be lists of strings instead of a bare string, as DPLA does.

This actually seems to be applied kinda inconsistently on their end: even different sourceResource sections in the same document don't always follow it. Do you know what the reasoning is?
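For reference, the kind of defensive code a consumer ends up writing when a field may be either a bare string or a list of strings looks roughly like this. The helper name `as_list` is hypothetical.

```python
def as_list(value):
    """Normalize a field that may be missing, a bare value, or a list."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


print(as_list("Picasso"))
print(as_list(["Picasso", "Bob"]))
```

If the schema always emitted lists, consumers would not need this step, which may be part of the motivation for the always-a-list convention.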

null to indicate missing values.

Likewise here, I prefer "all fields optional" over including the fields and supplying explicit nulls, though an explicit null may be useful for the semantics of a strong "there is no value for this" assertion.
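A small sketch of the distinction, assuming records are plain dicts: an absent key means "nothing known", while an explicit null (`None` in Python) is the stronger "there is no value" assertion. The function name and the returned labels are made up for illustration.

```python
_MISSING = object()  # sentinel so an absent key is distinguishable from None


def describe(record, field):
    """Classify a field as unknown, asserted empty, or present."""
    value = record.get(field, _MISSING)
    if value is _MISSING:
        return "unknown"          # field omitted: no claim either way
    if value is None:
        return "asserted empty"   # explicit null: "there is no value"
    return "present"


record = {"title": "Guernica", "license": None}
for field in ("title", "license", "author"):
    print(field, describe(record, field))
```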

Support for multi-layer nesting of enriching modifications? Citing of multiple raw sources that are fused together?

I think maybe the problem here (and with some of the other questions) is that we're conflating the on-chain representation (i.e. individual update cells) and the folded result. This is obviously somewhat inevitable because we're trying to design from both ends of the system, but let's make an effort to separate them.
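To make the two representations concrete, here is a rough sketch of folding a sequence of update cells into a single result. The cell shape (plain dicts of field updates) and the "later cell wins" merge rule are assumptions; the thread does not specify the actual fold semantics.

```python
def fold(cells):
    """Fold an ordered list of update cells into one metadata dict.

    Assumption: each cell is a dict of field updates, and a later cell
    overrides an earlier one for the same field.
    """
    folded = {}
    for cell in cells:
        folded.update(cell)
    return folded


cells = [
    {"title": "Untitled", "author": "Picasso"},   # initial insert
    {"title": "Guernica"},                        # later correction
]
print(fold(cells))
```

Questions like multi-layer enrichment or multiple fused sources apply to the individual cells; the front-end would only ever see the output of something like `fold`.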

There's an assumption that the raw input is JSON -- instead of e.g. XML, raw HTML, text files, callback-wrapped JSON, CSRF guarded JSON?

The raw source is stored as literal bytes in IPFS precisely because it could be anything. From the perspective of sourcing metadata, the relationship between the source data and the normalized/translated form is opaque -- it just happens to be a simple field selection/renaming in this case.
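As a sketch of that "simple field selection/renaming" case: a translator takes opaque bytes and returns the normalized form. The field names in `FIELD_MAP` are hypothetical, and a translator for XML or HTML input would have the same bytes-in, dict-out signature with different internals.

```python
import json

# Hypothetical mapping from source field names to normalized names.
FIELD_MAP = {"creator": "author", "headline": "title"}


def translate(raw_bytes):
    """Turn raw source bytes into a normalized dict.

    This particular translator assumes the bytes are UTF-8 JSON; callers
    only see bytes in and a dict out.
    """
    source = json.loads(raw_bytes.decode("utf-8"))
    return {new: source[old] for old, new in FIELD_MAP.items() if old in source}


raw = b'{"creator": "Picasso", "headline": "Guernica", "internal_id": 7}'
print(translate(raw))
```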

Other questions:

[Indexer-API] Move sightings_list and other potentially huge fields to other API endpoints? Or, allow pagination on some sub-fields e.g. sightings_list - similar to the facebook API?
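One way the pagination option could look, as a minimal offset/limit sketch. The response keys `data` and `next_offset` are invented here and are not an existing Indexer API contract.

```python
def paginate(items, offset=0, limit=2):
    """Return one page of a large sub-field plus a cursor for the next page."""
    page = items[offset:offset + limit]
    has_more = offset + limit < len(items)
    return {"data": page, "next_offset": offset + limit if has_more else None}


sightings_list = ["site-a", "site-b", "site-c"]
first = paginate(sightings_list)
second = paginate(sightings_list, offset=first["next_offset"])
print(first)
print(second)
```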

parkan commented 8 years ago

Ok, general thoughts:

autoencoder commented 8 years ago

Merging in first iteration of changes.

parkan commented 8 years ago

@autoencoder Re: your last edit (about looking up by external ID) -- this is something we've been promising as a very first class feature and it probably deserves more attention. I was thinking this should be a kind of hand-crafted index
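A very rough in-memory sketch of what such an index might look like, keyed on a (source, external ID) pair. The class name, method names, and the idea of keying on the source are all assumptions for illustration; a real version would need persistence and a decision on how IDs are namespaced.

```python
from collections import defaultdict


class ExternalIdIndex:
    """Maps (source, external_id) pairs to internal artefact IDs."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, source, external_id, artefact_id):
        self._index[(source, external_id)].add(artefact_id)

    def lookup(self, source, external_id):
        # .get avoids creating an empty entry for unknown keys.
        return sorted(self._index.get((source, external_id), ()))


index = ExternalIdIndex()
index.add("dpla", "abc123", "artefact-1")
index.add("dpla", "abc123", "artefact-2")
print(index.lookup("dpla", "abc123"))
print(index.lookup("dpla", "unknown"))
```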