Schema Discussion.

autoencoder commented 8 years ago

Update: Schema shown here is outdated. Updated docs in progress.

This represents the inner data section of the artefact schema that will be output to the front-end via the Indexer API. This may slightly differ from the artefact schema stored in the blockchain, in that it may have some additional fields added to it, e.g. sightings_list (?)

Questions from previous discussions:

Having both basic string versions + more complex versions for some fields, to make it easier on apps trying to use this metadata, e.g. having both 'author_name':'Picasso and Bob' and 'author':[{'name':'Picasso', 'role':'author'},{'name':'Bob','role':'author'}]
Having all string-like fields always be lists of strings instead of a bare string, as DPLA does.
null to indicate missing values.
Support for multi-layer nesting of enriching modifications?
Citing of multiple raw sources that are fused together?
There's an assumption that the raw input is JSON -- instead of e.g. XML, raw HTML, text files, callback-wrapped JSON, CSRF guarded JSON?

IDs:

The blockchain currently allows updates / replacements / lookups by hash of the artefact metadata. Is there any value in allowing lookups in other ways, e.g. find(provider.name == "getty" and id_at_provider == 1234) or a convention like find(id == "getty:1234"), so you can update the metadata for Getty image 1234 even if you don't know what the original metadata was that was used to insert it? Note: This should probably be a single concatenated field, so joining isn't required.
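
A minimal sketch of the concatenated-key idea, assuming a plain in-memory dict stands in for whatever index ends up being used. The names `external_key` and `ExternalIdIndex`, and the placeholder hash strings, are hypothetical and not part of the blockchain or the Indexer API.

```python
from typing import Dict, Optional


def external_key(provider: str, id_at_provider: str) -> str:
    """Build a single concatenated lookup key such as 'getty:1234'."""
    return f"{provider.lower()}:{id_at_provider}"


class ExternalIdIndex:
    """Hypothetical index from 'provider:id' to the latest artefact hash."""

    def __init__(self) -> None:
        self._latest: Dict[str, str] = {}

    def record(self, provider: str, id_at_provider: str, artefact_hash: str) -> None:
        # A later insert for the same external ID replaces the earlier hash.
        self._latest[external_key(provider, id_at_provider)] = artefact_hash

    def find(self, key: str) -> Optional[str]:
        # Single-field lookup, so no join across provider and ID is needed.
        return self._latest.get(key)


index = ExternalIdIndex()
index.record("getty", "1234", "hash-of-original-metadata")
index.record("getty", "1234", "hash-of-updated-metadata")
```

With this shape, `index.find("getty:1234")` resolves to the most recent hash without the caller knowing the metadata that was originally inserted.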

Other questions:

[Indexer-API] Move sightings_list and other potentially huge fields to other API endpoints? Or, allow pagination on some sub-fields e.g. sightings_list - similar to the Facebook API?
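
One way the pagination option could look, as a rough sketch only. The `paginate` helper, the `data` / `next_offset` response keys, and the sample values are assumptions for illustration, not an existing Indexer API.

```python
from typing import Any, Dict, List


def paginate(items: List[Any], offset: int = 0, limit: int = 2) -> Dict[str, Any]:
    """Return one page of a large sub-field plus the offset of the next page."""
    page = items[offset:offset + limit]
    # next_offset is None once the last page has been served.
    next_offset = offset + limit if offset + limit < len(items) else None
    return {"data": page, "next_offset": next_offset}


sightings_list = ["url-a", "url-b", "url-c"]  # placeholder values
first = paginate(sightings_list)
second = paginate(sightings_list, offset=first["next_offset"])
```

The main artefact response would then carry only the first page, and clients that want the rest follow `next_offset`.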

Decisions so far:

[x] Add "official_permalink" field (Only allow a single permalink here?)
[x] Add "source_url" field(s)
[ ] Add "is_adult" flag
[ ] Add metadata from externally-linked pages (DPLA & Pexels)

parkan commented 8 years ago

Below

This represents the inner data section of the artefact schema that will be output to the front-end via the Indexer API. This may slightly differ from the artefact schema stored in the blockchain, in that it may have some additional fields added to it, e.g. sightings_list (?)

I think we should move the focus one level up, to the meta element -- this will make some of the following concerns much clearer

Questions from previous discussions:

Having both basic string versions + more complex versions for some fields, to make it easier on apps trying to use this metadata, e.g. having both 'author_name':'Picasso and Bob' and 'author':[{'name':'Picasso', 'role':'author'},{'name':'Bob','role':'author'}]

This feels like a "display name" issue and probably out of scope for us to worry about. The footprint of the data in the distributed system should be minimal (that being said, we should discuss which representation is appropriate here -- I'm leaning towards the latter one, because the former can be constructed from it w/o loss of information)
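
To illustrate why the structured form loses nothing, here is a small sketch that derives the display string from the list. The `display_name` helper and its joining rules are assumptions; an app could format names however it likes.

```python
from typing import Dict, List


def display_name(authors: List[Dict[str, str]]) -> str:
    """Derive a flat display string from the structured author list."""
    names = [a["name"] for a in authors]
    if len(names) <= 2:
        return " and ".join(names)
    # Three or more names: comma-separated with a final "and".
    return ", ".join(names[:-1]) + ", and " + names[-1]


author = [{"name": "Picasso", "role": "author"}, {"name": "Bob", "role": "author"}]
author_name = display_name(author)
```

Going the other way (splitting 'Picasso and Bob' back into individual names and roles) is not reliable, which is the asymmetry behind preferring the structured form.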

Having all string-like fields always be lists of strings instead of a bare string, as DPLA does.

This actually seems to be kinda inconsistently applied on their end; even different sourceResource sections in the same document sometimes don't do it consistently. Do you know what the reasoning is?
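
Whatever the reasoning, a consumer facing that kind of inconsistency can normalize on read. A minimal sketch, where `as_string_list` is a hypothetical helper and not anything DPLA provides:

```python
from typing import Any, List


def as_string_list(value: Any) -> List[str]:
    """Coerce a bare string, a list of strings, or a missing value to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return [str(v) for v in value]
```

This keeps downstream code on a single code path whether a field arrived as a bare string or as a list.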

null to indicate missing values.

Likewise here, I prefer "all fields optional" over including the fields and supplying explicit nulls, though an explicit null may be useful for the semantics of a strong "there is no value for this" assertion.
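
The two cases can be told apart on the consumer side, as in this sketch. The `describe` helper and its three labels are made up for illustration.

```python
from typing import Any, Dict

_MISSING = object()  # sentinel that can never collide with a real field value


def describe(record: Dict[str, Any], field: str) -> str:
    """Distinguish an absent field from an explicit null."""
    value = record.get(field, _MISSING)
    if value is _MISSING:
        return "not stated"      # field omitted: nothing is claimed
    if value is None:
        return "asserted empty"  # explicit null: "there is no value for this"
    return "present"


record = {"title": "Guernica", "license": None}
```

Here `describe(record, "license")` and `describe(record, "author")` give different answers, which is the extra information an explicit null carries.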

Support for multi-layer nesting of enriching modifications? Citing of multiple raw sources that are fused together?

I think maybe the problem here (and with some of the other questions) is that we're conflating the on-chain representation (i.e. individual update cells) and the folded result. This is obviously somewhat inevitable because we're trying to design from both ends of the system, but let's make an effort to separate them.
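
A toy sketch of the distinction, assuming each update cell carries a partial `meta` dict and that folding is a simple last-write-wins merge in chain order. Both the cell shape and the merge rule are assumptions; the real fold may need to be smarter about conflicts and multiple sources.

```python
from typing import Any, Dict, List


def fold(cells: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Fold a sequence of update cells into one merged record."""
    folded: Dict[str, Any] = {}
    for cell in cells:
        # Later cells override earlier values for the same field.
        folded.update(cell["meta"])
    return folded


cells = [
    {"meta": {"title": "Untitled", "author": "Picasso"}},
    {"meta": {"title": "Guernica"}},
]
folded = fold(cells)
```

The on-chain representation is the `cells` list; what the front-end sees is `folded`. Nesting and multi-source questions then become questions about the fold, not about the shape of an individual cell.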

There's an assumption that the raw input is JSON -- instead of e.g. XML, raw HTML, text files, callback-wrapped JSON, CSRF guarded JSON?

The raw source could easily be anything, which is exactly why it is stored as literal bytes in IPFS. From the perspective of sourcing metadata, the relationship between the source data and the normalized/translated form is opaque -- it just happens to be a simple field selection/renaming in this case.
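
For the simple JSON case, a translator might look roughly like the sketch below. `FIELD_MAP`, `translate`, and the sample source fields are hypothetical; a translator for XML or HTML input would have a different body but the same bytes-in, dict-out shape.

```python
import json
from typing import Any, Dict

# Hypothetical mapping from source field names to normalized field names.
FIELD_MAP = {"artist": "author", "caption": "title"}


def translate(raw: bytes) -> Dict[str, Any]:
    """Turn raw source bytes into the normalized form by selecting and renaming fields."""
    source = json.loads(raw.decode("utf-8"))
    # Fields not listed in FIELD_MAP are dropped from the normalized form.
    return {new: source[old] for old, new in FIELD_MAP.items() if old in source}


raw = b'{"artist": "Picasso", "caption": "Guernica", "internal_code": 7}'
normalized = translate(raw)
```

Nothing outside `translate` needs to know the raw bytes were JSON, which is what keeps the relationship opaque to the rest of the system.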

Other questions:

[Indexer-API] Move sightings_list and other potentially huge fields to other API endpoints? Or, allow pagination on some sub-fields e.g. sightings_list - similar to the Facebook API?

parkan commented 8 years ago

Ok, general thoughts:

Let's try to repackage this as a chain cell, with appropriate toplevel metadata (raw ref/source, translator version, etc) and minus some of the derived/indexer-specific stuff. I think this will give us a bit more clarity -- the folded representation, with multiple sources, will be an incremental step
I wonder if some of the potentially fat subfields (tags, exif) belong elsewhere

autoencoder commented 8 years ago

Merging in first iteration of changes.

parkan commented 8 years ago

@autoencoder Re: your last edit (about looking up by external ID) -- this is something we've been promising as a very first class feature and it probably deserves more attention. I was thinking this should be a kind of hand-crafted index

mediachain / mediachain-indexer

Schema Discussion. #19