mediachain / oldchain-client

[DEPRECATED] old mediachain client experiments
MIT License

WIP: translator output validation #60

Closed · parkan closed this 8 years ago

parkan commented 8 years ago

This uses a jsonschema autogenerated from the pexels subset of the overall metaschema, available here:

http://54.209.175.109:6006/schemas_normalized.js
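For illustration, a minimal sketch of this validation step, assuming the Python `jsonschema` package; the inline schema and records below are hypothetical stand-ins for the real pexels-derived schema linked above:

```python
import jsonschema

# Tiny illustrative stand-in for the autogenerated pexels schema.
PEXELS_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "artist": {"type": "string"},
    },
    "required": ["title", "artist"],
}

def validate_translator_output(record: dict) -> None:
    """Raise jsonschema.ValidationError if the record doesn't match the schema."""
    jsonschema.validate(instance=record, schema=PEXELS_SCHEMA)

validate_translator_output({"title": "Sunset", "artist": "A. Photographer"})  # ok

try:
    validate_translator_output({"title": "Sunset"})  # missing required field
except jsonschema.ValidationError as e:
    print(e.message)  # 'artist' is a required property
```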

Still needs:

parkan commented 8 years ago

@yusefnapora @tg I think this is mostly good to go once I have an example to generate the jsonschema from?

yusefnapora commented 8 years ago

Awesome, this looks good. Are we using the schema from https://github.com/mediachain/mediachain-indexer/pull/19 as the basis for the metaschema for now?

parkan commented 8 years ago

Basically, yes; we still need to pull an updated/cleaned-up version though.

autoencoder commented 8 years ago

To be sure we're all talking about the same stages of metadata representation, here's a diagram:

[(A) form as seen by external apps who are writing]
               |
               v
[(B) form after being processed by client writer]
               |
               v
[(C) form as stored on the blockchain DB]
               |
               v
[(D) form after being read back by client reader]
               |
               v
[(E) form as seen by external apps who are reading]

Nice-to-have features, from the "external apps" side:

* "External apps" are apps built on top of the blockchain such as indexers, bulk upload applications, blockchain explorers, etc.

Edit: I also have a small collection of code for auto-generating schema definitions from the combined shape of all records, checking inter-record consistency rules, and computing unions / intersections / diffs between many schema records. This code was needed to resolve some ES typing-consistency / record-normalization bugs. I'll share some of it when the other priorities are out of the way.
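As an illustration (this is not the code mentioned above), a minimal sketch of deriving a schema "shape" from a record and comparing two shapes via union / intersection / diff:

```python
def shape(record: dict) -> dict:
    """Map each field name to the type name of its value."""
    return {k: type(v).__name__ for k, v in record.items()}

a = shape({"title": "x", "width": 640, "tags": ["sea"]})
b = shape({"title": "y", "width": 640, "license": "CC0"})

union = {**a, **b}
intersection = {k: a[k] for k in a.keys() & b.keys() if a[k] == b[k]}
diff = {k: (a.get(k), b.get(k)) for k in a.keys() ^ b.keys()}

print(union)         # fields present in either record
print(intersection)  # fields with agreeing types in both
print(diff)          # fields present in only one of the two
```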

parkan commented 8 years ago

@autoencoder, responses inline below:

> Forcing schema typing consistency for all field values, between all records -- Many of the indexes we're already using, e.g. ES, ndarray, are actually backed by flat, relational-like storage engines, which only allow one data type per field over the entire collection (ignoring nulls or other "empty" indicators). These indexes don't ever allow you to change the type of a field. Presence or absence of fields can change, but not the types of the field values.

I'm a bit inclined to think of this as an application-level problem, because enforcing schema consistency (in the CAP sense) on top of data consistency at the DB level sounds like quite an endeavor 😆
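To make that flat-store constraint concrete, here's a minimal sketch of what such an application-level check could look like; the `FieldTypeRegistry` helper is hypothetical:

```python
class FieldTypeRegistry:
    """Enforce one type per field across a collection, ignoring nulls."""

    def __init__(self):
        self._types = {}  # field name -> first type seen

    def check(self, record: dict) -> None:
        """Reject a record whose field types conflict with earlier records."""
        for field, value in record.items():
            if value is None:  # "empty" indicators don't pin a type
                continue
            seen = self._types.setdefault(field, type(value))
            if type(value) is not seen:
                raise TypeError(
                    f"field {field!r}: {type(value).__name__} "
                    f"conflicts with earlier {seen.__name__}"
                )

registry = FieldTypeRegistry()
registry.check({"width": 640})
registry.check({"width": None})  # fine: nulls are ignored

try:
    registry.check({"width": "640"})
except TypeError as e:
    print(e)  # field 'width': str conflicts with earlier int
```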

> Ability for apps to add new fields to the schema, without having to modify or augment any client code -- could be highly valuable for external apps built on this blockchain. In this case, it's only important to enforce basic things like: inter-record schema consistency, maximum recursion depth, maximum serialized metadata bytes.

Can you give an example? I'm not sure inter-record schema consistency is really possible with an immutable data model; I think the absolute best we can hope for is Thrift/Protobuf-style forward compatibility, where fields can be added or removed but not repurposed.
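A minimal sketch of that forward-compatibility rule, assuming schemas are represented as simple field-to-type-name maps (names hypothetical): additions and removals pass, repurposed fields are rejected.

```python
def forward_compatible(old: dict, new: dict) -> bool:
    """Shared fields must keep their types; adding/removing fields is fine."""
    return all(old[f] == new[f] for f in old.keys() & new.keys())

v1 = {"title": "str", "width": "int"}
v2 = {"title": "str", "license": "str"}  # `width` removed, `license` added
v3 = {"title": "str", "width": "str"}    # `width` repurposed: int -> str

print(forward_compatible(v1, v2))  # True
print(forward_compatible(v1, v3))  # False
```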

> It'd be extremely nice for testing if the metadata input at (A) was byte-identical to the metadata output at (E). The client / blockchain will likely enrich the metadata with e.g. "ref" IDs, but maybe those can go somewhere else so the original metadata can still be trivially checked to see that it's identical to what was input? E.g. {"original_metadata":xx, "enriched_fields":yy}.

This was part of the reasoning behind separating the "data" and "structure" layers: the meta/data object does not make any references to objects within the system. This is also why I asked you to remove the mediachain_id key from the schema example, and otherwise clean out anything that lives at the system/structure level 😄

Relationship metadata, like authorship, remix relationships, etc., is represented through Link cells at the structure layer. I found this a bit awkward at first, but the more I think about it, the more sense it makes.
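For reference, a minimal sketch of the envelope suggested in the quote above; the original_metadata / enriched_fields keys come from that suggestion, while the wrap/unwrap helpers are hypothetical:

```python
import json

def wrap(original: dict, enrichments: dict) -> dict:
    """Keep system-added fields out of the original metadata."""
    return {"original_metadata": original, "enriched_fields": enrichments}

def unwrap(envelope: dict) -> dict:
    return envelope["original_metadata"]

submitted = {"title": "Sunset", "artist": "A. Photographer"}  # stage (A)
stored = wrap(submitted, {"ref": "some-system-level-id"})     # stages (B)-(C)

# The output at (E) can then be checked byte-for-byte against the input at
# (A), given a canonical serialization (here: sorted keys).
assert json.dumps(unwrap(stored), sort_keys=True) == \
       json.dumps(submitted, sort_keys=True)
```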

autoencoder commented 8 years ago

> fields can be added or removed but not repurposed.

Yes, that's exactly the rule some of our storage engines in the Indexer expect.

You mean "CAP consistency" as in all nodes being aware of the types of the fields? That's ideally solved by the blockchain mechanism, right?

Beyond a certain deadline, a block is committed that locks the types to a certain definition, regardless of when each end user originally submitted their records for inclusion in the blockchain. Conflicting field types within the same block are rejected, and only one wins.
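A minimal sketch of that commit rule, with hypothetical names: within one block the first type seen for a field wins, conflicting records are rejected, and committed types stay locked for later blocks.

```python
def commit_block(pending_records, locked_types):
    """Return (accepted, rejected, updated locked_types) for one block."""
    accepted, rejected = [], []
    block_types = dict(locked_types)  # types locked by earlier blocks
    for record in pending_records:
        conflict = any(
            field in block_types and block_types[field] is not type(value)
            for field, value in record.items()
        )
        if conflict:
            rejected.append(record)
            continue
        for field, value in record.items():
            block_types.setdefault(field, type(value))  # first type wins
        accepted.append(record)
    return accepted, rejected, block_types

accepted, rejected, locked = commit_block(
    [{"width": 640}, {"width": "640"}], {}
)
print(len(accepted), len(rejected))  # 1 1 -- only one field type wins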

parkan commented 8 years ago

I'm just not sure what establishes the canonical nature of a schema, especially if it's specified by example. This is probably a philosophical discussion as much as anything else.

parkan commented 8 years ago

@tg to respond to your original comment more generally: the validation mentioned here happens in (B), before anything is written to (C). When you get a chance, let's drop in a union/metaschema that I can generate the jsonschema from, and then close this.

parkan commented 8 years ago

OK, I'm gonna merge this to make things a bit easier for myself.