SimpleClient / Schemas / Translators API

As referenced in Pull #20

Many of the TODOs below depend on the WIP schema code, so we'll work on that first.

TODOs for schema and SimpleClient API:

[ ] Code usage examples. These should probably be sketched out before each API extension is made, to guide us and to make sure we're meeting the critical needs of the Frontend & Indexer.
[ ] Ingest exactly 2 sizes of image for each artwork: the original high-res image and a lower-resolution version (512px width?). Provide access to both sizes from the client reading API as well. This requirement comes from the Frontend project.
[ ] Standardize the location of these 2 sizes of cached image files in a place where both the Indexer and Client can access them, so they can be served to the Frontend. We should also ensure that the file-like objects returned by the Client API have a name attribute that points to the image file, and that these paths are generated in an easily reproducible, deterministic way. This assumes that the Client & Indexer will always be on the same machine, which I suspect is the intent?
[ ] Support native IDs for artefacts and entities, to further prevent multiple inserts of the same real-world artefact / entity producing multiple objects in the blockchain.
[ ] Support multiple entity attribution links per artefact (multiple artists).
[ ] Figure out exact API for creating / deleting artefact duplicate / derivation links.
[ ] Eventually move SimpleClient code to Client package, but not before the schema / normalizer changes settle down, to avoid breaking the working code that the Frontend is depending on.
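
As a starting point for the usage examples, here is a minimal sketch of how the two cached image sizes could get deterministic, reproducible paths. Everything named here is an assumption for illustration and not existing code: the cache root, the size labels, the `.jpg` extension, and the `cached_image_path` helper itself. It only assumes that each artwork has a stable native ID string.

```python
import hashlib
import os

# Hypothetical cache root and size labels; not taken from the codebase.
CACHE_ROOT = "/var/cache/mediachain/images"
SIZES = ("original", "512w")


def cached_image_path(native_id, size, cache_root=CACHE_ROOT):
    """Return a deterministic path for one cached size of an artwork image.

    The path depends only on the native ID and the size label, so the
    Indexer and the Client can each compute it without coordinating.
    """
    if size not in SIZES:
        raise ValueError("unknown size: %r" % (size,))
    digest = hashlib.sha256(native_id.encode("utf-8")).hexdigest()
    # Two-level fan-out so no single directory grows too large.
    return os.path.join(cache_root, size, digest[:2], digest[2:4], digest + ".jpg")


if __name__ == "__main__":
    for size in SIZES:
        print(cached_image_path("flickr:456", size))
```

Opening the returned path with the built-in `open(path, "rb")` gives a file object whose `name` attribute is that path, which would cover the name-attribute requirement as long as the Client and Indexer share a filesystem.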

Partial sketch of some schema enhancement ideas, as discussed on slack:

// Note: "QQxxxxxx" are mediachain addresses. Could also use uuid type IDs here.

{"artefacts": {"QQ777777": final_artefact_goes_here},
 "linked_entities": {"QQ888888": final_entity_goes_here,
                     "QQ99999": final_entity_goes_here},
 "raw_input_views_sightings": [
     {"feeds_nodes": ["QQ888888"],
      "sighting_id": "https://flickr.com/user-888",
      "source": "https://flickr.com/api/users/888",
      "timestamp_utc": 1268934291,
      "value": "user_json_goes_here"
     },
     {"feeds_nodes": ["QQ99999"],
      "sighting_id": "https://flickr.com/user-999",
      "source": "https://flickr.com/api/users/999",
      "timestamp_utc": 1268934291,
      "value": "user_json_goes_here"
     },
     {"feeds_nodes": ["QQ777777", "QQ888888"],
      "sighting_id": "https://flickr.com/nice-photo-456",
      "source": "https://flickr.com/api/galleries/456",
      "timestamp_utc": 1268934292,
      "value": "gallery_json_goes_here"
     },
     {"feeds_nodes": ["QQ777777"],
      "sighting_id": "https://flickr.com/nice-photo-456",
      "source": "https://flickr.com/api/photos/456",
      "timestamp_utc": 1368934293,
      "value": "photo_json_goes_here"
     },
     {"feeds_nodes": ["QQ777777"],
      "sighting_id": "https://flickr.com/nice-photo-456",
      "source": "https://flickr.com/nice-photo-456",
      "timestamp_utc": 1468934294,
      "value": "scraped_html_content_goes_here"
     }
 ]
}

Key points being sketched here:

Support multiple raw input sources, which get joined / merged together in the final output.
Sighting IDs to indicate when multiple "views" are all views of a single sighting.
"artefacts" & "linked_entities" - Maybe instead of deep recursive enrichment of a metadata tree, we can have firmer layered separation, as in other layered protocols? It'd make testing and extension of the protocol much easier.
Each input indicates which blockchain objects it influenced in the final output. We may want to extend this further with a mapping from the exact character spans of each input byte string to the character spans of each field in the final output values that they influenced.
Timestamps and source identifiers for all raw inputs.
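
To make the key points above concrete, here is a rough sketch of two lookups the sketched structure would allow: grouping "views" by sighting ID, and listing which raw input sources fed each blockchain object. The field names come from the sketch above; the helper names `views_by_sighting` and `sources_by_node` and the trimmed sample data are made up for illustration.

```python
from collections import defaultdict

# Trimmed sample in the shape of "raw_input_views_sightings" from the sketch.
RAW_INPUTS = [
    {"feeds_nodes": ["QQ888888"],
     "sighting_id": "https://flickr.com/user-888",
     "source": "https://flickr.com/api/users/888"},
    {"feeds_nodes": ["QQ777777", "QQ888888"],
     "sighting_id": "https://flickr.com/nice-photo-456",
     "source": "https://flickr.com/api/galleries/456"},
    {"feeds_nodes": ["QQ777777"],
     "sighting_id": "https://flickr.com/nice-photo-456",
     "source": "https://flickr.com/api/photos/456"},
]


def views_by_sighting(raw_inputs):
    """Group raw input views so each sighting is listed once."""
    grouped = defaultdict(list)
    for view in raw_inputs:
        grouped[view["sighting_id"]].append(view)
    return dict(grouped)


def sources_by_node(raw_inputs):
    """Map each output object ID to the sources that influenced it."""
    index = defaultdict(list)
    for view in raw_inputs:
        for node in view["feeds_nodes"]:
            index[node].append(view["source"])
    return dict(index)


if __name__ == "__main__":
    for sighting, views in views_by_sighting(RAW_INPUTS).items():
        print(sighting, len(views))
    print(sources_by_node(RAW_INPUTS))
```

Because both lookups only read the flat list of raw inputs, they can be tested without touching the "artefacts" or "linked_entities" layers, which is the kind of layered separation suggested above.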

Discussion:

"Sightings" - when the same image is shared on a social network multiple times.
"Views" - when there are multiple complementary views of a single sighting, e.g. multiple JSON API calls to get the full metadata, plus the HTML page for the image.

Metadata will often end up coming from multiple sources, in addition to multiple sightings of the same object at different locations on a single source. This suggests we allow multiple "raw inputs" for each metadata record. These multiple input tables will then be joined (like an SQL join) or merged (two similar tables with different levels of completeness that are combined into a more complete record) to create the final output.
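
A minimal sketch of the two operations, assuming each raw input has already been parsed into flat dicts. The function names `merge_records` and `join_on` and the field names (`title`, `artist_name`, `user_id`, and so on) are hypothetical, and the "earlier record wins" rule is just one possible conflict policy.

```python
def merge_records(*records):
    """Merge partial records into one more complete record.

    Earlier records take priority; later records only fill in fields
    that are still missing or empty.
    """
    merged = {}
    for record in records:
        for key, value in record.items():
            if merged.get(key) in (None, "") and value not in (None, ""):
                merged[key] = value
    return merged


def join_on(left, right, key):
    """Left-join two lists of records on a shared key, SQL style."""
    right_by_key = {row[key]: row for row in right}
    return [merge_records(row, right_by_key.get(row[key], {})) for row in left]


if __name__ == "__main__":
    api_view = {"title": "Nice photo", "artist_name": ""}
    html_view = {"title": "nice photo | Flickr", "artist_name": "user-888"}
    print(merge_records(api_view, html_view))

    photos = [{"photo_id": "456", "user_id": "888"}]
    users = [{"user_id": "888", "artist_name": "user-888"}]
    print(join_on(photos, users, "user_id"))
```

Keeping the priority order explicit in the argument list makes it easy to say, per source, which view is trusted most when two views disagree.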

To avoid excessive duplication when displaying sightings info on the Frontend, e.g. when a single image upload can be accessed via an official JSON API, a mobile page, a desktop page, and an app JSON endpoint, we give all of these "views" of the same sighting the same sighting ID. Here, the sighting ID is some kind of natural identifier from the dataset, which all sources of metadata for a particular image should be able to generate independently and identically.
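
One way this could work for page-style views is to canonicalize the URL each view was fetched from. The sketch below is only an assumption about how a sighting ID might be derived: the `sighting_id_from_url` helper and the list of host prefixes are hypothetical, and the URLs follow the placeholder style used in the schema sketch rather than any real site layout. API views would likely need a per-source rule that maps their own identifiers to the same canonical form.

```python
from urllib.parse import urlsplit

# Hypothetical host prefixes that denote alternate views of the same page.
VIEW_PREFIXES = ("www.", "m.", "mobile.")


def sighting_id_from_url(url):
    """Reduce a page URL to a canonical sighting ID.

    Drops the scheme, view-specific host prefixes, the query string,
    the fragment, and any trailing slash, so desktop and mobile pages
    for the same upload map to one ID.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    for prefix in VIEW_PREFIXES:
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    path = parts.path.rstrip("/")
    return "https://%s%s" % (host, path)


if __name__ == "__main__":
    print(sighting_id_from_url("http://m.flickr.com/nice-photo-456/?ref=app"))
    print(sighting_id_from_url("https://www.flickr.com/nice-photo-456"))
```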
