mediachain / mediachain-indexer

search, dedupe, and media ingestion for mediachain

SimpleClient / Schemas / Translators API #21

autoencoder closed 8 years ago

autoencoder commented 8 years ago

As referenced in Pull #20

Many of these depend on the WIP schema code, so we'll work on that first.

TODOs for schema and SimpleClient API:

Partial sketch of some schema enhancement ideas, as discussed on Slack:

// Note: "QQxxxxxx" are mediachain addresses. Could also use UUID-style IDs here.

{"artefacts":{"QQ777777":"final_artefact_goes_here"},
 "linked_entities":{"QQ888888":"final_entity_goes_here", "QQ99999":"final_entity_goes_here"},
 "raw_input_views_sightings":[{"feeds_nodes":["QQ888888"],
                "sighting_id":"https://flickr.com/user-888",
                "source":"https://flickr.com/api/users/888",
                "timestamp_utc":1268934291,
                "value":"user_json_goes_here"
                },
               {"feeds_nodes":["QQ99999"],
                "sighting_id":"https://flickr.com/user-999",
                "source":"https://flickr.com/api/users/999",
                "timestamp_utc":1268934291,
                "value":"user_json_goes_here"
                },
               {"feeds_nodes":["QQ777777", "QQ888888"],
                "sighting_id":"https://flickr.com/nice-photo-456",
                "source":"https://flickr.com/api/galleries/456",
                "timestamp_utc":1268934292,
                "value":"gallery_json_goes_here"
                },
               {"feeds_nodes":["QQ777777"],
                "sighting_id":"https://flickr.com/nice-photo-456",
                "source":"https://flickr.com/api/photos/456",
                "timestamp_utc":1368934293,
                "value":"photo_json_goes_here"
                },
               {"feeds_nodes":["QQ777777"],
                "sighting_id":"https://flickr.com/nice-photo-456",
                "source":"https://flickr.com/nice-photo-456",
                "timestamp_utc":1468934294,
                "value":"scraped_html_content_goes_here"
                }
               ]
}

Key points being sketched here:

Discussion:

Metadata will often come from multiple sources, in addition to multiple sightings of the same object at different locations on a single source. This suggests we allow multiple "raw inputs" for each metadata record. These multiple input tables will then be joined (like an SQL join) or merged (two similar tables with different levels of completeness that are combined into a more complete record) to create the final output.
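A minimal sketch of the "merge" case, assuming each raw input has already been translated into a flat dict of fields. The function name `merge_records`, the field names, and the "earlier input wins, later inputs only fill gaps" rule are all illustrative assumptions, not part of any existing mediachain API:

```python
def merge_records(records):
    """Merge partial records into one more complete record.

    Assumed policy (hypothetical): earlier records take priority,
    later records only fill in fields that are missing or empty.
    """
    merged = {}
    for record in records:
        for key, value in record.items():
            # Only fill a field if nothing useful has been stored yet.
            if merged.get(key) in (None, ""):
                merged[key] = value
    return merged


# Two hypothetical views of the same photo with different completeness.
api_view = {"title": "Nice photo", "author": None, "license": "CC-BY"}
html_view = {"title": "Nice photo", "author": "user-888", "tags": ["sunset"]}

merged = merge_records([api_view, html_view])
```

A real translator would likely need per-field conflict rules (for example, preferring the official API over scraped HTML); this sketch only shows the gap-filling idea.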

To avoid excessive duplication when displaying sighting info on the frontend, e.g. when a single image upload can be accessed via an official JSON API, a mobile page, a desktop page, and an app JSON endpoint, we give all of these "views" of the same sighting the same sighting ID. Here, the sighting ID is some kind of natural identifier from the dataset, which all sources of metadata for a particular image should be able to generate independently and identically.
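One way the shared sighting ID could be used, sketched under the assumption that raw inputs look like the entries of "raw_input_views_sightings" in the schema sketch. The helper name `group_views_by_sighting` is hypothetical, and the `value` field is omitted for brevity:

```python
from collections import defaultdict


def group_views_by_sighting(raw_inputs):
    """Group raw input views by sighting_id, oldest view first."""
    grouped = defaultdict(list)
    for view in raw_inputs:
        grouped[view["sighting_id"]].append(view)
    for views in grouped.values():
        views.sort(key=lambda v: v["timestamp_utc"])
    return dict(grouped)


# Entries mirror the schema sketch.
raw_inputs = [
    {"feeds_nodes": ["QQ888888"],
     "sighting_id": "https://flickr.com/user-888",
     "source": "https://flickr.com/api/users/888",
     "timestamp_utc": 1268934291},
    {"feeds_nodes": ["QQ777777", "QQ888888"],
     "sighting_id": "https://flickr.com/nice-photo-456",
     "source": "https://flickr.com/api/galleries/456",
     "timestamp_utc": 1268934292},
    {"feeds_nodes": ["QQ777777"],
     "sighting_id": "https://flickr.com/nice-photo-456",
     "source": "https://flickr.com/api/photos/456",
     "timestamp_utc": 1368934293},
    {"feeds_nodes": ["QQ777777"],
     "sighting_id": "https://flickr.com/nice-photo-456",
     "source": "https://flickr.com/nice-photo-456",
     "timestamp_utc": 1468934294},
]

sightings = group_views_by_sighting(raw_inputs)
```

With this grouping, a frontend could show one entry per sighting and list the individual sources as views underneath it, rather than showing each source as a separate sighting.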

parkan commented 8 years ago

Superseded by PII design