Translator DSL - Githubissues

yusefnapora commented 8 years ago

Starting this issue for discussion about a declarative translator DSL design. The dynamically-loaded Python modules we've got now get the job done, but are vulnerable to malicious or accidentally damaging code execution.

Ideally we want a DSL for extracting and transforming the data we care about from its native format into a collection of mediachain records. To avoid RCE issues, an "external DSL" is preferred to something embedded inside a general-purpose host language.

Features I'd like to see:

can embed / link to XML or JSON schema to validate inputs before translation
simple syntax for selecting fields (maybe XPath? or something similar...) and mapping to output fields
some textual transformation functions to "massage" inputs a bit
filtering based on field values. (e.g. if thing.role == artist, etc)
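As a rough illustration of the last two features, text transformations and value filters could be declared as plain data and looked up in a fixed whitelist, so nothing in a translator is executed as code. This is only a sketch: the names `TRANSFORMS`, `OPERATORS`, `apply_transforms` and `matches` are hypothetical and not part of any existing mediachain code.

```python
import operator

# Hypothetical whitelist: a translator may only name functions listed here.
TRANSFORMS = {
    "strip": str.strip,
    "lower": str.lower,
    "title": str.title,
}

# Hypothetical whitelist of comparison operators usable in filters.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
}


def apply_transforms(value, names):
    """Apply whitelisted transforms in order; unknown names raise KeyError."""
    for name in names:
        value = TRANSFORMS[name](value)
    return value


def matches(record, rule):
    """Evaluate a (field, operator, expected) filter rule against a dict."""
    field, op, expected = rule
    return OPERATORS[op](record.get(field), expected)


cleaned = apply_transforms("  Jane DOE ", ["strip", "lower"])
is_artist = matches({"role": "artist"}, ("role", "==", "artist"))
```

Because the rule is a tuple of strings rather than an expression, an unknown operator or transform fails with a `KeyError` instead of running arbitrary code.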

Ideally, we want something that doesn't "feel like programming", although that may be unavoidable to some extent.

Implementation thoughts...

A while ago I looked at Xtext, a framework for creating DSLs and generating an object model from them. It's an extremely Java-centric solution, but there is a Python clone called textX that could be interesting. It has a very similar grammar and generates a graph of Python model objects from the DSL input. It doesn't have some of the fancier features (like generating an IntelliJ plugin or web-based editor for your language). But it seems nicer than rolling our own parser, etc...

An interesting JavaScript project I ran across during earlier research: http://defiantjs.com/ - converts JSON to/from XML and uses XPath for query and filtering. XPath is very flexible, and could be a decent choice for field selection. We could use the same idea and convert from JSON to XML for query / extraction using Python's xml.etree.ElementTree classes, which support a limited subset of XPath.
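A minimal sketch of that JSON-to-XML idea, using only the standard library. The helper names `json_to_element` and `select` and the sample record are made up for illustration, and the sketch assumes every JSON key is a valid XML tag name. Note that ElementTree only accepts paths relative to an element, so selectors are written as `.//id` rather than `//id`.

```python
import json
import xml.etree.ElementTree as ET


def json_to_element(tag, value):
    """Recursively convert a parsed JSON value into an ElementTree element."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            if isinstance(child, list):
                # Repeat the tag once per list item so they become siblings.
                for item in child:
                    elem.append(json_to_element(key, item))
            else:
                elem.append(json_to_element(key, child))
    elif isinstance(value, list):
        # A list with no enclosing key gets a generic tag per item.
        for item in value:
            elem.append(json_to_element("item", item))
    elif value is not None:
        elem.text = str(value)
    return elem


def select(root, path):
    """Return the text of every element matching an ElementTree path."""
    return [e.text for e in root.findall(path)]


# Made-up sample input, loosely shaped like the fields used in this thread.
record = json.loads("""
{
  "id": 12345,
  "artist": "Jane Doe",
  "editorial_source": {"name": "Example Source"},
  "keywords": [{"text": "portrait"}, {"text": "studio"}]
}
""")

root = json_to_element("record", record)
ids = select(root, ".//id")
source = select(root, ".//editorial_source/name")
keywords = select(root, ".//keywords/text")
```

Every selector returns a list, including those that match a single element, and all values come back as strings because they are stored as element text.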

Here's what a getty translator might look like with XPath style selectors:

translator:
    name: getty
    format: json
    schema: /ipfs/QmF00  # link to json schema for input

artefact:
    copy(title, artist, collection_name, caption, date_created) # copy without transformation

    _id: "getty_" + //id
    editorial_source: //editorial_source/name
    keywords: //keywords/text
    images: asset_link(//display_sizes)

artefactCreatedBy:
    entity:
        name: //artist

One thing that stands out is that the XPath selectors can potentially match multiple fields in the input, so we'd either need to consider cardinality per field (e.g. title is a single string, but keywords is a list), or else just say everything is a list and can have multiple values.
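The first option, declaring cardinality per field, might look roughly like this. The `resolve` function and the `"one"` / `"many"` labels are hypothetical, chosen just to show the idea.

```python
def resolve(field, matched, cardinality):
    """Collapse a list of selector matches according to a declared cardinality.

    "many" keeps every match as a list; "one" requires exactly one match
    and returns it unwrapped, raising ValueError otherwise.
    """
    if cardinality == "many":
        return list(matched)
    if cardinality == "one":
        if len(matched) != 1:
            raise ValueError(
                f"field {field!r} expected exactly one value, got {len(matched)}"
            )
        return matched[0]
    raise ValueError(f"unknown cardinality {cardinality!r} for field {field!r}")


title = resolve("title", ["Sunset"], "one")
keywords = resolve("keywords", ["portrait", "studio"], "many")
```

The second option (everything is a list) would amount to always using `"many"`, which keeps the translator simpler but pushes the single-value check onto whoever consumes the records.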

Anyway, these are just some thoughts that have been rattling around my head for a while. I figure it's worth considering what our ideal DSL would look like before we start diving into anything :)

parkan commented 8 years ago

Some other JSON translators and related work to potentially look at:

http://goessner.net/articles/jsont/
http://ajaxian.com/archives/transforming-json
https://github.com/bazaarvoice/jolt
https://www.p6r.com/articles/2008/05/06/xslt-and-xpath-for-json/
https://www.w3.org/TR/xslt-30/#json
http://jsoniq.org/

and, of course, jq

parkan commented 8 years ago

Another option is to use https://newville.github.io/asteval/, which is already used by tg in a couple of places
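asteval builds on Python's `ast` module to evaluate expressions without calling `eval` directly. The sketch below does not use asteval itself. It is a much smaller standard-library illustration of the same general approach, where only a handful of node types are accepted and everything else is rejected. `safe_eval` is a hypothetical name, and the supported subset (constants, names, `+`, `==`) is an arbitrary choice for the example.

```python
import ast


def safe_eval(expr, names):
    """Evaluate a tiny expression subset: constants, names, '+', and '=='."""

    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (str, int, float)):
            return node.value
        if isinstance(node, ast.Name):
            # Only names supplied by the caller are visible.
            return names[node.id]
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
            return ev(node.left) + ev(node.right)
        if (
            isinstance(node, ast.Compare)
            and len(node.ops) == 1
            and isinstance(node.ops[0], ast.Eq)
        ):
            return ev(node.left) == ev(node.comparators[0])
        # Calls, attribute access, subscripts, etc. all end up here.
        raise ValueError(f"unsupported expression: {type(node).__name__}")

    return ev(ast.parse(expr, mode="eval"))


record_id = safe_eval('"getty_" + id', {"id": "12345"})
is_artist = safe_eval('role == "artist"', {"role": "artist"})
```

Function calls such as `__import__("os")` parse to an `ast.Call` node, which is not in the accepted set, so they raise `ValueError` rather than being executed.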

parkan commented 8 years ago

After considering the full range of work facing us, I'm going to deprioritize this for the near future, because it's a pretty complex undertaking with high expressiveness and security requirements. Let's consider translators "use at own risk" for the moment.

Unless:

a new contributor comes on board to focus on this
a lot of writing activity is seen on testnet from 3rd parties
workflow management implementation ends up being generalized into this territory

mediachain / oldchain-client

Translator DSL #70