Replace JSONPath with 'jq'

alankm commented 3 years ago

What would you like to be added:

Replace JSONPath with jq for all Workflow Data Expressions.

Why is this needed:

JSONPath lacks power and flexibility.

JSONPath was designed to query JSON documents for matches, especially arrays. It wasn't designed to handle complex logic, and the limitations in the spec show. JSONPath's inability to reference parent objects or property names of matching items forces users to contort their data in all sorts of strange ways.

For example, I wanted to put together a simple workflow that transforms an input from

{
    "name": "Alan"
}

to

{
    "greeting": "Hello, Alan!"
}

Here's my solution:

id: helloworld
name: "Hello, World!"
states:
- name: Hello
  type: inject
  data:
    output:
      greeting: "Hello, {{ $.name }}!"
  start:
    kind: default
  end:
    kind: default
  stateDataFilter:
    dataOutputPath: "$.output"

It's possible, but it's convoluted. And that's for a simple helloworld. I had to nest my solution in a sub-property named output just so that I could use the stateDataFilter to replace the entire state with the constructed output afterwards.

If we'd been using jq instead of JSONPath, that same solution could look like this:

id: helloworld
name: "Hello, World!"
states:
- name: Hello
  type: inject
  start:
    kind: default
  end:
    kind: default
  stateDataFilter:
    dataOutputPath: '{ greeting: ("Hello, " + .name + "!") }'

We wouldn't even need the functionality of the Inject State to pull it off. In fact, we could do all of this and more using just stateDataFilters on any state, dramatically simplifying the code.
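To make the difference concrete, here is a minimal Python sketch of what each variant computes. It is only a model of the data flow, not how any runtime evaluates workflows; the function names `inject_then_filter` and `jq_style_filter` are hypothetical.

```python
import json


def inject_then_filter(state_data: dict) -> dict:
    """Model of the JSONPath workaround: build the result under a scratch
    key, then keep only that key."""
    state = dict(state_data)
    # Inject state: data.output.greeting = "Hello, {{ $.name }}!"
    state["output"] = {"greeting": f"Hello, {state['name']}!"}
    # stateDataFilter: dataOutputPath "$.output"
    return state["output"]


def jq_style_filter(state_data: dict) -> dict:
    """Model of the jq filter { greeting: ("Hello, " + .name + "!") }."""
    return {"greeting": "Hello, " + state_data["name"] + "!"}


print(json.dumps(jq_style_filter({"name": "Alan"})))
```

Both functions return the same object; the second simply has no intermediate `output` property to create and then strip away.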

Language Support

These limitations of JSONPath are well known, and have resulted in a fork of the JSONPath spec called JSONPath Plus, which gets >25% more weekly downloads than the original.

Unfortunately, JSONPath does not have great language support. JSONPath Plus has even less. For Go I found four different libraries for the former and none for the latter. Of the four libraries, not one produced the correct result in every test I tried over the course of an afternoon.

By comparison jq is relatively less popular amongst JavaScript developers, but seems to have implementations where it matters: languages that will be used to build implementations.

JSON Patch

In my earlier example, nesting the output was necessary to filter it later (due to the aforementioned limitations of JSONPath). But the reason I needed to filter it later is that my requirement was to replace the data, not enrich it.

The current specification doesn't have any good way of dealing with the issue of deleting fields, which has resulted in discussions like this one where bolting on another solution (JSON Patch) is being considered.

In an example from that discussion, input contains two properties that should be removed (vegetables and snacks):

{
  "fruits": [ "apple", "orange", "pear" ],
  "drinks": ["capirinha", "whisky", "vodka", "wine"],
  "snacks": ["french fries", "jelly beans"],
  "vegetables": [
    {
      "veggieName": "potato",
      "veggieLike": true
    },
    {
      "veggieName": "broccoli",
      "veggieLike": false
    }
  ]
}

The proposed JSON Patch solution is to include the following verbose snippet in the state definition:

stateDataTransform:
  dataInputPatch:
  - op: remove
    path: "/snacks"
  - op: remove
    path: "/vegetables"

In jq, this same problem can be tackled with this concise line in the output filter:

dataOutputPath: 'del(.vegetables) | del(.snacks)'

JSONPath is only capable of returning arrays and/or subsets of data. JSON Patch can build data, but it's a laborious process. jq isn't just for searching or querying data, it can also be used to construct, restructure, combine, or transform data.
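One behavioural difference between the two approaches is easy to miss: a JSON Patch `remove` fails when the target does not exist (RFC 6902), whereas jq's `del` on a missing key is a no-op. The Python sketch below models both for top-level keys only. The helper names are hypothetical, and JSON Pointer escaping and nested paths are deliberately left out.

```python
import copy


def apply_remove_ops(doc: dict, ops: list) -> dict:
    """Apply JSON Patch 'remove' operations that target top-level keys."""
    result = copy.deepcopy(doc)
    for op in ops:
        if op["op"] != "remove":
            raise ValueError(f"unsupported op: {op['op']}")
        if not op["path"].startswith("/"):
            raise ValueError(f"not a JSON Pointer: {op['path']}")
        key = op["path"][1:]  # assumption: no nesting, no ~0/~1 escapes
        # RFC 6902: the target location must exist or the patch fails.
        if key not in result:
            raise KeyError(f"path not found: {op['path']}")
        del result[key]
    return result


def jq_del(doc: dict, *keys: str) -> dict:
    """Model of del(.a) | del(.b): deleting a missing key changes nothing."""
    return {k: v for k, v in doc.items() if k not in keys}
```

For the input above, `apply_remove_ops(data, [{"op": "remove", "path": "/snacks"}, {"op": "remove", "path": "/vegetables"}])` and `jq_del(data, "vegetables", "snacks")` give the same result. They differ only when a key is absent from the input.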

Adoption

Why combine two solutions to get less functionality, especially when jq can not only do it all but also has an enormous user base and a wealth of help readily available? There are hundreds of answered Stack Overflow questions.

Summary

Compared to JSONPath, jq usually produces cleaner workflow definitions, has greater language support, and is much easier for users to figure out, all while being far more powerful. In fact, jq is so powerful that it would render the Inject State obsolete.

tsurdilo commented 3 years ago

@alankm thanks for the information! I looked at jq in the past. IIRC what made me not fully consider it is the lack of an official specification and the issue with that was that we cannot guarantee that the impls in different languages would behave the same (thus we cannot ensure the portability of our workflow language). Can you give us your opinion on this as you seem to know a lot about jq?

Another possible candidate I have been looking at is https://github.com/jmespath which does include an "official" specification - https://jmespath.org/specification.html. @alankm @ricardozanini, could you guys please take a look at that one as well and see if it would also help with not having to use 2 separate languages as shown in the jq examples on this post (which would be really useful to have). Thanks.

alankm commented 3 years ago

Would a specific release's manpages be a suitable stand-in for an official specification? The documentation here for example seems very comprehensive. And it's version controlled, so presumably it's stable.

As for portability, jq seems like it can be compiled as a commandline tool on any platform if you like forking. It can also be compiled as a C shared object that other languages can create bindings for (both linked Python implementations do just that).

I haven't got much experience with jmespath, but I've found two interesting snippets from others who use it:

Amazon AWS CLI uses jmespath, but their own documentation says:

For more advanced filtering that you might not be able to do with --query, you can consider jq, a command line JSON processor.

And here's a discussion on another project explaining why they had to abandon jmespath.

tsurdilo commented 3 years ago

@alankm i do like the jq "streaming" capabilities, would be really useful for dealing with large data. The implementations, for example https://github.com/arakelian/java-jq however do scare me as they seem to require use of JNI. I am still not sure if there are impls in Go/Java/Python that would be easily installed / are easily available on cloud/container platforms - @ricardozanini wdyt?

I think the best thing would be to add this as an agenda item for one of our next weekly meetings and show this to the team so we can start a discussion and get everyone's opinions. Would you be willing to do a quick preso on this? Our meetings are starting back up next Monday (11th) and are every Monday - https://docs.google.com/document/d/1xwcsWQmMiRN24a7o7oy9MstzMroAup31oOkM5Dru1jQ/edit#heading=h.g2rizfze8av2 so pick one and let us know so we can make sure you are on the agenda for that meeting.

@manuelstein ^^ (we should also add this to the roadmap if the team feels it's something we should consider for the next release)

Main thing is that since we do not allow options for runtimes to pick an expression language, the one we enforce within the language should be "solid" so we can use it long term, as well as maintain our workflow language portability.

falko commented 3 years ago

If we are looking for a standardized alternative to JSONPath we should consider DMN's FEEL.

pros:

Designed to be business-friendly
Technology-independent
- Not tied to any programming language
Open Standard
- Proven in practice
- Supported by several vendors
- Multiple Open Source engines available to study and learn from
- Books, Trainings, Tutorials available
- People might already know it from DMN
Technology Compatibility Kit (TCK) available
- Full test suite for the entire feature-set of the expression language
Also derived from XPath (like JSONPath)
Oracle (who invented it) also uses it on JSON

cons:

Embedded in DMN specification
context-sensitive language
- but context-free implementation possible with pre-processor or by subsetting

tsurdilo commented 3 years ago

@falko thank you for the info. We can definitely add FEEL to our discussions. When we looked at this previously, per your suggestions, we did like FEEL a lot; however, we felt that it was too tied to the DMN specification and that its future development will be focused on DMN needs only. Thus we found it hard to justify using it. I think this would be much easier if FEEL were standalone.

ricardozanini commented 3 years ago

+1 to using jq instead of JSONPath and JSON Patch. The only thing I'm worried about is standardization, as @tsurdilo mentioned in his first comment.

That's the reason we did not consider jq in the past. I'm a big fan of the project and I do believe that it will make things easier for runtimes to implement the spec and for users to actually adopt it.

@tsurdilo jq was built to work on the command line and is written in C. That's why implementations might be using it directly instead of reimplementing the whole jq project. Since there is no standard, it might not be clear how to port it to other languages.

Java runtimes are now targeting native compilation as part of their cloud strategies, and hence adopting Oracle's GraalVM, so that could be a problem. We need to investigate further.

I'm sorry, I don't have easy access to the internet where I am right now, but I'll be back next week and we can work this through. +1 to bringing this discussion to our next meeting.

@alankm many thanks for your detailed analysis. Much appreciated. It's a good idea to bring jq to the spec. Let's see how we can handle this.

Ah! I see that you guys forked the go SDK, hopefully we can work together to enhance the project!

tsurdilo commented 3 years ago

@ricardozanini yes, if jq libs for java cannot run on native that is a big issue for adoption imo. can we look into this please?

ricardozanini commented 3 years ago

@ricardozanini yes, if jq libs for java cannot run on native that is a big issue for adoption imo. can we look into this please?

Sure, I'll take a look later today.

ricardozanini commented 3 years ago

Hi, I've managed to run a demo with jackson-jq and Quarkus on native image mode. You can see the results here: https://github.com/ricardozanini/jq-native-poc

I see no problem running this simple demo, but we should take a look at complex scenarios since the Java SDK would depend on this library as well, right? I can see use cases where the SDK tries to parse or validate the expressions defined in the workflow.

My only concern can be summarized by this picture: https://xkcd.com/2347/

I don't think it's a huge thing, but something to keep in mind.

jensg-st commented 3 years ago

@ricardozanini @tsurdilo @alankm

I have done a couple of tests with Jmespath and jq for you to compare.

Renaming keys

Renaming keys works with both solutions, although for Jmespath you need to repeat all attributes the object should keep. Any attribute left out of the expression won't be added to the output. If an object lacks one of the listed attributes, that attribute is set to null in the output.

Data:

[
  {
    "key": 1,
    "name": "myname1",
    "hello": "world"
  },
  {
    "key": 2,
    "name": "myname2"
  }
]

JMESPATH: [].{id: key, name: name, hello: hello}

[
  {
    "id": 1,
    "name": "myname1",
    "hello": "world"
  },
  {
    "id": 2,
    "name": "myname2",
    "hello": null
  }
]

JQ: [.[] | .["id"] = .key | del(.key)]

[
  {
    "name": "myname1",
    "hello": "world",
    "id": 1
  },
  {
    "name": "myname2",
    "id": 2
  }
]
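The difference between the two results can be modelled in a few lines of Python. This is only a sketch of the semantics shown above, with hypothetical function names; it does not call either library.

```python
def jmespath_style_projection(items: list) -> list:
    """Model of [].{id: key, name: name, hello: hello}: every listed field
    appears in every result, and absent ones become None (null)."""
    return [
        {"id": item.get("key"), "name": item.get("name"), "hello": item.get("hello")}
        for item in items
    ]


def jq_style_rename(items: list) -> list:
    """Model of [.[] | .["id"] = .key | del(.key)]: add id, drop key, and
    leave every other field untouched."""
    renamed_items = []
    for item in items:
        renamed = dict(item)
        renamed["id"] = renamed.get("key")
        renamed.pop("key", None)
        renamed_items.append(renamed)
    return renamed_items
```

The projection has to name every field it wants to keep, so it grows with the object. The rename only mentions the key being changed.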

Deleting keys

Deleting fields is not easily possible with Jmespath. You can use the same method we used for the renaming and leave the field out. This might be ok for small JSON objects but can be a nightmare for larger, complex objects.

Data:

  {
    "id": 1,
    "name": "myname1",
    "hello": "world",
    "mykey1": "val1",
    "mykey2": "val2",
    "mykey3": "val3",
    "mykey4": "val4",
    "mykey5": "val5"
  }

JMESPATH: { "id": id, "name": name, "mykey1": mykey1, "mykey2": mykey2, "mykey3": mykey3, "mykey4": mykey4, "mykey5": mykey5 }

JQ: del(.hello)

Adding keys based on an if clause

Enriching data seems to be a common use case. At the moment Jmespath does not have proper if/then/else statements. There seems to be something close to an if statement if the JSON is an array. Not having that feature seems to be problematic for data conditions as well.

Data:

{
  "name": "harry",
  "age": 27
}

The filtering of Jmespath only works on arrays. To make the filter work, the expression needs to convert it to an array first. This would work for data conditions but not for data enrichment.

JMESPATH: { temp: [ @ ] } | temp[?age>`18`]

JQ: if .age > 18 then .adult=true else .adult=false end (enrichment)

JQ: if .age > 18 then true else empty end (data condition)

{
  "name": "harry",
  "age": 27,
  "adult": true
}
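The two jq expressions behave differently in one important way: the enrichment always returns an object, while the data condition returns either `true` or nothing at all (`empty`). A hedged Python sketch of that distinction, using a generator to stand in for "zero or one outputs" (the function names are hypothetical):

```python
from typing import Iterator


def enrich(person: dict) -> dict:
    """Model of: if .age > 18 then .adult=true else .adult=false end"""
    return {**person, "adult": person["age"] > 18}


def data_condition(person: dict) -> Iterator[bool]:
    """Model of: if .age > 18 then true else empty end"""
    if person["age"] > 18:
        yield True
    # Otherwise yield nothing, mirroring jq's `empty`.
```

A caller evaluating the data condition would treat "no output" as "condition not met", for example with `any(data_condition(person))`.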

Complex Scenarios

I thought about a more complex scenario where I would get an exchange rate from one action and use those values for calculations. I could use jq to calculate the price in the state data filter.

Data:

{
  "currency": "USD",
  "exchange":  0.75,
  "items": [{
    "name": "Item1",
    "price": 11.0
  },
  {
    "name": "Item2",
    "price": 9.0
  }]
}

JMESPATH: impossible

JQ: .exchange as $e | [.items[] | . + { "total": ($e * .price) } ]

[
  {
    "name": "Item1",
    "price": 11,
    "total": 8.25
  },
  {
    "name": "Item2",
    "price": 9,
    "total": 6.75
  }
]
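What makes this case hard for a pure query language is that each item needs a value (`exchange`) from its parent object. jq handles that by binding it to the variable `$e` first. The same computation as a short Python sketch, with a hypothetical function name:

```python
def add_totals(order: dict) -> list:
    """Model of: .exchange as $e | [.items[] | . + { "total": ($e * .price) }]"""
    rate = order["exchange"]  # .exchange as $e
    # Each item keeps its fields and gains a converted total.
    return [{**item, "total": rate * item["price"]} for item in order["items"]]
```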

Additional information

Jmespath calls itself a specification, whereas JQ is basically an implementation with a man page; however, the Jmespath specification has no well-known governing body.
Jmespath provides libraries for all major languages on their webpage. JQ has binaries for all important operating systems. There are implementations of JQ in different languages, but they are not maintained by the JQ project.
There are far more search results, articles etc. about JQ than Jmespath. Google search queries for JQ are almost 10x more frequent than for Jmespath.

I think there will be issues with Jmespath for this use case. In particular, renaming keys and using it for data conditions seem to be harder and more error-prone with Jmespath. In general Jmespath is not very strong at modifying data, which might be needed a lot to make APIs compatible between actions.

ricardozanini commented 3 years ago

I agree @jensg-st. JQ is by far the most popular and easiest approach. My only concern is the JQ "implementations" in different languages not maintained by the JQ project. I only played around with jackson-jq and I tested all your examples with my implementation in native mode. It worked! That's a good sign.

I believe that we can always fork the jackson-jq project or help their community keep up to date with JQ. I don't see any problems on that front. I'm considering creating a Quarkus extension for jackson-jq based on the demo I just created. That will help a lot of people using our Java SDK for this runtime who also need to run in native mode. In serverless environments, this is a big deal.

+1 to move forward with JQ.

ricardozanini commented 3 years ago

Just a small note: jackson-jq is not an implementation of JQ, but a wrapper for the actual implementation in C.

tsurdilo commented 3 years ago

depends on https://github.com/serverlessworkflow/specification/issues/230

manuelstein commented 3 years ago

@ricardozanini @alankm @falko @tsurdilo regarding yesterday's discussion on jq, I've found this proof that jq is in fact Turing complete. I know there are some concerns about abuse of Turing-complete interpreters. eBPF, for instance, supposedly uses a verifier that ensures any script terminates. But I wonder about the implications here. One important aspect is that any run of user-provided expressions requires sandboxing, and you probably want to prevent a powerful script from blowing up (or DoSing) your workflow engine, but I don't see any risk of the jq execution, e.g., injecting code or exploiting an overflow in user space, as it can only operate on the data it is being fed. @falko are there any other concerns regarding user script isolation with Turing-complete languages? It's only the cost of running it and the risk of interference with providing the service, right?
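One simple way to bound the cost of a non-terminating expression is to evaluate it in a separate process with a time budget. The Python sketch below is an assumption about how a runtime might do this, not anything the spec prescribes. `evaluate_with_limit` is a hypothetical helper that covers only wall-clock time (not memory) and expects the evaluator to print a single JSON result.

```python
import json
import subprocess


def evaluate_with_limit(command: list, data, timeout_s: float = 1.0):
    """Run an external expression evaluator on `data` via stdin and kill it
    if it exceeds the time budget. Returns None on timeout."""
    try:
        proc = subprocess.run(
            command,
            input=json.dumps(data),
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None  # treat a runaway expression as a failed evaluation
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr.strip())
    return json.loads(proc.stdout)


# Example call, assuming a jq binary is available on PATH:
# evaluate_with_limit(["jq", "-c", ".name"], {"name": "Alan"})
```

Running the evaluator out of process also keeps a crash in the expression engine from taking the workflow engine down with it, at the price of process start-up overhead per evaluation.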
