schibsted / jslt

JSON query and transformation language
Apache License 2.0

JSLT without Jackson #123

Open larsga opened 4 years ago

larsga commented 4 years ago

The current JSLT implementation is so tightly bound to Jackson that every value is a Jackson JsonNode, but there are users who want to use JSLT without Jackson. There is an experimental branch with a custom JSLT VM which could be developed further to provide a Jackson-less JSLT.

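There are three main challenges here:

• Functions: these have to be reimplemented. Or one might have a core layer based on pure Java types, which the VM and the existing implementation both translate to.

• Input representation: how is the JSON data passed to JSLT in this case?

• Output representation: how is the JSON result returned?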

All of these problems can be solved, but it would be good to get some input on how many people out there need a JSLT with zero runtime dependencies, and on what requirements these people have for the input/output representation.

Input wanted!

mrange commented 4 years ago

Hi Lars.

We (Schibsted Pulse) are one of the users that would value a JSLT with no dependencies. Without boring everyone with the details, I'll just mention that we run into conflicts between Spark, which depends on Jackson, and JSLT, which depends on a different version of Jackson.

Below are some thoughts on the problem, but I don't claim they are well thought out.

As JsonNode is exposed in the API, using shadowJar to pull Jackson into the jar and rename the namespace doesn't seem feasible.

An idea I have been toying with is that JSLT provides its own writable JSON DOM (perhaps functionally identical to Jackson in all ways that matter to JSLT?).

As a convenience to users, there could be a JSLT.Jackson module that most users depend on, which implements the JSON => JSLT JSON DOM => JSON transformations.

That way the core JSLT could have no dependencies, and users like us with specific runtime demands would have the possibility to implement JSON => JSLT JSON DOM => JSON on a specific version of Jackson, or using a different parser altogether.

It is important that the API is still perceived as convenient to use, so that our perhaps specialized needs don't reduce usability for the normal user. As we might be the odd ones out here, I would be OK with a slightly less convenient API for us, as long as we can exclude Jackson and provide our own parsing.

Another option is to provide several versions of JSLT built against different Jackson versions, but I am not a fan of that, having had to provide several versions of an internal library for different versions of Scala.

Regards,

Mårten

jarno-r commented 4 years ago

I've made an experiment replacing Jackson with a simple JSON library: https://github.com/jarno-r/jslt/tree/json-lib/src/main/java/com/schibsted/spt/data/jslt/json The fork builds, all of the tests pass and the CLI works.

The main JSLT code is mostly unchanged, except for replacing all Jackson classes with their equivalents. I've replaced almost all uses of asXxxx() methods with the equivalent xxxxValue(), leaving only a couple, where the behaviour differs.

The JSON library is a pretty straightforward replacement for the corresponding Jackson classes, because I just wanted to see how many changes to JSLT are needed. I don't think it's very well designed at the moment.

There's no parser. It's currently using Jackson for parsing. I don't see the lack of a parser as a big problem, since a JSON parser is not hard to implement.

I like the idea of a VM for JSLT, but I think the representation of input & output JSON to JSLT is orthogonal to the implementation, so I would focus on it from an API design point of view.

Here are some thoughts about how I was thinking of improving the JSON lib:

  1. Make objects & arrays immutable (other classes already are). JSLT is pure functional (or very close), so this makes a lot of sense.
  2. Have only a 'double' based number class. JS numbers are 64-bit floats. Int fits in long and in double, so int is not needed. All integers between -2**53 and +2**53 are exactly representable in a double, which is plenty. Also, having integers and floats handled separately makes for weird semantics, since JSON doesn't really distinguish number types.
  3. Make all of the classes be interfaces or abstract classes. This allows multiple implementations for performance optimizations and might help retain original representation (i.e. escape sequences for strings & exact representation of numbers.), if that's desirable. (E.g. a string could be represented as a pointer & length to an input byte[] buffer containing a UTF-8 encoded string, which might never need to be converted to a String. )
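
The 2**53 bound in point 2 can be checked directly. The following is a minimal sketch (the class and method names are illustrative only, not part of JSLT or the fork) that tests whether a given long has an exact double representation:

```java
import java.math.BigDecimal;

final class DoubleIntegerRange {
    // Up to this magnitude, every integer has an exact double representation.
    static final long MAX_CONTIGUOUS = 1L << 53; // 9007199254740992

    // True if converting the long to double loses no information.
    // new BigDecimal(double) yields the exact binary value, so the comparison is exact.
    static boolean isExactAsDouble(long value) {
        return new BigDecimal((double) value).compareTo(BigDecimal.valueOf(value)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(isExactAsDouble(MAX_CONTIGUOUS));     // at the bound
        System.out.println(isExactAsDouble(MAX_CONTIGUOUS + 1)); // first integer that gets rounded
    }
}
```

Some larger values (powers of two, for instance) are still exact, but beyond the bound there are gaps, so a double-only number class would silently round some 64-bit IDs.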

Jackson allows NumberNodes that are BigDecimal. This is potentially useful with sums of money, but JSLT doesn't currently really support that. Not sure if that needs to be considered.

Jackson also supports other formats, such as YAML. Should it be an objective for JSLT too?

jarno-r commented 4 years ago

There are two alternatives to having a JSON library within JSLT:

  1. Use 'native' types and values, like null, Boolean, Number, String, Map<String, Object> & List.
  2. Use a wrapper interface that exposes all of the methods needed to manipulate all kinds of JSON values, without casting to subtypes. E.g. the wrapper would have methods like isNumber(), isObject(), doubleValue(), stringValue(), get(int), get(String), etc (kind of like JsonNode already does). A factory for constructing new values would also need to be provided.
I don't really like the first option, because it would mean passing Objects around and doing a lot of instanceof checks. Also not sure how null would work.

The second option would be nice in that it wouldn't require any copying between Jackson and another format. One potential issue is that it's pretty opaque to JSLT.
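
To make the wrapper option more concrete, here is a rough sketch of what such an interface and factory might look like. All names here (JsonValue, JsonFactory, SimpleFactory, doubled) are hypothetical and are not taken from JSLT or from the fork; the in-memory implementation only stands in for a Jackson-backed one.

```java
import java.util.List;
import java.util.Map;

final class WrapperSketch {
    /** Hypothetical wrapper: a single interface covering every kind of JSON value. */
    interface JsonValue {
        default boolean isNumber() { return false; }
        default boolean isString() { return false; }
        default boolean isArray()  { return false; }
        default boolean isObject() { return false; }
        default double doubleValue() { throw new IllegalStateException("not a number"); }
        default String stringValue() { throw new IllegalStateException("not a string"); }
        default JsonValue get(int index)  { return null; } // null when absent or not an array
        default JsonValue get(String key) { return null; } // null when absent or not an object
    }

    /** Hypothetical factory so the engine can build results without naming a concrete library. */
    interface JsonFactory {
        JsonValue number(double value);
        JsonValue string(String value);
        JsonValue array(List<JsonValue> items);
        JsonValue object(Map<String, JsonValue> members);
    }

    /** Trivial in-memory implementation; a Jackson-backed one would delegate to JsonNode instead. */
    static final class SimpleFactory implements JsonFactory {
        public JsonValue number(double value) {
            return new JsonValue() {
                @Override public boolean isNumber() { return true; }
                @Override public double doubleValue() { return value; }
            };
        }
        public JsonValue string(String value) {
            return new JsonValue() {
                @Override public boolean isString() { return true; }
                @Override public String stringValue() { return value; }
            };
        }
        public JsonValue array(List<JsonValue> items) {
            return new JsonValue() {
                @Override public boolean isArray() { return true; }
                @Override public JsonValue get(int index) {
                    return index >= 0 && index < items.size() ? items.get(index) : null;
                }
            };
        }
        public JsonValue object(Map<String, JsonValue> members) {
            return new JsonValue() {
                @Override public boolean isObject() { return true; }
                @Override public JsonValue get(String key) { return members.get(key); }
            };
        }
    }

    /** Engine-side code sees only the wrapper and the factory, never a concrete class. */
    static JsonValue doubled(JsonValue input, String key, JsonFactory factory) {
        JsonValue field = input.get(key);
        if (field == null || !field.isNumber()) return null;
        return factory.number(field.doubleValue() * 2);
    }
}
```

The opacity mentioned above shows up here too: the engine cannot tell how a value is stored, so it cannot take shortcuts that depend on the backing representation.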

    ecerulm commented 4 years ago

    > There is an experimental branch with a custom JSLT VM which could be developed further to provide a Jackson-less JSLT.

    What branch is that?

    > Functions: these have to be reimplemented. Or one might have a core layer based on pure Java types, which the VM and the existing implementation both translate to.

    Uhmm, I think it's OK to use Jackson internally, as long as you build a shadowJar/FatJar/UberJar where the Jackson classes are relocated. The real "problem" is caused by exposing Jackson classes in the JSLT API, i.e. apply(JsonNode), etc.

    If you think Jackson is too heavyweight to embed, then I think it's possible to write a separate JavaCC grammar to parse pure JSON and use the resulting AST (or a layer on top of that) as the internal JSON library, so that functions etc. are implemented in terms of that internal API and not Jackson's.

    > Input representation: how is the JSON data passed to JSLT in this case?

    java.lang.String. The API user can serialize to / deserialize from String with their favorite JSON library, and in many cases the original input is likely already a String.

    > Output representation: how is the JSON result returned?

    java.lang.String also. The API user can deserialize the output JSON string with their favorite JSON library.

    wdonne commented 4 years ago

    When I work with JSON I always use JSON-P. This is only a set of interfaces. It works with a Service Provider Interface. The implementation I use is Glassfish. However, you can write your own by implementing javax.json.spi.JsonProvider. Currently, in order to hide Jackson from the rest of my code, I have wrapped JSLT in this: https://www.javadoc.io/static/net.pincette/pincette-json/1.3/net/pincette/json/Jslt.html.

    larsga commented 4 years ago

    JSLT without Jackson

    A JSLT implementation that does not depend on Jackson may be useful for several reasons:

    • Users may need to avoid clashes with other dependencies that need different versions of Jackson.

    • Potentially better performance by using a more efficient JSON representation.

    • Support processing binary formats such as Protobuf or Avro for higher performance.

    However, there is also a potential downside, in that we may be forced to implement JSLT more than once. We want to avoid code duplication as far as we can.

    There are a number of different approaches that could be taken, with different trade-offs.

    Define JSLT JsonValue interfaces

    We could define a set of Java interfaces to represent JSON values, then rewrite the JSLT implementation in terms of those. There could then be an implementation wrapping Jackson nodes in this interface, and another native implementation with its own parser.

    The downside is that performance for Jackson users would probably suffer a little. Exactly how much is not clear. Most likely it would not be possible to get any of the potential performance benefits with this approach.

    (I see @jarno-r has tried implementing this. It would be very interesting to get some performance measurements to help us see what the impact is.)

    Drop-in Jackson replacement

    Another possibility would be to implement the Jackson JsonNode interface in a separate artifact that provides a parser and the methods and classes that JSLT needs.

    The only downside to this would be that we probably could not get the potential performance benefits.

    Also, it probably would not help those who have dependencies that require a different Jackson version from the one JSLT requires.

    Packed JSON representation

    We could make a completely different type of JSON representation, consisting mainly of ints in arrays. Experiments have been made with this, showing potential for performance improvements, but without showing any immediate improvement.
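
The thread gives no details of the layout used in those experiments. Purely as an illustration of the general idea, a packed representation could store each value as a (tag, payload) pair of ints on a flat "tape", with strings kept in a side table. Everything below (names, tags, layout) is an assumption made for this sketch, not the experimental code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PackedJsonSketch {
    static final int NULL = 0, INT = 1, STRING = 2, ARRAY = 3;

    private final List<String> strings = new ArrayList<>(); // side table for string payloads
    private int[] tape = new int[16];                       // (tag, payload) pairs
    private int size = 0;

    private void push(int v) {
        if (size == tape.length) tape = Arrays.copyOf(tape, size * 2);
        tape[size++] = v;
    }

    void addNull()           { push(NULL); push(0); }
    void addInt(int v)       { push(INT); push(v); }
    void addString(String s) { push(STRING); push(strings.size()); strings.add(s); }
    /** Starts an array; the caller must then add exactly `length` elements. */
    void startArray(int length) { push(ARRAY); push(length); }

    /** Number of tape slots occupied by the value starting at pos. */
    int span(int pos) {
        if (tape[pos] != ARRAY) return 2;
        int total = 2;
        for (int i = 0, n = tape[pos + 1]; i < n; i++) total += span(pos + total);
        return total;
    }

    /** Serializes the value starting at pos (string escaping is omitted in this sketch). */
    String toJson(int pos) {
        switch (tape[pos]) {
            case NULL:   return "null";
            case INT:    return Integer.toString(tape[pos + 1]);
            case STRING: return "\"" + strings.get(tape[pos + 1]) + "\"";
            default: {
                StringBuilder sb = new StringBuilder("[");
                int child = pos + 2;
                for (int i = 0; i < tape[pos + 1]; i++) {
                    if (i > 0) sb.append(',');
                    sb.append(toJson(child));
                    child += span(child);
                }
                return sb.append(']').toString();
            }
        }
    }

    public static void main(String[] args) {
        PackedJsonSketch p = new PackedJsonSketch();
        p.startArray(3); p.addInt(1); p.addString("a");
        p.startArray(2); p.addInt(2); p.addNull();
        System.out.println(p.toJson(0));
    }
}
```

The appeal of a layout like this is that a whole document lives in one int[] with no per-node objects; the cost is that every function has to be written against offsets rather than node objects, which is where the second-implementation concern comes from.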

    The downside here would be that we end up with two JSLT implementations: one with Jackson and one without. It's possible that many of the function implementations could be reused, however.

    Generated Java code with adapters

    We could make a JSLT implementation that generates either Java source code or JVM bytecode representing the JSLT logic. This has already been tried (JVM bytecode) and got a 20% speedup on the first attempt.

    If we let the core logic be performed on values that are String, boolean, long, and so on, much of the JSLT implementation would be representation-independent.

    Different representations of objects and arrays could then be catered to using adapters that generate different code for touching them, depending on what representation the user wants to use.
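
As a rough illustration of the adapter idea, the sketch below uses a runtime adapter interface rather than actual code generation, so it only approximates what is described above. The core logic works on plain long values, and the adapter decides how a field is read from a given object representation. ObjectAdapter, sumAB and both adapters are hypothetical names invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

final class AdapterSketch {
    /** Hypothetical adapter: how to read a field from an object in representation T. */
    interface ObjectAdapter<T> {
        Object field(T object, String key); // a String, Long, Boolean, ... or null if absent
    }

    /** Hand-written stand-in for what generated code for `.a + .b` might reduce to (integers only). */
    static <T> long sumAB(T input, ObjectAdapter<T> adapter) {
        long a = (Long) adapter.field(input, "a"); // throws if the field is absent or not a Long
        long b = (Long) adapter.field(input, "b");
        return a + b;
    }

    /** Adapter for objects held as java.util.Map. */
    static final ObjectAdapter<Map<String, Object>> MAP_ADAPTER = Map::get;

    /** Adapter for objects held as a flat [key, value, key, value, ...] array. */
    static final ObjectAdapter<Object[]> FLAT_ADAPTER = (pairs, key) -> {
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            if (key.equals(pairs[i])) return pairs[i + 1];
        }
        return null;
    };

    public static void main(String[] args) {
        Map<String, Object> asMap = new HashMap<>();
        asMap.put("a", 1L);
        asMap.put("b", 2L);
        Object[] asFlat = {"a", 1L, "b", 2L};
        System.out.println(sumAB(asMap, MAP_ADAPTER));
        System.out.println(sumAB(asFlat, FLAT_ADAPTER));
    }
}
```

In a real code generator the adapter would presumably emit the field-access code at compile time instead of being called at runtime, which would avoid the interface dispatch shown here.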

    Note that this might also be used to support protobuf/avro input.

    There are a few complications that might make this more difficult than it sounds, but it may be worth at least exploring in greater detail, because if it does work there don't seem to be any downsides.

    Tentative conclusion

    It's too early to pick an approach, but it seems clear that both defining JSLT's own interfaces for JSON and code generation should be explored. Once that's been done we may be in a better position to make a choice.

    larsga commented 4 years ago

    @jarno-r It would be interesting if you could do a benchmark comparison of your code with the existing JSLT code. Especially if you could push your code to a forked version of the repo for investigation. Are you interested in doing that?

    larsga commented 3 years ago

    I wrote my own JSON interfaces and ported JSLT to run on top of that. I did a benchmark where I ran Jackson JSON objects wrapped for this new interface through filters and transforms, and compared it with processing Jackson objects directly. That approach seemed to result in a 7-8% slowdown, which is not bad at all.

    I should note that in this case the output is in the internal JSON representation, and may need to be translated. Or we may need to produce the output as Jackson objects wrapped in this new representation, so that retrieving the Jackson data is trivial.

    Users who want to transform serialized JSON input into serialized JSON output probably don't care about Jackson at all. So I should probably do another benchmark for that, because it would involve parsing, transform, plus serialization. It's entirely possible that we could get better performance for this use case.

    It would be interesting to hear from users who want to use JSLT with Jackson. What are your use cases? Is it important to get Jackson objects as the output? This is important for me to understand, so that I don't make a new JSLT version that is unusable for you.

    larsga commented 3 years ago

    The work done so far is available on branch own-json-interfaces

    larsga commented 3 years ago

    93% of the tests now pass. A good part of what remains is parsing, which is easily fixed.

    Transforms with our own JSON representation are now marginally slower than with Jackson objects. (Not sure why. Will work on improving this.)

    Parsing JSON now seems to have the same performance as Jackson.

    biochimia commented 3 years ago

    I have some concerns about the approach taken in https://github.com/schibsted/jslt/tree/own-json-interfaces to drop the Jackson dependency.

    Rolling a custom JSON parser adds significant complexity to the project and also adds risk to the effort to detach JSLT from Jackson. Do we want to maintain a JSON parser in addition to the JSLT parser and language implementation?

    I see JSLT as a language to define transformations on a JSON-like object model. On that level, the language doesn't rely on Jackson or JSON. It's the current runtime implementation that picked Jackson and JSON.

    Could we move, instead, in a direction where we take the current runtime implementation and split the language from the runtime such that there is a Jackson/JSON-free core language, and a jslt-jackson as its first implementation? Ideally, the split could allow language and implementation to evolve somewhat separately.

    biochimia commented 3 years ago

    One way we could define the success criteria for a JSLT without Jackson could be that we're able to usefully maintain both the existing jslt-jackson interface and the new jslt-own-json-interfaces sharing a common engine, but maybe not being able to share the implementation of custom functions: users need to pick a specific runtime to extend.

    larsga commented 3 years ago

    The motivation for dropping Jackson is that some users have dependencies (such as Spark) that require versions of Jackson that are incompatible with the version we have. That means we can't have Jackson among the dependencies at all, so using the Jackson JSON parser isn't going to work.

    The good part is that maintaining a JSON parser is not much effort. JSON is a very small language, so parsing JSON is hugely easier than parsing JSLT. In fact, the JSLT parser contains a JSON parser. The most difficult part is (believe it or not) decimal numbers.
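
To give a feel for why numbers are the fiddly part, here is a minimal sketch of a number parser that follows the RFC 8259 grammar (optional minus, an integer part with no leading zeros, optional fraction, optional exponent). It is not the parser from the JSLT branch; the class name and the choice of BigDecimal as the result type are assumptions made for this example.

```java
import java.math.BigDecimal;

final class JsonNumberSketch {
    /** Parses text as a JSON number per the RFC 8259 grammar, or throws NumberFormatException. */
    static BigDecimal parse(String s) {
        int i = 0, n = s.length();
        if (i < n && s.charAt(i) == '-') i++;
        // Integer part: a single 0, or a nonzero digit followed by any digits.
        if (i < n && s.charAt(i) == '0') {
            i++;
        } else if (i < n && s.charAt(i) >= '1' && s.charAt(i) <= '9') {
            i = skipDigits(s, i);
        } else {
            throw new NumberFormatException("bad integer part: " + s);
        }
        // Optional fraction: '.' followed by at least one digit.
        if (i < n && s.charAt(i) == '.') {
            int start = ++i;
            i = skipDigits(s, i);
            if (i == start) throw new NumberFormatException("digit required after '.': " + s);
        }
        // Optional exponent: e or E, optional sign, at least one digit.
        if (i < n && (s.charAt(i) == 'e' || s.charAt(i) == 'E')) {
            i++;
            if (i < n && (s.charAt(i) == '+' || s.charAt(i) == '-')) i++;
            int start = i;
            i = skipDigits(s, i);
            if (i == start) throw new NumberFormatException("digit required in exponent: " + s);
        }
        if (i != n) throw new NumberFormatException("trailing characters: " + s);
        return new BigDecimal(s); // the validated text is also valid BigDecimal syntax
    }

    private static int skipDigits(String s, int i) {
        while (i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '9') i++;
        return i;
    }

    public static void main(String[] args) {
        System.out.println(parse("-12.5e2"));
        System.out.println(parse("0.001"));
    }
}
```

The validation is the easy half. Deciding whether to keep the value as a long, a double or a decimal, and doing that conversion quickly and without losing precision, is likely where most of the effort goes.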

    > On that level, the language doesn't rely on Jackson or JSON. It's the current runtime implementation that picked Jackson and JSON.

    Absolutely true.

    > Could we move, instead, in a direction where we take the current runtime implementation and split the language from the runtime such that there is a Jackson/JSON-free core language, and a jslt-jackson as its first implementation?

    Initially I tried thinking of ways to do that, but failed to come up with anything. The problem is that when the entire implementation is based on JsonNode objects as the representation of JSON values absolutely all of the code depends directly on Jackson.

    I wrote a longer analysis that you may want to read.

    I agree there is some cost to maintaining a separate JSON parser, but now that I've actually written the parser I find the cost is lower than I feared it might be. Performance tuning is the main cost, but the plus side is we can now optimize specifically for the use cases we have without worrying about lots of use cases Jackson must meet that we don't need to.

    biochimia commented 3 years ago

    I understand and can sympathise with some of the motivations for moving away from Jackson. We regularly have to deal with dependency hell across a few of our projects with conflicting requirements for Jackson (looking at you, Spark, Finatra, Kafka, ... you know the bunch 😉). In terms of dependencies, the good part, I think, is that JSLT builds on top of a relatively stable core of Jackson, so we've been able to enforce different Jackson versions so long as we keep all the Jackson libraries at a compatible version.

    I also recognise that JSON is a relatively small language, that is also somewhat embedded in JSLT itself.

    The main concern I have is that the JSLT code we run is essentially code we maintain and control, while the JSON data we run through it is essentially untrusted external input. There is some value in using a JSON parser that has been hardened by time and is maintained on its own.

    I also have some concerns over the upgrade path of a one-off switch away from Jackson, as it will currently imply changes to some of our core libraries that are shared across a few projects. This will require that we dedicate time to carrying out the proposed update.

    biochimia commented 3 years ago

    Besides concerns that may be more operational than development related, I think these two stated goals require that we ponder where we're going with this effort:

    • Potentially better performance by using a more efficient JSON representation.

    • Support processing binary formats such as Protobuf or Avro for higher performance.

    Performance

    With the concern for performance of JSON parsing/representation, rolling our own parser means the project is committing to maintain the most efficient parser/representation (for the use cases of the language). It's one thing to have encouraging numbers from initial experiments, but it's a different one to commit to developing and maintaining the edge.

    If we are to take this approach, it would be good to state what constraints make it possible to develop higher-performing JSON handling within JSLT than can be maintained externally. This is not clear to me, also because I have not yet taken the time to take a closer look at the approach you took in your branch.

    Supporting other formats

    If the goal is to support alternate binary formats, I'm not sure that moving away from established libraries gets us closer to it. Jackson, for instance, has support for different binary formats via https://github.com/FasterXML/jackson-dataformats-binary, and some other third-party libraries.

    From our side, we have some experience with running JSLT code on Avro input data in Kafka. I'll admit that our current approach is not the most efficient one. After experimenting with different Avro libraries we ended up not using Jackson, which means we pay a penalty from an Avro-to-JSON serialization followed by JSON parsing. (The main holdup to using Jackson directly was the lack of integration with Confluent's Schema Registry, which may be possible to address as an issue).

    biochimia commented 3 years ago

    Would an approach like that taken in https://github.com/jimblackler/jsonschemafriend#format be feasible?

    It seems that, in that project, they define their API in terms of standard Java interfaces and types, and the user is then responsible for plugging in the JSON parser and bridging the two:

    • java.util.Map<String, Object>
    • java.util.List<Object>
    • java.lang.Number
    • java.lang.String
    • java.lang.Boolean
    • null
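
For illustration, this is roughly what code written against those standard types tends to look like. The sketch below serializes such a tree back to JSON text; the class name is made up for this example, and string escaping is deliberately minimal.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class NativeTypesSketch {
    /** Serializes a tree of Map/List/Number/String/Boolean/null to JSON text. */
    static String toJson(Object v) {
        if (v == null) return "null";
        // Note: NaN and infinite doubles would produce invalid JSON here.
        if (v instanceof Boolean || v instanceof Number) return v.toString();
        if (v instanceof String) return quote((String) v);
        if (v instanceof List) {
            return ((List<?>) v).stream()
                .map(NativeTypesSketch::toJson)
                .collect(Collectors.joining(",", "[", "]"));
        }
        if (v instanceof Map) {
            return ((Map<?, ?>) v).entrySet().stream()
                .map(e -> quote(String.valueOf(e.getKey())) + ":" + toJson(e.getValue()))
                .collect(Collectors.joining(",", "{", "}"));
        }
        throw new IllegalArgumentException("not a JSON value: " + v.getClass());
    }

    // Escapes only backslash and double quote; control characters are not handled in this sketch.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(toJson(Arrays.asList(1, "a", null, true)));
    }
}
```

Every operation needs a chain of instanceof checks like this one, and nothing in the type system stops a caller from passing in an object that is not a JSON value at all, which is the trade-off raised earlier in the thread.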

    larsga commented 3 years ago

    Letting the user supply the JSON parser and representation is a real no-no, because it's going to make adoption so much harder for users. We have to give them a complete package they can use right out of the box.

    But I have to say I find it difficult to understand what you're concerned about. JSLT is an entire language with functions, value types, operators, expressions, etc etc. JSON, by contrast, is a very small language. So small, in fact, that json.org has room for the entire grammar in two different representations on the front page.

    In my experience the hardest part of parsing JSON is parsing the numbers. Seriously. And the number() function requires us to have a number parser, anyway.

    So ... why worry about this?

    The approach taken in the new branch will let anyone who wants to plug in their own JSON parser and representation do so anyway, so that option will still be there. It's just that you won't have to.

    Yeah, sure, there's a cost in effort to maintain a JSON parser, but it's my effort.

    biochimia commented 3 years ago

    Apologies. In my last comment I meant to suggest (and didn't) that the JSLT runtime could be defined in terms of Java interfaces. Of course, JSLT should still be usable out of the box, as it is today, and offer at least one JSON parser integration, be it Jackson or JSLT's own JSON parser.

    The main point I wanted to express is that it might be easier to hook up different JSON parsers (and potentially Avro, protobuf libraries) to the standard Java interfaces required by the runtime.


    About rolling your own JSON parser, my earlier comment was meant to question how well the different goals are being addressed by the approach:

    • Users may need to avoid clashes with other dependencies that need different versions of Jackson.

    Clearly, this goal is addressed by not using Jackson, JSLT gets out of the dependency game.

    • Potentially better performance by using a more efficient JSON representation.

    On this one, I'd venture a maybe. Yes, a focussed implementation can offer better performance than a general-purpose parser. That said, performance is not a static game, nor one that has a single answer for all use cases.

    My concern with using this as a reason to roll your own parser is that the faster implementation today for a set of use cases may not be the fastest tomorrow or for a different set of use cases.

    So, while you may come up with a faster parser, I'm not convinced this approach properly addresses the goal.

    • Support processing binary formats such as Protobuf or Avro for higher performance.

    This goal is not addressed by switching from Jackson to a custom parser and interface. Jackson today has support for more data formats, and this support is lost.

    larsga commented 3 years ago

    I think everything you write here is totally fair.

    My plan is to make a version of JSLT which defines its own interfaces for the JSON representation.

    I also plan to make a full JSON parser and implementation of the JSON representation to bundle with JSLT.

    However, I very much want it to be possible to plug in other JSON parsers and representations, for those who prefer that. I think it would make a lot of sense to offer a separate artifact that has a Jackson binding, so that anyone who wants to use Jackson can keep doing that.

    This also means that if someone wants to try supporting Avro via Jackson that should also be possible.

    In other words, it looks to me like this should satisfy everyone?

    lukaszgendek commented 3 years ago

    In my project we use the kotlinx serialization library, which brings its own JSON representation, different from Jackson's. A Jackson-less JSLT would be really valuable.

    alturkovic commented 1 year ago

    Is this idea still being pursued?

    It seems like the branch own-json-interfaces has been inactive since 2021?

    larsga commented 1 year ago

    I originally started this branch because it seemed Schibsted needed a non-Jackson JSLT, and it seemed like a good idea anyway. Schibsted then expressed skepticism about this approach (see @biochimia above), and nobody else has appeared to be very interested, so I set it aside.

    I still think this could be a valuable alternative to the Jackson-based implementation, but if users are not interested then there's little point.