schibsted / jslt

JSON query and transformation language
Apache License 2.0

equivalent to jolt recursivelySquashNulls #340

Open samer1977 opened 1 month ago

samer1977 commented 1 month ago

Hi,

I'm coming from a Jolt background and now find myself learning JSLT, because Apache NiFi introduced a new JSON transformation that uses JSLT and I'd like to see whether I can get the best of both worlds. It's a totally different mindset, but I can see how close it is to XQuery for XML. I'm surprised that no one has asked this before, because getting rid of all null values is a common problem in JSON transformation. Jolt has a function called recursivelySquashNulls that removes all nulls in nested JSON recursively, but I could not find anything similar in JSLT. Can someone please write the spec for it in JSLT? I spent the whole day trying to figure it out, but it's not that easy, especially when a nested value can be a complex object, an array of complex objects, or even an array of simple types. I would like to see whether JSLT can address all these scenarios without an overly convoluted spec.

Thanks

samer1977 commented 1 month ago

I asked the question above but got no answer, and I'm not sure why. If you come from a Jolt background you have probably used this function; although it doesn't work perfectly, it helps sometimes and is good to have as an option. I started learning JSLT a couple of days ago and it caught my interest. I can see cases where JSLT can be a better option than Jolt and might simplify things. I'm not sure about performance, though: I made a comparison using NiFi, running both specs on the same input to produce the same output, and Jolt always had a slight edge. Regarding the question above, here is what I was able to come up with, and I hope it is correct:

def squashNullsRecursive(obj)
   let simple = { for ($obj) .key : .value if (.value != null and not(is-object(.value))) }
   let complex = { for ($obj) .key : squashNullsRecursive(.value) if (is-object(.value)) }
   let array = { for ($obj) .key : [for (.value) . if (not(is-object(.)) and . != null)] +
                                   [for (.value) squashNullsRecursive(.) if (is-object(.) or is-array(.))]
                 if (is-array(.value)) }

   $array + $complex + $simple

Input:


{
  "x": "x1",
  "y": "y2",
  "z": {
    "z1": "z11",
    "z2": null,
    "z3": [
      1,
      {
        "zzz": "skid",
        "zzz1": null
      },
      2
    ]
  }
}

squashNullsRecursive(.)

larsga commented 1 month ago

I didn't answer because I don't have time to write this function from scratch.

You're on the right track, but at the top level of your function I'd use if and test the input with is-array and is-object to separate the cases: object, array, something else. You can write it much more simply and cleanly that way.

samer1977 commented 1 month ago

Can you please give an example of the simplification? I'm not sure what you mean by if and test. Thanks

larsga commented 1 month ago

You know what an if statement is, right? What's inside the () is the test.

samer1977 commented 1 month ago

Sorry, I still don't get it. I thought I was using an if statement with the for loop, and I thought this was the clean way per the documentation. I know what an if statement is. I might be slow and not as smart as you are, but I know I can write a better flatten-object than yours ;)
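
The thread never spells out the three-way split suggested above (object, array, something else). A minimal, untested sketch of that structure might look like the following. The function name squash-nulls is hypothetical and does not come from the thread or from the JSLT library, and the sketch assumes a JSLT version that accepts an if filter inside for comprehensions, as the code posted earlier already does.

```jslt
// Hypothetical helper, not part of JSLT itself: recursively drops nulls.
def squash-nulls(node)
  if (is-object($node))
    // Object: keep only members whose value is not null, recursing into each value.
    {for ($node) .key : squash-nulls(.value) if (.value != null)}
  else if (is-array($node))
    // Array: drop null elements and recurse into the rest, preserving their order.
    [for ($node) squash-nulls(.) if (. != null)]
  else
    // Anything else (string, number, boolean) is returned unchanged.
    $node

squash-nulls(.)
```

Because each case is handled once at the top of the function, nested arrays and arrays of simple values need no special treatment. Note that this variant also removes null elements from arrays, which is a policy choice rather than something the thread settles.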

catull commented 1 month ago

This one works

def squashNulls (obj)
  from-json (replace (replace (replace (replace (string ($obj), "\\\"[^\"]+\\\":null", ""), ",,", ","), ",}", "}"), ",]", "]"))

squashNulls (.)

It could be reduced to a simpler replace, if that function supported positional patterns.

The last two replacements could be collapsed into one that matches "," followed by either } or ], written as replace (s, ",([}\\]])", "$1") or replace (s, ",([}\\]])", "\1") or replace (s, ",([}\\]])", "&1"), or whichever back-reference mechanism there is.

What is used underneath replace? Is it plain Java?

It works on RegexPlanet, see https://www.regexplanet.com/share/index.html?share=yyyyf6v7w2d Click on 'Java'.

I am aware this is not what @samer1977 asked for.

catull commented 1 month ago

I checked, see https://github.com/schibsted/jslt/blob/master/core/src/main/java/com/schibsted/spt/data/jslt/impl/BuiltinFunctions.java#L931

Java regex Patterns are used internally, but the replacement does not support positional patterns. I'll open an issue for that.

samer1977 commented 1 month ago

String replace? That looks scary from a performance perspective, but I guess I need to do some testing to find out.

catull commented 1 month ago

My original algorithm did not support the first property of an object being null. It only worked if the null property was in the middle or at the end.

> string replace? that looks scary from performance perspective but I guess I need to do some testing and find out

I also got rid of two nested replace calls, going from 4 calls to 2.

Better performance, right?

This one now supports initial nulls in objects:

def squashNulls (obj)
  from-json (
    replace (
      replace (
          string ($obj),
          ",?\\\"[^\"]+\\\":null",
          ""
      ),
      "\\{,",
      "{"
    )
  )

squashNulls (.)

Tested on this input:

{
  "w": null,
  "x": "x1",
  "y": "y2",
  "z": {
    "z1": "z11",
    "z2": null,
    "z3": [
      1,
      {
        "zzz": "skid",
        "zzz1": null
      },
      2
    ]
  }
}

catull commented 1 month ago

@samer1977 By the way, what is the policy on null values in arrays?

[ null, 2, 7, { "a": 1, "b": 2 }]

Should that null be dropped? The size of the array would change.

My algorithm only drops attributes that are null. That does not change the objects in any way I care about, i.e. I consider the objects { "a": null, "b": 7 } and { "b": 7 } structurally equivalent.
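
To make the two policies concrete, here is a small, untested sketch of a recursive variant that drops only null-valued object members and leaves null array elements in place, so array sizes never change. The name squash-null-keys is hypothetical and not part of JSLT, and the if filter inside the object comprehension is assumed to be available, as in the code posted earlier in the thread.

```jslt
// Hypothetical helper: removes null-valued object members only.
def squash-null-keys(node)
  if (is-object($node))
    // Drop members whose value is null; recurse into the remaining values.
    {for ($node) .key : squash-null-keys(.value) if (.value != null)}
  else if (is-array($node))
    // No filter here, so null elements stay and the array keeps its length.
    [for ($node) squash-null-keys(.)]
  else
    // Scalars and null itself are returned unchanged.
    $node

squash-null-keys(.)
```

Under this policy the array [ null, 2, 7, { "a": 1, "b": 2 }] is intended to pass through with its leading null intact. Adding if (. != null) to the array comprehension would switch to the other policy and drop it.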