tomnomnom / gron

Make JSON greppable!
MIT License
13.76k stars 326 forks

Adds JSON encoded data format #42

Closed csabahenk closed 6 years ago

csabahenk commented 6 years ago

Introducing the -j, --json switch:

> gron -j <<<  '{"a":{"b":["c"]}}'  | tee out.jgron          
[[],{}]                 
[["a"],{}]              
[["a","b"],[]]          
[["a","b",0],"c"] 
> cat out.jgron  | gron -j -u                                            
{                       
  "a": {                
    "b": [              
       "c"              
    ]                   
  }                     
}                       
tomnomnom commented 6 years ago

Thanks for your PR, @csabahenk!

I think this will take me a little while to review. In the meantime, can you tell me a bit more about the output format and what it's good for? I'm finding it a little difficult to see why I might want it.

csabahenk commented 6 years ago

I updated the commits to amend some suboptimal allocations. I'll provide rationale / use case later.

csabahenk commented 6 years ago

So this feature comes in handy when dealing with JSON of unlimited depth (nesting level), or "deep JSON".

Terminology: in the kind of Javascript expression that gron emits, I call the left-hand side the path and the right-hand side the value. (I don't know if you've settled on particular terms in this regard.)

With deep JSON, two problems arise with the Javascript expression format that gron uses: the deeper the nesting, the more brackets and quoting accumulate inside the path, so paths become hard to match reliably with regular expressions; and a program that wants to manipulate the path has to parse Javascript syntax rather than plain JSON.

Both of these are overcome with the alternative format implemented by this PR, that is, [<array of path components>, <value>]. Here brackets occur predictably and only at the edges of the path expression. And since each line is itself JSON, the processing program can just call a JSON parser on each line of the input, and a JSON generator to produce output that can be fed back to ungron.
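To illustrate the point, a line filter over the -j stream needs nothing beyond a JSON parser. A minimal sketch (not part of the PR; the filter_gron name is made up for illustration) that keeps only entries whose path starts with a given key:

```ruby
require "json"

# Each line of `gron -j` output is itself valid JSON, so a transformer
# can parse, inspect, and re-emit lines without touching Javascript syntax.
# Keeps the root entry and any entry whose path begins with first_key.
def filter_gron(lines, first_key)
  lines.filter_map do |line|
    path, value = JSON.parse(line)
    [path, value].to_json if path.empty? || path.first == first_key
  end
end
```

Piped between `gron -j` and `gron -j -u`, a filter like this round-trips cleanly because both its input and output are JSON stream encoded gron data.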


At this point I'd like to stop being theoretical and provide an example. However, it's not so easy to find an example that isn't overly contrived, because most JSON you find out in the wild represents an API or a certain kind of object, which has a fixed layout that puts an upper bound on the depth.

So I'm putting forward the actual use case I have: mangling HTML. I have a tool called HTMLi that parses HTML and can output a JSON representation of it, and can convert the JSON back to HTML. Please give it a go. I intended the JSON representation to be the means to extract data from HTML, but it was only upon finding out about gron that I really felt empowered to do it. However, HTML, and thus the JSON representing it, is deep, and I bumped into the aforementioned problems. That's what led me to this idea of JSON encoding for gron data. So here is an example of a "transformer" that was made easy to write by using the JSON stream encoded gron data.

The gron-tag-extract.rb script reads and emits JSON stream encoded gron data. It takes a list of HTML tags as arguments; all occurrences of those tags in the input document are collected and assembled into an ordered list (<ol>), with an indication of their path in the document (in the sense of the HTMLi/gron representation).

Here is the script:

#!/usr/bin/env ruby

require "json"

# Render a path (minus its two leading components) as a dash-joined string.
pathrep = proc { |k| k[2..-1].join("-") }
# Emit a [path, value] pair as one line of JSON stream encoded gron data.
emit = proc { |*a| puts a.to_json }

# The HTML tags to extract are given as command line arguments.
tags = $*

tagmap = {}    # completed matches: path of a matching tag => its collected entries
tagstage = {}  # candidates under construction: path => entries collected so far

STDIN.each do |line|
  path, val = JSON.parse line
  # An empty object opens a new node; stage it as a candidate subtree root.
  if val == {}
    tagstage[path] = []
  end
  tagstage.each { |spath, vlist|
    if path[0...spath.size] == spath
      # Still inside the staged subtree: record the entry relative to it.
      vlist << [path[spath.size..-1], val]
      # A direct child component naming a wanted tag promotes the candidate.
      if path.size == spath.size + 1 and tags.include? path[-1]
        tagmap[spath + path[-1..-1]] = vlist
      end
    else
      # We've left the subtree; discard the candidate.
      tagstage.delete spath
    end
  }
end

# Reassemble the matches into an ordered list (<ol>), one <li> per match,
# each prefixed by the path at which it was found.
tagmap.each_with_index do |rec, i|
  path, vlist = rec
  emit.call ["ol", i, "li", 0], pathrep[path] + ": "
  vlist.each { |val|
    emit.call ["ol", i, "li", 1] + val[0], val[1]
  }
  emit.call ["ol", i, "typ"], "tag"
end
emit.call ["typ"], "tag"

Try this to get all the links on golang.org:

$ curl -s https://golang.org | htmli.rb -t json | gron -j | \
  gron-tag-extract.rb a | \
  gron -u -j | htmli.rb -f json -c 1

(Note that htmli.rb gives up on dangling closing tags, and it turned out that well-respected sites like Github and json.org contain such nastiness, so you can't pull this trick just anywhere.)


And it's true that such a script could be written without bringing gron to the table, working with the original JSON, and it wouldn't be a much bigger deal: the JSON tree would just need to be traversed, keeping a record of the path, and when a node representing a matching tag is found, it would be added to the collection. And in general you might argue that at the point of creating transformers more complex than per-line processing, one is better off without gron, going for the traversal thingy instead. (And then conclude that -j is not much more than an encouragement for an anti-pattern.)
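For comparison, that traversal alternative might be sketched like this (illustrative only; collect_nodes is a made-up name, not part of HTMLi or gron), threading the path through the recursion and collecting nodes that satisfy a predicate:

```ruby
require "json"

# Walk a parsed JSON tree, keeping a record of the path, and collect
# every [path, node] pair for which the given block returns true.
def collect_nodes(node, path = [], found = [], &match)
  found << [path, node] if match.call(path, node)
  case node
  when Hash  then node.each { |k, v| collect_nodes(v, path + [k], found, &match) }
  when Array then node.each_with_index { |v, i| collect_nodes(v, path + [i], found, &match) }
  end
  found
end
```

The upfront cost is exactly what the paragraph above describes: one must decide what context to thread through before any exploration can happen.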

And yes, it's not that gron allows us to write programs that process JSON in novel ways. What gron does is let one make sense of the data at hand just by looking at it, and it gives rise to an explorative, interactive approach to cobbling together the data processing code. One looks and adjusts step by step -- rather than having to start off with the initial investment of creating the traversal routine and putting in the effort to settle on what kind of context is to be threaded through. The tag extraction code cited above is also the result of this incremental process; it started out from grepping for this and that. So it's not the case that gron-based programs process JSON in novel ways -- indeed, with gron, the programs that process JSON can be written in novel ways. And in this vision -j can also be deemed worthy.

tomnomnom commented 6 years ago

Hey! Just to let you know that I'm planning to review this this week. It's a hefty change, but you put forward a very good case for merging it. Thank you!

csabahenk commented 6 years ago

Would you like me to resolve the conflict and update the PR?

tomnomnom commented 6 years ago

@csabahenk I've finally had a chance to have a look through everything! Yes, if you fix the conflicts I will merge it :) Thank you!

csabahenk commented 6 years ago

@tomnomnom all conflicts are reduced to ashes!

tomnomnom commented 6 years ago

Merged! Thank you so much for your contribution :)