jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/
Other
30.32k stars 1.57k forks source link

Proposal: innerjoin and friends #975

Open pkoppstein opened 9 years ago

pkoppstein commented 9 years ago

This proposal envisions the addition of four new builtin function names for the following five filters:

objectify/1 -- create an object from an array of headers (interpreted as
               key names) and an array of values;

query/1     -- query one object using an object or array of strings to specify
               the keys of interest;

coalesce/1  -- combine an array of objects into a single object by 
               coalescing values (at a specified key) into an array, e.g.:

           [{id:1, "a":1}, {id:1, "a":2}] | coalesce("a")
           produces:  {"id": 1, "a": [1,2]}

coalesce/2  -- given an array of objects, a query object to be used for
               grouping, and the name of a key to be used for coalescing,
           produce an array of coalesced objects, one per group.

innerjoin(r1; r2; queryobject) -- compute the JSON analog of the "inner join"
               of r1 and r2, where these are the JSON analogs of two tables,
           and queryobject defines the join key.

Note: "innerjoin/3" and its implementation are experimental; in particular, it might be better to define innerjoin(queryobject) with the input being an array [r1, r2, ...].

To illustrate the power and utility of these filters, consider the following sequence of two tasks:

(1) the transformation of a "flat" CSV or TSV file (or similar array of arrays) into a non-flat object-oriented structure;

(2) the computation of an "inner join".

Suppose we start with a CSV file containing the following data recording responses on a multiple-choice questionnaire:

Table 1

id | question_id| respondent_id| value_response
-------------------------------| --------------
1  | 1          | 1            | SFU           
2  | 1          | 1            | UBC           
3  | 2          | 1            | BU            
4  | 2          | 1            | RI
5  | 1          | 2            | ABC
6  | 1          | 2            | DEF
7  | 2          | 2            | BU            
8  | 2          | 2            | XYZ

The first task is to transform this into a a JSON representation in which the responses for each respondent-question are available as a single array. That is, we want to produce an array of objects, each having the form:

{ questionid: , respondentid: , value_responses: [ ... ] }

The second task is to compute the JSON representation of the inner join of this relation with the following:

Table 2

question_id | question_name
---------------------------
1           | Q1
2           | Q2

Assuming we have read each of the flat files into an array of arrays in the obvious way, the first task can be accomplished as follows, using the proposed filters:

For the first table:

  .[0] as $headers
  | .[1:] | map(objectify($headers));
  | map(del(.id))
  | coalesce( { "question_id": 0, "respondent_id": 0}; "value_response" )

If this representation of the first table is defined as Table1 and if Table2 is the objectified representation of Table 2, then the innerjoin can be computed by:

  innerjoin(Table1; Table2; {question_id: 0}).

Implementation

Converting a CSV file to a JSON array of arrays can most robustly be accomplished using a tool such as any-json, so we won't dwell on that further.

Here then are the proposed definitions of the new filters, with comments:

def objectify(headers):
  . as $in
  | reduce range(0; headers|length) as $i ({}; .[headers[$i]] = $in[$i] );

# query(q) extracts details from the input object: q can either be an
# object specifying the keys of interest, or an array of strings
# specifying the keys of interest.  In all cases, the result respects
# the ordering of keys in q, even if these are repeated.
def query(q):
  . as $in
  | reduce (q | if type == "object" then keys[] else .[] end) as $k
      ({}; . + { ($k) : ($in[$k]) }) ;

# Given an array of objects with a key specified by "key", sum the
# objects using + for all keys except "key", and coalescing the values at "key":
def coalesce( key ):
  reduce .[] as $o ({}; .[key] += [$o[key]] | . + ($o|delpaths([[key]]) ) );

# Input: an array of objects
# group: an object defining a grouping criterion
# ckey: the key to coalesce (a string)
# Output: an array of objects, one per group
def coalesce(group; ckey):
  group_by( query(group) )
  | map(coalesce(ckey)) ;

# Assuming r1 and r2 are objectified representations of relational
# tables, innerjoin(r1;r2;q) computes the objectifed representation of
# their innerjoin, using q as the query object defining the join-key.
# (An example is given below.)
def innerjoin(r1; r2; queryobject):
  reduce r1[] as $row1 ([];
    (r2 | map(select( query(queryobject) == ($row1|query(queryobject)) ))) as $row2
    | . + ($row2 | map( $row1 + . )) );

Example of innerjoin/3

Given:

def r1:
  [ {qid:1, qname: "Q1"}, {qid:2, qname: "Q2"}];

def r2:
  [ {student_id: "A", qid: 1},
    {student_id: "A", qid: 2},
    {student_id: "B", qid: 1},
    {student_id: "B", qid: 2}];

then innerjoin( r1; r2; {qid:null}) produces:

[ {"qid":1,"qname":"Q1","student_id":"A"},
  {"qid":1,"qname":"Q1","student_id":"B"},
  {"qid":2,"qname":"Q2","student_id":"A"},
  {"qid":2,"qname":"Q2","student_id":"B"} ]

[NOTE: The definition of query/1 has been updated, mainly to ensure that the returned object respects the ordering of keys specified by the argument (q), but also to allow q to be either an object or an array. The original definition was:

def query(object): with_entries( select( .key as $key | object | has( $key ) ));

]

nicowilliams commented 9 years ago

The nice thing about a PR is that it makes it easy to comment on specific things.

Comments:

nicowilliams commented 9 years ago

Also, innerjoin could produce a stream.

Here's an alternative innerjoin (NOTE: NOT TESTED):

def innerjoin(t; queryobject; queryobject2key):
  if type == "array" then
    t as $row1 | .[] |
    select(query(queryobject) == ($row1|query(queryobject))) |
    map($row1 + .)
  elif type == "object" then
    .index as $index | .table as $dot | t as $row1 |
    $index[$row1|query(queryobject)|queryobject2key] |
    select(query(queryobject) == ($row1|query(queryobject))) |
    map($row1 + .)
  else
    error("innerjoin: invalid input value type")
  end;

This allows one of the tables to be a stream, outputs a stream, and allows the other table to be either an unindexed table (array) or an indexed table (object). When an indexed table is used this does O(M*log(M)), as hoped for.

ghost commented 8 years ago

Just adding a comment supporting this. I'm pretty sure I've implemented ad-hoc versions of these concepts several times, and I'd love to see them in the standard library.

nicowilliams commented 8 years ago

@slapresta A PR would help, as would tests.