Proposal: innerjoin and friends

pkoppstein commented 9 years ago

This proposal envisions the addition of four new builtin function names for the following five filters:

objectify/1 -- create an object from an array of headers (interpreted as
               key names) and an array of values;

query/1     -- query one object using an object or array of strings to specify
               the keys of interest;

coalesce/1  -- combine an array of objects into a single object by 
               coalescing values (at a specified key) into an array, e.g.:

           [{id:1, "a":1}, {id:1, "a":2}] | coalesce("a")
           produces:  {"id": 1, "a": [1,2]}

coalesce/2  -- given an array of objects, a query object to be used for
               grouping, and the name of a key to be used for coalescing,
           produce an array of coalesced objects, one per group.

innerjoin(r1; r2; queryobject) -- compute the JSON analog of the "inner join"
               of r1 and r2, where these are the JSON analogs of two tables,
           and queryobject defines the join key.

Note: "innerjoin/3" and its implementation are experimental; in particular, it might be better to define innerjoin(queryobject) with the input being an array [r1, r2, ...].

To illustrate the power and utility of these filters, consider the following sequence of two tasks:

(1) the transformation of a "flat" CSV or TSV file (or similar array of arrays) into a non-flat object-oriented structure;

(2) the computation of an "inner join".

Suppose we start with a CSV file containing the following data recording responses on a multiple-choice questionnaire:

Table 1

id | question_id| respondent_id| value_response
-------------------------------| --------------
1  | 1          | 1            | SFU           
2  | 1          | 1            | UBC           
3  | 2          | 1            | BU            
4  | 2          | 1            | RI
5  | 1          | 2            | ABC
6  | 1          | 2            | DEF
7  | 2          | 2            | BU            
8  | 2          | 2            | XYZ

The first task is to transform this into a a JSON representation in which the responses for each respondent-question are available as a single array. That is, we want to produce an array of objects, each having the form:

{ questionid: , respondentid: , value_responses: [ ... ] }

The second task is to compute the JSON representation of the inner join of this relation with the following:

Table 2

question_id | question_name
---------------------------
1           | Q1
2           | Q2

Assuming we have read each of the flat files into an array of arrays in the obvious way, the first task can be accomplished as follows, using the proposed filters:

For the first table:

  .[0] as $headers
  | .[1:] | map(objectify($headers));
  | map(del(.id))
  | coalesce( { "question_id": 0, "respondent_id": 0}; "value_response" )

If this representation of the first table is defined as Table1 and if Table2 is the objectified representation of Table 2, then the innerjoin can be computed by:

  innerjoin(Table1; Table2; {question_id: 0}).

Implementation

Converting a CSV file to a JSON array of arrays can most robustly be accomplished using a tool such as any-json, so we won't dwell on that further.

Here then are the proposed definitions of the new filters, with comments:

def objectify(headers):
  . as $in
  | reduce range(0; headers|length) as $i ({}; .[headers[$i]] = $in[$i] );

# query(q) extracts details from the input object: q can either be an
# object specifying the keys of interest, or an array of strings
# specifying the keys of interest.  In all cases, the result respects
# the ordering of keys in q, even if these are repeated.
def query(q):
  . as $in
  | reduce (q | if type == "object" then keys[] else .[] end) as $k
      ({}; . + { ($k) : ($in[$k]) }) ;

# Given an array of objects with a key specified by "key", sum the
# objects using + for all keys except "key", and coalescing the values at "key":
def coalesce( key ):
  reduce .[] as $o ({}; .[key] += [$o[key]] | . + ($o|delpaths([[key]]) ) );

# Input: an array of objects
# group: an object defining a grouping criterion
# ckey: the key to coalesce (a string)
# Output: an array of objects, one per group
def coalesce(group; ckey):
  group_by( query(group) )
  | map(coalesce(ckey)) ;

# Assuming r1 and r2 are objectified representations of relational
# tables, innerjoin(r1;r2;q) computes the objectifed representation of
# their innerjoin, using q as the query object defining the join-key.
# (An example is given below.)
def innerjoin(r1; r2; queryobject):
  reduce r1[] as $row1 ([];
    (r2 | map(select( query(queryobject) == ($row1|query(queryobject)) ))) as $row2
    | . + ($row2 | map( $row1 + . )) );

Example of innerjoin/3

Given:

def r1:
  [ {qid:1, qname: "Q1"}, {qid:2, qname: "Q2"}];

def r2:
  [ {student_id: "A", qid: 1},
    {student_id: "A", qid: 2},
    {student_id: "B", qid: 1},
    {student_id: "B", qid: 2}];

then innerjoin( r1; r2; {qid:null}) produces:

[ {"qid":1,"qname":"Q1","student_id":"A"},
  {"qid":1,"qname":"Q1","student_id":"B"},
  {"qid":2,"qname":"Q2","student_id":"A"},
  {"qid":2,"qname":"Q2","student_id":"B"} ]

[NOTE: The definition of query/1 has been updated, mainly to ensure that the returned object respects the ordering of keys specified by the argument (q), but also to allow q to be either an object or an array. The original definition was:

def query(object): with_entries( select( .key as $key | object | has( $key ) ));

]

nicowilliams commented 9 years ago

The nice thing about a PR is that it makes it easy to comment on specific things.

Comments:

I'd rather that objectify/1's headers argument closure output a stream that objectify then collects. Wherever possible we should deal in streams rather than arrays.
The kind of queries you're proposing is commonly called "query by example" (QBE). Not sure that I'd propose calling it qbe rather than query though.
coalesce wants a better name, I think. (Not that the SQL coalesce is applicable here, so that the name is available, but still.)
innerjoin is clever, but of course, there's no "index" here, so its performance in O(N_M), whereas it'd be nice if it was more like O(N_log(M)). Also, it ignores .; perhaps one of the arrays could be . and the other could be a stream argument (I notice you only use r1[], that one seems like the perfect one to be a stream instead of an array).

nicowilliams commented 9 years ago

Also, innerjoin could produce a stream.

Here's an alternative innerjoin (NOTE: NOT TESTED):

def innerjoin(t; queryobject; queryobject2key):
  if type == "array" then
    t as $row1 | .[] |
    select(query(queryobject) == ($row1|query(queryobject))) |
    map($row1 + .)
  elif type == "object" then
    .index as $index | .table as $dot | t as $row1 |
    $index[$row1|query(queryobject)|queryobject2key] |
    select(query(queryobject) == ($row1|query(queryobject))) |
    map($row1 + .)
  else
    error("innerjoin: invalid input value type")
  end;

This allows one of the tables to be a stream, outputs a stream, and allows the other table to be either an unindexed table (array) or an indexed table (object). When an indexed table is used this does O(M*log(M)), as hoped for.

ghost commented 8 years ago

Just adding a comment supporting this. I'm pretty sure I've implemented ad-hoc versions of these concepts several times, and I'd love to see them in the standard library.

nicowilliams commented 8 years ago

@slapresta A PR would help, as would tests.

jqlang / jq

Proposal: innerjoin and friends #975

Implementation

Example of innerjoin/3