Open pkoppstein opened 9 years ago
The nice thing about a PR is that it makes it easy to comment on specific things.
Comments:
- `objectify/1`'s `headers` argument closure should output a stream that `objectify` then collects. Wherever possible we should deal in streams rather than arrays.
- `qbe` rather than `query`, though.
- `coalesce` wants a better name, I think. (Not that the SQL coalesce is applicable here, so the name is available, but still.)
- `innerjoin` is clever, but of course there's no "index" here, so its performance is O(N*M), whereas it'd be nice if it were more like O(N*log(M)). Also, it ignores `.`; perhaps one of the arrays could be `.` and the other could be a stream argument (I notice you only use `r1[]`, so that one seems like the perfect one to be a stream instead of an array).
- Also, `innerjoin` could produce a stream.
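To make the first comment concrete, here is a minimal sketch of an `objectify/1` whose `headers` argument is a stream that the filter collects itself. The definition is hypothetical (the proposed one is not reproduced in this thread), and the sample row and key names are made up for illustration.

```shell
result=$(jq -ncS '
  # Hypothetical variant of objectify/1: headers is a stream of key names
  # that the filter collects itself; the input is one row (an array).
  def objectify(headers):
    . as $row
    | [headers] as $keys
    | reduce range(0; $keys | length) as $i ({}; .[$keys[$i]] = $row[$i]);

  ["1", "q1", "a"] | objectify("respondentid", "questionid", "value")
')
echo "$result"
```

With the sample row this prints `{"questionid":"q1","respondentid":"1","value":"a"}` (keys sorted by `-S`). A caller holding the headers in an array would pass `$headers[]`.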
Here's an alternative innerjoin (NOTE: NOT TESTED):
    # Relies on query/1 from the proposal.
    # Input: an array of rows, or an object whose .index maps each key
    # string to an array of rows (an assumption about the index layout).
    def innerjoin(t; queryobject; queryobject2key):
      if type == "array" then
        t as $row1 | .[]
        | select(query(queryobject) == ($row1 | query(queryobject)))
        | $row1 + .          # merge the two matching rows
      elif type == "object" then
        .index as $index | t as $row1
        | ($index[$row1 | query(queryobject) | queryobject2key] // [])[]
        | select(query(queryobject) == ($row1 | query(queryobject)))
        | $row1 + .
      else
        error("innerjoin: invalid input value type")
      end;
This allows one of the tables to be a stream, outputs a stream, and allows the other table to be either an unindexed table (array) or an indexed table (object). When an indexed table is used, the join takes O(N*log(M)), as hoped for.
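One way to exercise the streaming join end to end is sketched below. Several pieces are assumptions rather than part of the proposal: `query/1` here is a simplified stand-in that projects an object onto the keys of `q`, `indexed/1` is a hypothetical helper that builds an index mapping each key string to an array of rows, `tojson` serves as the key function, and the sample rows are invented.

```shell
result=$(jq -ncS '
  # Assumed stand-in for the proposed query/1: project the input object
  # onto the keys named in q.
  def query(q): with_entries(select(.key as $k | q | has($k)));

  # Streaming inner join: t is a stream of rows; the input is either an
  # array of rows or an object of the form {index: {key: [rows]}}.
  def innerjoin(t; q; q2key):
    if type == "array" then
      t as $row1 | .[]
      | select(query(q) == ($row1 | query(q)))
      | $row1 + .
    elif type == "object" then
      .index as $index | t as $row1
      | ($index[$row1 | query(q) | q2key] // [])[]
      | select(query(q) == ($row1 | query(q)))
      | $row1 + .
    else
      error("innerjoin: invalid input value type")
    end;

  # Hypothetical helper: build the index once, keyed by the JSON text of
  # each row projected onto q.
  def indexed(q):
    {index: (reduce .[] as $r ({}; .[$r | query(q) | tojson] += [$r]))};

  [{"qid": 1, "text": "Color?"}, {"qid": 2, "text": "Size?"}] as $questions
  | [{"qid": 1, "value": "red"}, {"qid": 1, "value": "blue"}, {"qid": 3, "value": "x"}] as $answers
  | [ ($questions | innerjoin($answers[]; {qid: null}; tojson)) ],
    [ ($questions | indexed({qid: null}) | innerjoin($answers[]; {qid: null}; tojson)) ]
')
echo "$result"
```

Both forms print the same two joined rows. The unindexed form scans the whole table for every stream row, while the indexed form does one object lookup per stream row. Using `tojson` as the key is only safe here because the projection has a single key; with several keys, the key order would have to be normalized first.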
Just adding a comment supporting this. I'm pretty sure I've implemented ad-hoc versions of these concepts several times, and I'd love to see them in the standard library.
@slapresta A PR would help, as would tests.
This proposal envisions the addition of four new builtin function names for the following five filters:
Note: "innerjoin/3" and its implementation are experimental; in particular, it might be better to define innerjoin(queryobject) with the input being an array [r1, r2, ...].
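As a rough illustration of the alternative signature mentioned in the note, the sketch below folds an input array of tables into one joined table. It is not the proposed implementation: `query/1` is a simplified stand-in, the join is a plain nested loop, and the sample tables are invented.

```shell
result=$(jq -ncS '
  # Assumed stand-in for the proposed query/1: project the input object
  # onto the keys named in q.
  def query(q): with_entries(select(.key as $k | q | has($k)));

  # Hypothetical innerjoin/1: the input is an array of tables [r1, r2, ...];
  # each later table is joined onto the rows accumulated so far.
  def innerjoin(q):
    reduce .[1:][] as $table (.[0];
      [ .[] as $a
        | $table[] as $b
        | select(($a | query(q)) == ($b | query(q)))
        | $a + $b ]);

  [ [{"qid": 1, "text": "Color?"}, {"qid": 2, "text": "Size?"}],
    [{"qid": 1, "value": "red"}, {"qid": 3, "value": "x"}],
    [{"qid": 1, "weight": 2}] ]
  | innerjoin({qid: null})
')
echo "$result"
```

With these tables only the `qid` 1 rows survive all three joins, giving `[{"qid":1,"text":"Color?","value":"red","weight":2}]`.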
To illustrate the power and utility of these filters, consider the following sequence of two tasks:
(1) the transformation of a "flat" CSV or TSV file (or similar array of arrays) into a non-flat object-oriented structure;
(2) the computation of an "inner join".
Suppose we start with a CSV file containing the following data recording responses on a multiple-choice questionnaire:
Table 1
The first task is to transform this into a JSON representation in which the responses for each respondent-question pair are available as a single array. That is, we want to produce an array of objects, each having the form:
{ questionid: , respondentid: , value_responses: [ ... ] }
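For orientation, the target shape can be reached with jq's built-in `group_by` alone, independently of the proposed filters. In the sketch below the flat rows and the column order are hypothetical, since the table data is not reproduced here, and the row-to-object step is hand-written rather than using `objectify`.

```shell
result=$(jq -ncS '
  # Hypothetical flat rows: [respondentid, questionid, value_response].
  [[1, "q1", "a"], [1, "q1", "c"], [1, "q2", "b"], [2, "q1", "a"]]
  # Turn each row into an object (a hand-written stand-in for objectify).
  | map({respondentid: .[0], questionid: .[1], value_response: .[2]})
  # Group by the respondent-question pair, then collect the responses.
  | group_by([.respondentid, .questionid])
  | map({
      questionid: .[0].questionid,
      respondentid: .[0].respondentid,
      value_responses: map(.value_response)
    })
')
echo "$result"
```

Each output object carries one respondent-question pair together with all of its responses, for example `{"questionid":"q1","respondentid":1,"value_responses":["a","c"]}`.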
The second task is to compute the JSON representation of the inner join of this relation with the following:
Table 2
Assuming we have read each of the flat files into an array of arrays in the obvious way, the first task can be accomplished as follows, using the proposed filters:
For the first table:
If this representation of the first table is defined as Table1 and if Table2 is the objectified representation of Table 2, then the innerjoin can be computed by:
Implementation
Converting a CSV file to a JSON array of arrays can most robustly be accomplished using a tool such as any-json, so we won't dwell on that further.
Here then are the proposed definitions of the new filters, with comments:
Example of innerjoin/3
Given:
then
innerjoin( r1; r2; {qid:null})
produces:
[NOTE: The definition of query/1 has been updated, mainly to ensure that the returned object respects the ordering of keys specified by the argument (q), but also to allow q to be either an object or an array. The original definition was:
]