ER: distinct(s) - an efficient, stream-oriented counterpart to `unique/0`

pkoppstein commented 6 years ago

The existing unique builtin currently sorts its input, and only works on arrays. Since it has been in jq for a long while, it is unlikely to change in either respect. Hence this proposal. A reference implementation would be:

def distinct(s): reduce s as $x ({}; .[$x | (type[0:1] + tostring)] = $x) |.[];

Note that this is intended to exploit jq's optimization of key lookups without risking collisions.

One important use-case for such a function is in connection with jq's support for "relational" (SQLish) functionality: if the jq builtins ensure uniqueness, then they could use distinct(s); otherwise, users may want to do so.

An alternative implementation approach would be to define the dictionary-creation function separately:

def dict(s): reduce s as $x ({}; .[$x | (type[0:1] + tostring)] = $x);
def distinct(s): dict(s)[];

If dict/1 or equivalent is included in builtin.jq, then it would be appropriate to make it easy to check whether an entity is already in the dictionary, e.g.

# input: a dictionary using `type[0:1] + tostring`-style keys
def lookup($item): .[$item | (type[0:1] + tostring)];

nicowilliams commented 6 years ago

Nice.

Let's not have dict(s) but set(s). When we as coroutines we can add a proper dict.

pkoppstein commented 6 years ago

As previously mentioned, I'm not in a position to offer a PR, but here are some pertinent bits and pieces.

builtin.jq

# emit a set-representation of items in the stream, s
def set(s): reduce s as $x ({}; .[$x | (type[0:1] + tostring)] = $x);

# emit a stream of the distinct entities in the stream s, without sorting them
def distinct(s): set(s)[];

# input: a dictionary using keys as produced by set/1
def lookup($item): .[$item | (type[0:1] + tostring)];

jq.test

[distinct(1,1,"1", [1,1],[1,1])] null [1,"1",[1,1]]

set({a:1}, {a:1}, {a:2}) | lookup({a:1}) null {"a":1}

nicowilliams commented 6 years ago

Yes, that works. For a proper dict, what I want is to add co-routines then do something like:

def dict(keys; values):
    def pairs(keys; values):
        def pairs: (.,[first(keys),first(values)])|select(length == 2)|pairs; pairs;
    reduce (pairs(@@keys; @@values)) as .[$key, $value] ({}; .[$key|(type[0:1] + tostring)] = $value);

Then your lookup can be used for sets and dicts.

fadado commented 6 years ago

Do not forget infinite streams:

def distinct(stream):
    foreach stream as $x (
        {};
        ($x|type[0:1]+tostring) as $k
            | if has($k)
              then .["~"]=false
              else .[$k]=$x | .["~"]=true end;
        select(.["~"]) | $x)
;

odnoletkov commented 6 years ago

I think distinct makes perfect sense for infinite streams just like uniq does. It should just remove duplicates and this could be done in a streaming fashion. Of course it would not work well for streams with mostly distinct items. But in other cases it could be very efficient and useful.

leonid-s-usov commented 6 years ago

@pkoppstein I think that @odnoletkov is right and there shouldn't be any restriction on using the distinct filter even with infinite streams. It would be a way to shoot oneself into the leg, but just as wasteful would it be to attempt build an array with a "conceptually infinite stream" truly infinite streams are in any case quite specific tool, and one using those will probably have some clear plan on how to use it and whether one or another filter would apply.

odnoletkov commented 6 years ago

To illustrate my point on the infinite streams

$ jq -n 'distinct(0 | recurse(. + 1))' | head -5

I believe @fadado version would would produce expected results and won't hang

pkoppstein commented 6 years ago

@fadado - Good point!

However, I would propose saving space by not storing $x, and by avoiding the construction of a new string if $x is already a string:

def distinct(s):
  foreach s as $x ({};
    ($x|tostring) as $k
    | if .[$x|type] | has($k) then .ok = false else .[$x|type][$k] = true | .ok = true end;
    select(.ok) | $x);

@odnoletkov, @leonid-s-usov - Sorry for the mixup on my part.

jqlang / jq

ER: distinct(s) - an efficient, stream-oriented counterpart to `unique/0` #1555

builtin.jq

jq.test