Open pkoppstein opened 6 years ago
Nice.
Let's not have dict(s)
but set(s)
. When we as coroutines we can add a proper dict.
As previously mentioned, I'm not in a position to offer a PR, but here are some pertinent bits and pieces.
# emit a set-representation of items in the stream, s
def set(s): reduce s as $x ({}; .[$x | (type[0:1] + tostring)] = $x);
# emit a stream of the distinct entities in the stream s, without sorting them
def distinct(s): set(s)[];
# input: a dictionary using keys as produced by set/1
def lookup($item): .[$item | (type[0:1] + tostring)];
[distinct(1,1,"1", [1,1],[1,1])] null [1,"1",[1,1]]
set({a:1}, {a:1}, {a:2}) | lookup({a:1}) null {"a":1}
Yes, that works. For a proper dict
, what I want is to add co-routines then do something like:
def dict(keys; values):
def pairs(keys; values):
def pairs: (.,[first(keys),first(values)])|select(length == 2)|pairs; pairs;
reduce (pairs(@@keys; @@values)) as .[$key, $value] ({}; .[$key|(type[0:1] + tostring)] = $value);
Then your lookup
can be used for sets and dicts.
Do not forget infinite streams:
def distinct(stream):
foreach stream as $x (
{};
($x|type[0:1]+tostring) as $k
| if has($k)
then .["~"]=false
else .[$k]=$x | .["~"]=true end;
select(.["~"]) | $x)
;
I think distinct
makes perfect sense for infinite streams just like uniq
does. It should just remove duplicates and this could be done in a streaming fashion. Of course it would not work well for streams with mostly distinct items. But in other cases it could be very efficient and useful.
@pkoppstein I think that @odnoletkov is right and there shouldn't be any restriction on using the distinct filter even with infinite streams. It would be a way to shoot oneself into the leg, but just as wasteful would it be to attempt build an array with a "conceptually infinite stream" truly infinite streams are in any case quite specific tool, and one using those will probably have some clear plan on how to use it and whether one or another filter would apply.
To illustrate my point on the infinite streams
$ jq -n 'distinct(0 | recurse(. + 1))' | head -5
I believe @fadado version would would produce expected results and won't hang
@fadado - Good point!
However, I would propose saving space by not storing $x, and by avoiding the construction of a new string if $x is already a string:
def distinct(s):
foreach s as $x ({};
($x|tostring) as $k
| if .[$x|type] | has($k) then .ok = false else .[$x|type][$k] = true | .ok = true end;
select(.ok) | $x);
@odnoletkov, @leonid-s-usov - Sorry for the mixup on my part.
The existing
unique
builtin currently sorts its input, and only works on arrays. Since it has been in jq for a long while, it is unlikely to change in either respect. Hence this proposal. A reference implementation would be:Note that this is intended to exploit jq's optimization of key lookups without risking collisions.
One important use-case for such a function is in connection with jq's support for "relational" (SQLish) functionality: if the jq builtins ensure uniqueness, then they could use distinct(s); otherwise, users may want to do so.
An alternative implementation approach would be to define the dictionary-creation function separately:
If
dict/1
or equivalent is included in builtin.jq, then it would be appropriate to make it easy to check whether an entity is already in the dictionary, e.g.