jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/
Other
30.48k stars 1.58k forks source link

enhancement request: sigma, count, correlation #457

Open pkoppstein opened 10 years ago

pkoppstein commented 10 years ago

Three function names are introduced here:

The functions are as follows, where f and g are filters, and na is a distinguished "not available" value. For correlation and sigma, the input is an array; count ignores its input.

Given a generator, count how many times the condition is true; condition should use . to refer to a generated value. Example: count( range(0;100); . * (.|exp) < 100)

def count(generator; condition):
  reduce generator as $x (0; if ($x | condition) then .+1 else . end);

Given a generator, count how many times the condition is true but stop testing the condition and incrementing the counter once the counter is equal to max; condition should use . to refer to a generated value. count/3 is useful, for example, if we only need to know whether there are enough values to proceed.

def count(generator; condition; max):
  reduce generator as $x (0; if . < max and ($x | condition) then .+1 else . end);

sigma(f) computes the sum of values extracted from the entities in the input array. Example: sigma( .height * .weight)

def sigma( f ): reduce .[] as $o (0; . + ($o | f )) ;

correlation(f;g) computes the correlation coefficient between two variables, avoiding overflow, and computing N based on non-null pairs; if either f or g references a non-number other than null, then use correlation(f;g;na). Example: correlation( .height; .weight )

def correlation(f; g):
  count( .[]; (f != null) and (g != null) ) as $n
  | (sigma( f ) / $n) as $af
  | (sigma( g ) / $n) as $ag
  | (sigma( if f == null or g == null then 0 else (f - $af) * (g - $ag) end )) as $prod 
  | (sigma( if f == null or g == null then 0 else (f - $af) | . * .     end )) as $ssf
  | (sigma( if f == null or g == null then 0 else (g - $ag) | . * .     end )) as $ssg
  | ( $prod / (($ssf * $ssg) | sqrt ) ) ;

correlation(f;g;na) computes the correlation coefficient between f and g using correlation(f;g) after filtering out non-numbers and the value na. This implementation avoids the overhead of "map(select())" if possible.

def correlation(f; g; na):
  count( .[]; f != na and g != na and "number" == (f|type) and "number" == (g|type) ) 
    as $count
  | if $count == 0 then . 
    else map(select( f != na and g != na and "number" == (f|type) and "number" == (g|type)))
    end
  | correlation(f; g) ;

[2014.08.04: EDITED def correlation]

nicowilliams commented 10 years ago

BTW, I'll get to all your contributions soon... Sorry for the delay! Keep it up!

On Tue, Jul 1, 2014 at 11:57 PM, pkoppstein notifications@github.com wrote:

Three function names are introduced here:

  • "count" -- given a generator, count the number of times a condition is met, optionally up to some maximum;
  • "sigma" -- given a filter that extracts values from each entity in the array input, compute the sum of the values;
  • "correlation" -- compute a correlation coefficient, taking into account null and possibly missing values.

The functions are as follows, where f and g are filters, and na is a distinguished "not available" value. For correlation and sigma, the input is an array; count ignores its input.

  • correlation(f;g) and correlation(f;g;na)
  • sigma(f)
  • count(generator;condition) and count(generator;condition;max)

Given a generator, count how many times the condition is true; condition should use . to refer to a generated value. Example: count( range(0;100); . * (.|exp) < 100)

def count(generator; condition): reduce generator as $x (0; if ($x | condition) then .+1 else . end);

Given a generator, count how many times the condition is true but stop testing the condition and incrementing the counter once the counter is equal to max; condition should use . to refer to a generated value. count/3 is useful, for example, if we only need to know whether there are enough values to proceed.

def count(generator; condition; max): reduce generator as $x (0; if . < max and ($x | condition) then .+1 else . end);

sigma(f) computes the sum of values extracted from the entities in the input array. Example: sigma( .height * .weight)

def sigma( f ): reduce .[] as $o (0; . + ($o | f )) ;

correlation(f;g) computes the correlation coefficient between two variables, avoiding overflow, and computing N based on non-null pairs; if either f or g references a non-number other than null, then use correlation(f;g;na). Example: correlation( .height; .weight )

def correlation(f; g): count( .[]; (f != null) and (g != null) ) as $n | (sigma( f ) / $n) as $af | (sigma( g ) / $n) as $ag | (sigma( (f - $af) * (g - $ag))) as $prod | (sigma( (f - $af) * (f - $af))) as $ssf | (sigma( (g - $ag) * (g - $ag))) as $ssg | ( $prod / (($ssf * $ssg) | sqrt ) ) ;

correlation(f;g;na) computes the correlation coefficient between f and g using correlation(f;g) after filtering out non-numbers and the value na. This implementation avoids the overhead of "map(select())" if possible.

def correlation(f; g; na): count( .[]; f != na and g != na and "number" == (f|type) and "number" == (g|type) ) as $count | if $count == 0 then . else map(select( f != na and g != na and "number" == (f|type) and "number" == (g|type))) end | correlation(f; g) ;

— Reply to this email directly or view it on GitHub https://github.com/stedolan/jq/issues/457.

nicowilliams commented 10 years ago

@pkoppstein Some comments:

0) Can you give me a PR? Please include updatges for the manual and test cases. 1) count/3 should be implemented in terms of the new foreach instead of reduce. This is important: otherwise count/3 will iterate over all of the generator's outputs, even once it's reached the maximum, but with foreach you can terminate iteration over the generator's outputs (see the new limit as well). 2) Maybe sigma should have a sigma/2 version that takes a generator and computes over its outputs instead of iterating over an input array.

Send me a PR with these comments addressed and I'll merge it :)

stedolan commented 10 years ago

@pkoppstein Thanks for pointing out a definite lack in jq's builtins. Basics stats functions are really useful, and we're sorely lacking them at the moment.

I have a bunch of issues with the code you posted. I don't want to put you off, because you've found a problem in jq and proposed a fix and I want you to hang around and keep doing that so that I can continue being lazy :)

So, disclaimers aside, here goes:

nicowilliams commented 10 years ago

Some such builtins might belong best in a wiki page. But remember @stedolan, we now have an import facility and we could just have a builtin module or three into which to place a large variety of builtins that we don't necessarily want in the main module (because we don't want to compile them unnecessarily and because we want to not "pollute" the main documentation).

wtlangford commented 10 years ago

@nicowilliams definitely hits on one of the main benefits to having the module system - We can now add builtins with special purpose without a) polluting the main namespace, b) flooding the docs, and c) compiling less-commonly-used definitions unless needed.

n.b. that doing this would require modifications to the install process (creating a directory for them) as well as an addition to the module system adding that path to the end of the search chain. On that note, if jq gets installed in non-standard locations, we need to be able to react to that in the module system code. Food for thought.

nicowilliams commented 10 years ago

@wtlangford Well, we could have builtin modules that you must import to use but which are compiled into the jq executable. We might ship lots of modules and only a few builtin modules, in which case if you want the non-builtin ones you need to install the whole thing, not just download the executable.

pkoppstein commented 10 years ago

Thanks, @stedolan and @nicowilliams, for your comments. You've raised a wide variety of issues, ranging from technical details to larger questions about how to partition things between builtin.c, a Statistics module, and documentation, and I'd like to discuss some of these issues with you, especially from the perspective of a jq user and statistician.

(Given what they say about teaching statistics, it's no boast to say that I've done so, but more relevantly, perhaps, I work with large datasets (including large JSON datasets) every day -- that's how I came to jq. Let's call this role the "data practitioner's role". Since jq is, as I understand it, largely intended for people working with JSON data, I trust you'll agree this perspective is not out of place.)

First, however, I'd like to make two general points that may be relevant to the "what belongs where" discussion:

(Of course, if there is an existing package manager that can (or could be tailored) to do the job, so much the better.)

So let's start with nothing.

From @stedolan's comment about correlation/3 (with "na"), it seems that I can assume correlation/2 and it's handling of null are basically fine. Great. So what to do with correlation/3?

I agree that correlation/3's optimization is fairly obvious, and no doubt a data practitioner would expect correlation/3 to handle "na" efficiently (like correlation/2 handles null), so perhaps the thing to do is to put the optimized version of correlation/3 in a Stats module. Would that be acceptable?

Next up:

is sigma(f) really necessary? It seems to be equivalent to map(f) | add (except that add returns null on an empty list)

Let me refer to the proposed sigma here as SIGMA, as I want to separate the naming issue from the inclusion issue.

There were five reasons for including SIGMA(f):

I still think these apply. In other words, I think SIGMA qualifies as a basic building block in @stedolan's sense. (In fact, I didn't include average/1 mainly because SIGMA is the right building block for computing averages, too.)

As for the naming of SIGMA:

Finally, there's count(generator; condition). Again, I included this not so much to avoid one "map" as to avoid constructing an array (see "space efficiency" above). That is, I can simply write (count (.foo == "bar") against an array of objects without having to worry about the size of the array at all. Efficient with space, efficient with time.

In summary, the immediate questions are:

1) Naming of SIGMA

2) Which of correlation/2, correlation/3, SIGMA, and count (if any) can go in builtin.c" (without "Stats::")?

3) Should any of these go in builtin.c with a "Stats::" prefix?

As an avid jq user, I'd like to see at least count and SIGMA (by whatever name) in builtin.c, with the fastest possible implementations possible, it being understood that the implementation will probably vary from release to release, subject to the functional specification remaining unchanged.

Sorry for the long post, but there is an interesting intersection (collision?) of topics here.

pkoppstein commented 10 years ago

@nicowilliams wrote:

1) count/3 should be implemented in terms of the new foreach instead of reduce.

Yes, except I think the best approach is to use both.

Here's what I've come up with:

def count(generator; condition; max):
  reduce (foreach generator as $x
                (0; if . < max then  if ($x | condition) then .+1 else . end
                     else break
                     end;  .) ) as $i ( 0; $i );

A little tricky to read at first, but it gets the job done.

pkoppstein commented 10 years ago

@nicowilliams suggested:

Maybe sigma should have a sigma/2 version

No problem. For anyone who reads

Σ f
g

as something "sigma f over g", SIGMA(f;g) makes sense, and there are filters with generators in second-position, so I don't see any reason not to use that ordering. OK?