enhancement request: sigma, count, correlation

pkoppstein commented 10 years ago

Three function names are introduced here:

"count" -- given a generator, count the number of times a condition is met, optionally up to some maximum;
"sigma" -- given a filter that extracts values from each entity in the array input, compute the sum of the values;
"correlation" -- compute a correlation coefficient, taking into account null and possibly missing values.

The functions are as follows, where f and g are filters, and na is a distinguished "not available" value. For correlation and sigma, the input is an array; count ignores its input.

correlation(f;g) and correlation(f;g;na)
sigma(f)
count(generator;condition) and count(generator;condition;max)

Given a generator, count how many times the condition is true; condition should use . to refer to a generated value. Example: count( range(0;100); . * (.|exp) < 100)

def count(generator; condition):
  reduce generator as $x (0; if ($x | condition) then .+1 else . end);

Given a generator, count how many times the condition is true but stop testing the condition and incrementing the counter once the counter is equal to max; condition should use . to refer to a generated value. count/3 is useful, for example, if we only need to know whether there are enough values to proceed.

def count(generator; condition; max):
  reduce generator as $x (0; if . < max and ($x | condition) then .+1 else . end);

sigma(f) computes the sum of values extracted from the entities in the input array. Example: sigma( .height * .weight)

def sigma( f ): reduce .[] as $o (0; . + ($o | f )) ;

correlation(f;g) computes the correlation coefficient between two variables, avoiding overflow, and computing N based on non-null pairs; if either f or g references a non-number other than null, then use correlation(f;g;na). Example: correlation( .height; .weight )
def correlation(f; g):
  count( .[]; (f != null) and (g != null) ) as $n
  | (sigma( f ) / $n) as $af
  | (sigma( g ) / $n) as $ag
  | (sigma( if f == null or g == null then 0 else (f - $af) * (g - $ag) end )) as $prod 
  | (sigma( if f == null or g == null then 0 else (f - $af) | . * .     end )) as $ssf
  | (sigma( if f == null or g == null then 0 else (g - $ag) | . * .     end )) as $ssg
  | ( $prod / (($ssf * $ssg) | sqrt ) ) ;
correlation(f;g;na) computes the correlation coefficient between f and g using correlation(f;g) after filtering out non-numbers and the value na. This implementation avoids the overhead of "map(select())" if possible.

def correlation(f; g; na):
  count( .[]; f != na and g != na and "number" == (f|type) and "number" == (g|type) ) 
    as $count
  | if $count == 0 then . 
    else map(select( f != na and g != na and "number" == (f|type) and "number" == (g|type)))
    end
  | correlation(f; g) ;

[2014.08.04: EDITED def correlation]

nicowilliams commented 10 years ago

BTW, I'll get to all your contributions soon... Sorry for the delay! Keep it up!

On Tue, Jul 1, 2014 at 11:57 PM, pkoppstein notifications@github.com wrote:

Three function names are introduced here:

"count" -- given a generator, count the number of times a condition is met, optionally up to some maximum;

"sigma" -- given a filter that extracts values from each entity in the array input, compute the sum of the values;

"correlation" -- compute a correlation coefficient, taking into account null and possibly missing values.

The functions are as follows, where f and g are filters, and na is a distinguished "not available" value. For correlation and sigma, the input is an array; count ignores its input.

correlation(f;g) and correlation(f;g;na)

sigma(f)

count(generator;condition) and count(generator;condition;max)

Given a generator, count how many times the condition is true; condition should use . to refer to a generated value. Example: count( range(0;100); . * (.|exp) < 100)

def count(generator; condition): reduce generator as $x (0; if ($x | condition) then .+1 else . end);

Given a generator, count how many times the condition is true but stop testing the condition and incrementing the counter once the counter is equal to max; condition should use . to refer to a generated value. count/3 is useful, for example, if we only need to know whether there are enough values to proceed.

def count(generator; condition; max): reduce generator as $x (0; if . < max and ($x | condition) then .+1 else . end);

sigma(f) computes the sum of values extracted from the entities in the input array. Example: sigma( .height * .weight)

def sigma( f ): reduce .[] as $o (0; . + ($o | f )) ;

correlation(f;g) computes the correlation coefficient between two variables, avoiding overflow, and computing N based on non-null pairs; if either f or g references a non-number other than null, then use correlation(f;g;na). Example: correlation( .height; .weight )

def correlation(f; g): count( .[]; (f != null) and (g != null) ) as $n | (sigma( f ) / $n) as $af | (sigma( g ) / $n) as $ag | (sigma( (f - $af) * (g - $ag))) as $prod | (sigma( (f - $af) * (f - $af))) as $ssf | (sigma( (g - $ag) * (g - $ag))) as $ssg | ( $prod / (($ssf * $ssg) | sqrt ) ) ;

correlation(f;g;na) computes the correlation coefficient between f and g using correlation(f;g) after filtering out non-numbers and the value na. This implementation avoids the overhead of "map(select())" if possible.

def correlation(f; g; na): count( .[]; f != na and g != na and "number" == (f|type) and "number" == (g|type) ) as $count | if $count == 0 then . else map(select( f != na and g != na and "number" == (f|type) and "number" == (g|type))) end | correlation(f; g) ;

— Reply to this email directly or view it on GitHub https://github.com/stedolan/jq/issues/457.

nicowilliams commented 10 years ago

@pkoppstein Some comments:

0) Can you give me a PR? Please include updatges for the manual and test cases. 1) count/3 should be implemented in terms of the new foreach instead of reduce. This is important: otherwise count/3 will iterate over all of the generator's outputs, even once it's reached the maximum, but with foreach you can terminate iteration over the generator's outputs (see the new limit as well). 2) Maybe sigma should have a sigma/2 version that takes a generator and computes over its outputs instead of iterating over an input array.

Send me a PR with these comments addressed and I'll merge it :)

stedolan commented 10 years ago

@pkoppstein Thanks for pointing out a definite lack in jq's builtins. Basics stats functions are really useful, and we're sorely lacking them at the moment.

I have a bunch of issues with the code you posted. I don't want to put you off, because you've found a problem in jq and proposed a fix and I want you to hang around and keep doing that so that I can continue being lazy :)

So, disclaimers aside, here goes:

Avoiding map(select(...)) is the wrong way to optimise. We should write the builtins in the clearest, simplest way possible. If that's too slow, we'll add an optimiser to the jq engine which speeds up such patterns (for instance, inlining would be pretty easy, and remove the overhead of calling map and select).

The reason to do it this way is because this way speeds up all jq code, not just the builtins. If we write builtins as natural, normal code, then optimise the engine so that the builtins are fast, we end up optimising all natural, normal code, even the stuff we didn't write.
sigma is badly named. I know you were going for the capital-sigma sum sign, but lowercase sigma means "standard deviation" in stats. Speaking of which, a standard deviation function might be useful.
correlation shouldn't have a na option. jq is designed to make it very easy to compose and pipeline functions. Because it's so easy to filter data before passing it to correlation, the extra argument doesn't help much: it's something else to document and understand without actually adding any value.

This sort of thing makes sense in systems where trivial data processing is awkward, so you give your main program extra options to do trivial transforms to the input and output. In jq, trivial processing is easy, so we can make each operation clear and simple and clean because we know that users who need preprocessing can easily do it themselves.

We shouldn't be trying to provide builtins that do exactly what a user wants, because user requirements are strange, finicky, complicated beasts and I want the standard library to be neither strange nor finicky nor complicated. Instead, we should provide simple operations. They might not be the operations the user wants, but they're simple to understand and easy to compose into something that solves the problem.
Along the same lines, is sigma(f) really necessary? It seems to be equivalent to map(f) | add (except that add returns null on an empty list), and I prefer to give users basic building blocks and easy composition, rather than prebuilt complex operations. Although, if map(f) | add turns out to be a very common idiom, then a better name for sigma might be add/1.

nicowilliams commented 10 years ago

Some such builtins might belong best in a wiki page. But remember @stedolan, we now have an import facility and we could just have a builtin module or three into which to place a large variety of builtins that we don't necessarily want in the main module (because we don't want to compile them unnecessarily and because we want to not "pollute" the main documentation).

wtlangford commented 10 years ago

@nicowilliams definitely hits on one of the main benefits to having the module system - We can now add builtins with special purpose without a) polluting the main namespace, b) flooding the docs, and c) compiling less-commonly-used definitions unless needed.

n.b. that doing this would require modifications to the install process (creating a directory for them) as well as an addition to the module system adding that path to the end of the search chain. On that note, if jq gets installed in non-standard locations, we need to be able to react to that in the module system code. Food for thought.

nicowilliams commented 10 years ago

@wtlangford Well, we could have builtin modules that you must import to use but which are compiled into the jq executable. We might ship lots of modules and only a few builtin modules, in which case if you want the non-builtin ones you need to install the whole thing, not just download the executable.

pkoppstein commented 10 years ago

Thanks, @stedolan and @nicowilliams, for your comments. You've raised a wide variety of issues, ranging from technical details to larger questions about how to partition things between builtin.c, a Statistics module, and documentation, and I'd like to discuss some of these issues with you, especially from the perspective of a jq user and statistician.

(Given what they say about teaching statistics, it's no boast to say that I've done so, but more relevantly, perhaps, I work with large datasets (including large JSON datasets) every day -- that's how I came to jq. Let's call this role the "data practitioner's role". Since jq is, as I understand it, largely intended for people working with JSON data, I trust you'll agree this perspective is not out of place.)

First, however, I'd like to make two general points that may be relevant to the "what belongs where" discussion:

We can already define and use "Stats::FUNCTOR" in builtin.c, so the boundary between "builtin" and module is a bit fuzzy (which could turn out to be a Good Thing).
Since it seems as though it will be a while before jq itself includes a package manager, I would be willing to work on a standalone packet manager that provides basic support for fetching jq "modules" hosted on a site such as github. Let's call it jqpm for now :-) Such a packet manager might help by enabling us to provide a "standard jq library" that's not part of the jq executable.

(Of course, if there is an existing package manager that can (or could be tailored) to do the job, so much the better.)

So let's start with nothing.

From @stedolan's comment about correlation/3 (with "na"), it seems that I can assume correlation/2 and it's handling of null are basically fine. Great. So what to do with correlation/3?

I agree that correlation/3's optimization is fairly obvious, and no doubt a data practitioner would expect correlation/3 to handle "na" efficiently (like correlation/2 handles null), so perhaps the thing to do is to put the optimized version of correlation/3 in a Stats module. Would that be acceptable?

Next up:

is sigma(f) really necessary? It seems to be equivalent to map(f) | add (except that add returns null on an empty list)

Let me refer to the proposed sigma here as SIGMA, as I want to separate the naming issue from the inclusion issue.

There were five reasons for including SIGMA(f):

space efficiency: with an array of objects, it's easy to extract the fields of interest (as in the example -- SIGMA( .height * .weight)) without constructing another array, as the obvious alternatives would require.
to avoid map (for now) -- if map/add becomes efficient one day, it would be easy enough to change the internals of SIGMA; by the way, I am more concerned about space than time - I really do have large datasets -- NOW!
familiarity and expressivity -- it corresponds nicely to "SIGMA over i".
it makes the handling of 0-length arrays quite elegant. (Not an "if" in sight :-)
- general utility - e.g. for computing the mean ( add/length doesn't meet the first space efficiency requirement).

I still think these apply. In other words, I think SIGMA qualifies as a basic building block in @stedolan's sense. (In fact, I didn't include average/1 mainly because SIGMA is the right building block for computing averages, too.)

As for the naming of SIGMA:

It seemed to me that add/1 should probably be reserved for other purposes (perhaps adding two arrays elementwise?).
SIGMA is out-of-character, and Sigma conflicts with the principle that I'd like to support, namely that Capitalized words should generally be reserved for module names.
I don't think statisticians will be confused (we usually work with sigmasquared :-)
summation seemed to me a little long, but is OK by me.

Finally, there's count(generator; condition). Again, I included this not so much to avoid one "map" as to avoid constructing an array (see "space efficiency" above). That is, I can simply write (count (.foo == "bar") against an array of objects without having to worry about the size of the array at all. Efficient with space, efficient with time.

In summary, the immediate questions are:

1) Naming of SIGMA

2) Which of correlation/2, correlation/3, SIGMA, and count (if any) can go in builtin.c" (without "Stats::")?

3) Should any of these go in builtin.c with a "Stats::" prefix?

As an avid jq user, I'd like to see at least count and SIGMA (by whatever name) in builtin.c, with the fastest possible implementations possible, it being understood that the implementation will probably vary from release to release, subject to the functional specification remaining unchanged.

Sorry for the long post, but there is an interesting intersection (collision?) of topics here.

pkoppstein commented 10 years ago

@nicowilliams wrote:

1) count/3 should be implemented in terms of the new foreach instead of reduce.

Yes, except I think the best approach is to use both.

Here's what I've come up with:

def count(generator; condition; max):
  reduce (foreach generator as $x
                (0; if . < max then  if ($x | condition) then .+1 else . end
                     else break
                     end;  .) ) as $i ( 0; $i );

A little tricky to read at first, but it gets the job done.

pkoppstein commented 10 years ago

@nicowilliams suggested:

Maybe sigma should have a sigma/2 version

No problem. For anyone who reads

Σ f
g

as something "sigma f over g", SIGMA(f;g) makes sense, and there are filters with generators in second-position, so I don't see any reason not to use that ordering. OK?

jqlang / jq

enhancement request: sigma, count, correlation #457