josdejong / mathjs

An extensive math library for JavaScript and Node.js
https://mathjs.org
Apache License 2.0
14.38k stars 1.24k forks source link

quantileSeq does not accept individual numeric arguments in lieu of an array; does not document dimension argument #2077

Open SamuelBolduc opened 3 years ago

SamuelBolduc commented 3 years ago

Functions like sum or mean accept either a matrix of values or a numeric value, but quantileSeq does not. Wouldn't it make sense for it to have the same behavior as those other functions?

josdejong commented 3 years ago

That could be interesting. I guess it will conflict though with the probOrN option, so it will require some thinking through.

Can you explain your use case for passing dimensions to quantileSeq?

SamuelBolduc commented 3 years ago

Use case for us is to calculate percentile. Let's say a user wants to get the value 50th percentile value out of the monthly total of transactions over the past 12 months. We generate a formula that's basically quantileSeq(a, 0.5) and we then pass the values as a matrix for a when evaluating.

In some cases (other formulas), we needed this "a" to be a numeric value. Since all other functions we use like sum or mean accept both a numeric value or a matrix of values, we made the assumption it would always work (probably didn't think it through but we got away with it for a long time) so every time we have a matrix of a single value, we convert it to a numeric value.

Of course we could have added logic to our algorithm to only convert a one-item matrix to a number when it's needed, but it was not so simple because of the way we built it. Current workaround is to generate a different formula if the range length selected is only one (percentile of a single value is kind of like a no-op in the end), which fixes it.

We just thought it might make sense in the future to support it, but you're right that there are probably other use-cases that I'm not aware of for this function where it would be problematic

josdejong commented 3 years ago

Thanks for your explanation. It makes sense I think.

There is not necessarily a conflict with the options that quantileSeq already supports, though it is not possible to use the 2nd argument similar to sum and mean: in quantileSeq the second argument is interpreted as math.quantileSeq(A, prob) (see docs). The current syntax is:

math.quantileSeq(A, prob[, sorted])
math.quantileSeq(A, [prob1, prob2, ...][, sorted])
math.quantileSeq(A, N[, sorted])

We could introduce new syntax like:

math.quantileSeq(A, prob[, dimension][, sorted])

It will be a tricky one though, it would be easy to accidentally order prob and dimension wrongly or think the second argument is the dimension. An other idea could be passing an object with all (optional) options, so the options are named:

math.quantileSeq(A, { prob, dimension, sorted })
gwhitney commented 1 year ago

I am confused about what is wanted here. I think the original question just wanted the ability for the "sequence" of which the quantiles are being computed to be allowed (as a trivial case) to be a single number, the way it is allowed to compute mean(5) and get 5 as a trivial case. Currently, for example, quantileSeq(5, 0.5) is an error, rather than being 5. And in that I think the original asker is correct that there is a formal analogy with mean and friends, which would suggest quantileSeq(5, 0.5) be allowed and be equal to quantileSeq([5],0.5), which is 5.

The responses seem to be about allowing a matrix of values and producing a vector/matrix of either row quantile values or column quantile values (in addition to the current behavior, which is to flatten the matrix before computing the quantiles). There's also an argument for this, by virtue of analogy with mean etc.

Both of these are possible changes/enhancements to quantileSeq that mathjs might want to make, but they are distinct.

I think it remains open to decide which of the two should be implemented (possibly both or neither) and then make it so.

dvd101x commented 1 year ago

Recently quantileSeq was updated with the parameter stating the dimension in which to apply the formula. math.quantileSeq(A, prob[, dimension][, sorted]) math.quantileSeq(A, prob[, sorted][, dimension]) that Jos mentioned. The dimension doesn't makes sense for a flat array, so the update doesn't invalidate the request.

Currently there is an issue when not all of the stat functions accept a dimension, so there is another area of opportunity for consistency.

As a reference in numpy or octave there is no way of doing sum(2, 3) it must be sum([2, 3]), the first argument shall be the Array/Matrix. One way to be consistent and avoid this tricky situation is to remove the option to do numbers in all stat functions and ensure that they can be applied in any dimension.

gwhitney commented 1 year ago

Somehow the update that added the dimension snuck through without documentation: there is no mention of it in help(quantileSeq) or in the on-line documentation. So as I see it, two things remain here:

As far as other stat functions that might ideally be changed, I think those should be addressed in separate issue(s). [Note: the optional step was edited to correct argument orders.]

gwhitney commented 1 year ago

(just to clarify: the optional item in my immediately prior post would be a feature enhancement; the mandatory item is a documentation "bug" -- i.e., the current documentation is incomplete. Hence the two new labels.)

dvd101x commented 1 year ago

Thanks Glen, sorry about the wrong syntax.

I'll check about the missing documentation and make a PR.

This was for #1552 which you might find interesting.

gwhitney commented 1 year ago

Great. Seems like that took care of #1552, so closed that. You might want to wait on a PR until there's a decision whether to support the quantileSeq(3,7,2,10,0.5) syntax...

dvd101x commented 1 year ago

Ok, I will wait for a decision.

On the optional item quantileSeq(1, 2, 3, 4, 5, true, 0.25) it's flipping the regular order ( quantileSeq([1, 2, 3, 4, 5], 0.25, true) ) to find the current probOrN?

I'm asking because it could be left as such, due to probOrN being mandatory quantileSeq(1, 2, 3, 4, 5, 0.25, true)

The issue comes when probOrN is an array of numbers from 0 to 1 (an array of non integers greater to 1 would throw an error) quantileSeq([1, 2, 3, 4, 5], [0.25, 0.5, 0.75]). I don't know how could the spread syntax be for cases like that.

gwhitney commented 1 year ago

Oops, now it's my turn to have scrambled the arguments (hence the importance of complete correct documentation ;-) I have edited the optional step to reflect the correct argument order.

That actually makes the argument parsing a little easier. You have to have a signature that accepts a number as the first argument (that distinguishes it already from all the existing signatures) and following that, one or more arguments with a spread operator, each of which can be either a number, bignumber, Array, Matrix, or boolean. Then inside the function, you just check the last argument to see if it's boolean. If it is, interpret it as the sorted flag and pop it off. Now the last argument has to be a number or Array or Matrix, which you interpret as probOrN, and pop that off. Then you check that all the remaining arguments (if any) are numbers or bignumbers (maybe they need to all be the same type out of these two?), unshift that first argument back on, and there's your sequence. So really implementing the "list the numbers" version of quantileSeq is just a few lines of code. That swings me to the "might as well do it" camp, I don't see the downside, but let's wait for Jos to weigh in at this point -- possibly he will see it as "unnecessary clutter".

dvd101x commented 1 year ago

Ok, so the spread would be only for the first argument and not for the rest.

gwhitney commented 1 year ago

Right, that was my intended proposal. The first draft putting the sorted flag before the probOrN argument was just my mixing up the order, not some too-clever attempt to use the boolean as the separator. I don't see any need for the probOrN to also be allowed to be just a list of arguments at the top level -- none of the other stats functions work like that, with two list arguments that are allowed to be spread into the top-level arguments.

josdejong commented 1 year ago

Thanks for all the inputs.

I think I did misunderstand the original request a bit and was focusing on a new option dimension, which is now available (only the documentation is missing).

It will be neat to support spread values like quantileSeq(1, 2, 3, 4, 5...). And technically it is possible to pull the options apart from the tail of the arguments. My biggest concern is that something like quantileSeq(1, 2, 3, 4, 5, 0.25, true) is simply confusing to read for a human. So maybe we should not want that syntax and opt for a more explicit syntax. How about introducing a single object to hold all the options, like:

math.quantileSeq(A, { probability, dimension, sorted })
math.quantileSeq(a, b, c, ..., { probability, dimension, sorted })

And hence only allow spread values in combination with an options object? I think this would improve readability greatly. We could even think about deprecating some of the other syntaxes to simplify usage a bit and remove confusion.

gwhitney commented 1 year ago

I am totally on board with only allowing the spread sequence in the case of the final parameter being an options object, and allowing the options object after a matrix (with no other arguments) as well. I think that should become the explicit goal of this issue.

As far as deprecating/removing existing call patterns, we could start just by saying in the new documentation that the use of the options argument is encouraged as the other call patterns may be deprecated or eliminated in the future.

josdejong commented 1 year ago

👍

As far as deprecating/removing existing call patterns, we could start just by saying in the new documentation that the use of the options argument is encouraged as the other call patterns may be deprecated or eliminated in the future.

That sounds like a good plan!