Open SamuelBolduc opened 3 years ago
That could be interesting. I guess it will conflict though with the probOrN
option, so it will require some thinking through.
Can you explain your use case for passing dimensions
to quantileSeq
?
Use case for us is to calculate percentile. Let's say a user wants to get the value 50th percentile value out of the monthly total of transactions over the past 12 months. We generate a formula that's basically quantileSeq(a, 0.5)
and we then pass the values as a matrix for a when evaluating.
In some cases (other formulas), we needed this "a" to be a numeric value. Since all other functions we use like sum
or mean
accept both a numeric value or a matrix of values, we made the assumption it would always work (probably didn't think it through but we got away with it for a long time) so every time we have a matrix of a single value, we convert it to a numeric value.
Of course we could have added logic to our algorithm to only convert a one-item matrix to a number when it's needed, but it was not so simple because of the way we built it. Current workaround is to generate a different formula if the range length selected is only one (percentile of a single value is kind of like a no-op in the end), which fixes it.
We just thought it might make sense in the future to support it, but you're right that there are probably other use-cases that I'm not aware of for this function where it would be problematic
Thanks for your explanation. It makes sense I think.
There is not necessarily a conflict with the options that quantileSeq
already supports, though it is not possible to use the 2nd argument similar to sum
and mean
: in quantileSeq
the second argument is interpreted as math.quantileSeq(A, prob)
(see docs). The current syntax is:
math.quantileSeq(A, prob[, sorted])
math.quantileSeq(A, [prob1, prob2, ...][, sorted])
math.quantileSeq(A, N[, sorted])
We could introduce new syntax like:
math.quantileSeq(A, prob[, dimension][, sorted])
It will be a tricky one though, it would be easy to accidentally order prob
and dimension
wrongly or think the second argument is the dimension. An other idea could be passing an object with all (optional) options, so the options are named:
math.quantileSeq(A, { prob, dimension, sorted })
I am confused about what is wanted here. I think the original question just wanted the ability for the "sequence" of which the quantiles are being computed to be allowed (as a trivial case) to be a single number, the way it is allowed to compute mean(5)
and get 5 as a trivial case. Currently, for example, quantileSeq(5, 0.5)
is an error, rather than being 5. And in that I think the original asker is correct that there is a formal analogy with mean
and friends, which would suggest quantileSeq(5, 0.5)
be allowed and be equal to quantileSeq([5],0.5)
, which is 5.
The responses seem to be about allowing a matrix of values and producing a vector/matrix of either row quantile values or column quantile values (in addition to the current behavior, which is to flatten the matrix before computing the quantiles). There's also an argument for this, by virtue of analogy with mean
etc.
Both of these are possible changes/enhancements to quantileSeq that mathjs might want to make, but they are distinct.
I think it remains open to decide which of the two should be implemented (possibly both or neither) and then make it so.
Recently quantileSeq
was updated with the parameter stating the dimension in which to apply the formula.
math.quantileSeq(A, prob[, dimension][, sorted])
math.quantileSeq(A, prob[, sorted][, dimension])
that Jos mentioned. The dimension doesn't makes sense for a flat array, so the update doesn't invalidate the request.
Currently there is an issue when not all of the stat functions accept a dimension, so there is another area of opportunity for consistency.
As a reference in numpy or octave there is no way of doing sum(2, 3)
it must be sum([2, 3])
, the first argument shall be the Array/Matrix. One way to be consistent and avoid this tricky situation is to remove the option to do numbers in all stat functions and ensure that they can be applied in any dimension.
Somehow the update that added the dimension snuck through without documentation: there is no mention of it in help(quantileSeq)
or in the on-line documentation. So as I see it, two things remain here:
quantileSeq(3, 7, 5, 0.5)
producing 5 and quantileSeq(1,2,3,4,5,0.25,true)
producing 2 and quantileSeq(1,2,3,4,5,[0.25,0.75],true)
producing [2,4]
, etc. I could go either way on this. On the one hand it will be a bit of a pain with the mixed argument types after an arbitrary number of numbers, and will likely get limited use. But as the original poster points out, allowing these would be consistent with related functions, so I think on balance ideally this change would be made. @josdejong, your thoughts?As far as other stat functions that might ideally be changed, I think those should be addressed in separate issue(s). [Note: the optional step was edited to correct argument orders.]
(just to clarify: the optional item in my immediately prior post would be a feature enhancement; the mandatory item is a documentation "bug" -- i.e., the current documentation is incomplete. Hence the two new labels.)
Thanks Glen, sorry about the wrong syntax.
I'll check about the missing documentation and make a PR.
This was for #1552 which you might find interesting.
Great. Seems like that took care of #1552, so closed that. You might want to wait on a PR until there's a decision whether to support the quantileSeq(3,7,2,10,0.5)
syntax...
Ok, I will wait for a decision.
On the optional item quantileSeq(1, 2, 3, 4, 5, true, 0.25)
it's flipping the regular order ( quantileSeq([1, 2, 3, 4, 5], 0.25, true)
) to find the current probOrN
?
I'm asking because it could be left as such, due to probOrN
being mandatory quantileSeq(1, 2, 3, 4, 5, 0.25, true)
The issue comes when probOrN
is an array of numbers from 0 to 1 (an array of non integers greater to 1 would throw an error) quantileSeq([1, 2, 3, 4, 5], [0.25, 0.5, 0.75])
. I don't know how could the spread syntax be for cases like that.
Oops, now it's my turn to have scrambled the arguments (hence the importance of complete correct documentation ;-) I have edited the optional step to reflect the correct argument order.
That actually makes the argument parsing a little easier. You have to have a signature that accepts a number as the first argument (that distinguishes it already from all the existing signatures) and following that, one or more arguments with a spread operator, each of which can be either a number, bignumber, Array, Matrix, or boolean. Then inside the function, you just check the last argument to see if it's boolean. If it is, interpret it as the sorted flag and pop it off. Now the last argument has to be a number or Array or Matrix, which you interpret as probOrN, and pop that off. Then you check that all the remaining arguments (if any) are numbers or bignumbers (maybe they need to all be the same type out of these two?), unshift that first argument back on, and there's your sequence. So really implementing the "list the numbers" version of quantileSeq is just a few lines of code. That swings me to the "might as well do it" camp, I don't see the downside, but let's wait for Jos to weigh in at this point -- possibly he will see it as "unnecessary clutter".
Ok, so the spread would be only for the first argument and not for the rest.
quantileSeq([1, 2, 3, 4, 5], [0.25, 0.5, 0.75])
quantileSeq(1, 2, 3, 4, 5, [0.25, 0.5, 0.75])
quantileSeq(1, 2, 3, 4, 5, 0.25, 0.5, 0.75)
quantileSeq([1, 2, 3, 4, 5], [0.25, 0.5, 0.75], true)
quantileSeq(1, 2, 3, 4, 5, [0.25, 0.5, 0.75], true)
quantileSeq(1, 2, 3, 4, 5, 0.25, 0.5, 0.75, true)
or quantileSeq(1, 2, 3, 4, 5, true, 0.25, 0.5, 0.75)
Right, that was my intended proposal. The first draft putting the sorted flag before the probOrN argument was just my mixing up the order, not some too-clever attempt to use the boolean as the separator. I don't see any need for the probOrN to also be allowed to be just a list of arguments at the top level -- none of the other stats functions work like that, with two list arguments that are allowed to be spread into the top-level arguments.
Thanks for all the inputs.
I think I did misunderstand the original request a bit and was focusing on a new option dimension
, which is now available (only the documentation is missing).
It will be neat to support spread values like quantileSeq(1, 2, 3, 4, 5...)
. And technically it is possible to pull the options apart from the tail of the arguments. My biggest concern is that something like quantileSeq(1, 2, 3, 4, 5, 0.25, true)
is simply confusing to read for a human. So maybe we should not want that syntax and opt for a more explicit syntax. How about introducing a single object to hold all the options, like:
math.quantileSeq(A, { probability, dimension, sorted })
math.quantileSeq(a, b, c, ..., { probability, dimension, sorted })
And hence only allow spread values in combination with an options object? I think this would improve readability greatly. We could even think about deprecating some of the other syntaxes to simplify usage a bit and remove confusion.
I am totally on board with only allowing the spread sequence in the case of the final parameter being an options object, and allowing the options object after a matrix (with no other arguments) as well. I think that should become the explicit goal of this issue.
As far as deprecating/removing existing call patterns, we could start just by saying in the new documentation that the use of the options argument is encouraged as the other call patterns may be deprecated or eliminated in the future.
👍
As far as deprecating/removing existing call patterns, we could start just by saying in the new documentation that the use of the options argument is encouraged as the other call patterns may be deprecated or eliminated in the future.
That sounds like a good plan!
Functions like
sum
ormean
accept either a matrix of values or a numeric value, butquantileSeq
does not. Wouldn't it make sense for it to have the same behavior as those other functions?