Closed — hakib closed this issue 10 years ago
Good point! Thanks, that will indeed improve consistency.
Sure Jos. Just a little correction to the above: subsetting for a single row returns a matrix (as it should) and not an array as I wrote.
m = round(random([2,2]) * 10)
[[3, 2], [5, 1]]
m[1,:]
[[3, 2]] <-- This is fine
m[:,1]
[[3], [5]]
m[1:1,:]
[[3, 2]]
m[:,1:1]
[[3], [5]]
m[2:,1:1]
5 <-- should be [[5]]
But when the matrix has only one column, subsetting a row returns the unwrapped value.
m = round(random([2,1]) * 10)
[[2], [6]]
m[:,1]
[[2], [6]]
m[1,:]
2 <-- Should be [[2]]
yes indeed, we should only squeeze the result when the index contains only scalars, instead of when the result turns out to contain just a scalar.
Adding an additional test case for this issue, though not entirely sure it's directly related
math.version
"0.26.0"
math.eval("m[:,1]",{ m: [[null,1]]})
TypeError: Cannot read property 'valueOf' of null
math.eval("m[:,2]",{ m: [[null,1]]})
1
math.eval("m[1]",{ m: [null,1]})
TypeError: Cannot read property 'valueOf' of null
math.eval("m[2]",{ m: [null,1]})
1
math.eval("m[1,1]",{ m: [[null,1]]})
TypeError: Cannot read property 'valueOf' of null
math.eval("m[1,1]",{ m: [[math.null,1]]})
TypeError: Cannot read property 'valueOf' of null
Looks like it fails when the subset result is null here.
If I pass the data in the expression itself, it works fine.
math.eval("m = [[null,1]]; m[:,1]").valueOf()
[null]
Thanks, this was indeed not related but definitely a bug. I fixed that one already.
Thanks Jos. Why didn't it fail when I defined the data in the expression? I also tried it in the editor on mathjs.org and it worked fine (congratulations on the new responsive site, btw).
It only failed when m was an Array, not when m was a Matrix. By default the expression parser creates matrices (this can be configured, though).
Thanks, it was quite some work to get all docs and examples on the website (and still some work to do), but I'm really glad with the result!
I'm very close to releasing version 1.0.0 now, trying to get the details of the API consistent, so thanks a lot for your feedback in this regard. If you come across more inconsistencies/oddities please let me know.
One thing where I'm still in doubt is what default value to take when resizing a matrix: right now values are left undefined by default, unless you provide a default value. This is good for performance, but I think in general you would like it to be filled with zeros by default.
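A minimal vanilla-JavaScript sketch of this resize trade-off (the function name and signature here are illustrative, not the mathjs API):

```javascript
// Resize a 1-D array to a new length, padding with a configurable
// default value (undefined when the default is omitted, mirroring
// how plain JS arrays behave when you assign past the end).
function resize(arr, newLength, defaultValue) {
  const out = arr.slice(0, newLength); // truncate if shrinking
  while (out.length < newLength) {
    out.push(defaultValue); // pad if growing
  }
  return out;
}

resize([1, 2], 4, 0); // [1, 2, 0, 0]
resize([1, 2, 3, 4], 2); // [1, 2]
```

Padding with `undefined` costs almost nothing, while padding with zeros (or nulls) touches every new entry, which is the performance concern mentioned above.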
Back to the original issue for a minute. You said:
we should only squeeze the result when the index contains only scalars, instead of when the result turns out to contain just a scalar.
I raised the issue of subsetting a matrix column, but what about rows?
Currently:
a = [[1,2],[3,4]]
[[1, 2], [3, 4]]
a[2:,:]
[[3, 4]] <-- Expected
a[2,:]
[[3, 4]] <-- ?
Following your logic, a[2,:] does not contain only scalars, but the user requested only one row using a scalar index, so he expects an array:
a[2,:]
[3, 4] <-- Expected
What do you think?
Regarding the resize default, my take on it is that zero is a value, and unless the user explicitly requests it we shouldn't make up values. I think that setting null and allowing the user to change it is the best solution.
If you need any help with the front end, I can help. I'm a bit afraid to touch your carefully crafted code, but front ends are much harder to mess up ;}.
I've got some more issues/requests/ideas pending, so if you're on the verge of a significant release I'll post them soon.
Thanks, suggestions are very welcome :).
I've been thinking about squeezing a matrix subset to vectors too (besides scalars). I understand the idea, but I don't think this will really work intuitively in practice.
I think in practice you mostly work with 2D matrices. After getting a subset from a matrix, you want to do something with it, like adding/subtracting to some other matrix, multiplying, concatenating, etc. And most of these operations expect the matrices to operate on to have the same number of dimensions (i.e. 2D). So if math.js would often return a 1D matrix instead, you would need to resize it again before being able to use it for your operation. On the other hand, when math.js just returns the subset unsqueezed like it does now, you can easily squeeze it into a vector yourself when needed using the function squeeze.
As for default values when resizing matrices: right now I leave the values undefined (similar to JS itself, where var a = []; a[3] = 1 results in an Array [undefined, undefined, undefined, 1]). This is very fast as nothing has to be done for the undefined entries. Filling in a default value of null may be a neat idea (as long as it remains possible to choose undefined for speed): the user can clearly distinguish "empty" entries, and when I add support for implicit conversion from null to 0 (like JS does), you can immediately do all operations and the null values behave like they are 0 (i.e. math.eval('A=[]; A[3]=1; B=A+1') would result in A=[null, null, 1] and B=[1, 1, 2]).
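Plain JavaScript already performs this null-to-zero coercion in arithmetic (though notably not in comparisons), which is easy to verify:

```javascript
// null is coerced to 0 in arithmetic, but == does not treat null as 0.
console.log(null + 1);                        // 1
console.log([null, null, 1].map(v => v + 1)); // [ 1, 1, 2 ]
console.log(null == 0);                       // false
```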
Regarding the first comment: I believe the logic you defined is kept, e.g. squeeze the result when the index contains only scalars, but applied per dimension. If the user requests a specific row and intends to perform additional computations, it's reasonable to assume he expects an array.
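The per-dimension rule could be sketched in plain JavaScript like this (an illustration of the idea only, not mathjs code):

```javascript
// Squeeze a 2-D subset per dimension: a dimension is dropped only when
// the index used for it was a scalar, never merely because the result
// happens to contain a single value.
function squeezeSubset(rows, rowIndexIsScalar, colIndexIsScalar) {
  let result = rows;
  if (colIndexIsScalar) result = result.map(row => row[0]); // drop column dim
  if (rowIndexIsScalar) result = result[0];                 // drop row dim
  return result;
}

// a = [[1,2],[3,4]]
squeezeSubset([[3, 4]], true, false);  // a[2,:]  -> [3, 4]
squeezeSubset([[3, 4]], false, false); // a[2:,:] -> [[3, 4]]
squeezeSubset([[4]], true, true);      // a[2,2]  -> 4
```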
Regarding the empty values: I personally think that one constant should be used to indicate empty values. From a user perspective, what is the difference between null and undefined? None.
Regarding the second issue, I had a lot of trouble with how strict the expression parser is, especially with empty values. I started out with Angular, and they take a very forgiving approach when evaluating expressions. I also originally come from databases, so I was used to writing stuff like select count(*), count(*) - count(field) as missing_values, where field is not counted when it's null. Prior to version 0.19, when you implemented aggregate functions, I had my own implementation of sum, mean, etc. where nulls are treated as 0. Once I upgraded I had to add a whole set of functions handling missing values (hence the need for apply / ifElse ...), and it made writing expressions a bit more complicated. This might take some work since all the arithmetic operations (add, sub, ...) should handle empty values. Some aggregation functions will need alteration (mean cannot be implemented as sum(x) / x.length) and there will be an impact on performance. But nonetheless, I think it's a great idea and I'm happy to hear you thought about it.
On the same note of compatibility with JS: why not add support for arrays and matrices when evaluating conditions? [] ? true : false is currently not supported. Maybe even expose testCondition as a function isTruthy and enable external (unwrapped) functions to evaluate conditions on custom types (complex, bignumber, ...).
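A hypothetical sketch of such an isTruthy helper (the name comes from the suggestion above; the semantics chosen here, e.g. Python-style emptiness for arrays, are assumptions, not the mathjs API):

```javascript
// Hypothetical condition helper: booleans pass through, numbers are
// truthy when non-zero, arrays when non-empty; other types are rejected
// to keep the strictness of the current parser.
function isTruthy(x) {
  if (typeof x === 'boolean') return x;
  if (typeof x === 'number') return x !== 0;
  if (Array.isArray(x)) return x.length > 0; // assumed Python-like semantics
  if (x === null || x === undefined) return false;
  throw new TypeError('Unsupported type in condition');
}

isTruthy([]);  // false
isTruthy([1]); // true
isTruthy(0);   // false
```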
Sounds like most of your issues boil down to a lack of support for null. Is that correct? I will implement support for null on all functions, so I suppose that will resolve all these issues and the need to write custom versions of functions?
math.js is indeed quite strict in the sense that it only accepts recognized types. JavaScript is very forgiving, sometimes a little bit too forgiving, resulting in quite some nasty pitfalls. A few examples:
'2' + 2 // returns '22', not 4
[5, 10, 1].sort() // returns [1, 10, 5]
On the other hand, math.js accepts a lot of data types (number, bignumber, unit, etc.), and more are to be added (like support for null :D). This makes it very easy to work with mixed data types in a safe way. No silent errors; you will get clear error messages if you pass unsupported arguments or combinations of arguments.
As for the conditions: this indeed only supports booleans. Doing a=[]; a ? true : false is a little ambiguous: what do you want to check? Whether a contains "something", or whether a is an empty array? It is quite easy to write an equality check to test for what you exactly mean: something like a != null or size(a) == [0] ? true : false.
Yes, a lot of my problems are indeed related to missing values. Once this issue is resolved I would need a lot fewer custom functions. I've added other custom functions such as rank, sort, reverse, filter by boolean index, cumulative, roll/row/col apply, lead/lag, and a bunch of lodash functions. One might say I need a data frame and not a matrix ;)
About the condition: you're right, it is ambiguous. It's even more confusing when you look at other languages.
JavaScript
[] ? true : false
true
python
In [1]: a = []

In [2]: if a:
   ...:     print(True)
   ...: else:
   ...:     print(False)
   ...:
False
R
>> ifelse(data.frame())
logical(0)
I can go either way. The current state is logically correct.
Ok thanks for your feedback. Let's leave the conditional operator as it is right now, if lots of people mark this as a big inconvenience we can rethink it again.
I will implement support for null asap.
Hey Jos, I just went over your two commits. First of all, thanks, this is an amazing job. Second, from the development branch:
math.std([1,1,1])
0
math.std([1,1,1,null,null,null])
0.5477225575051661
math.mean([1,2,3])
2
math.mean([1,2,3,null,null,null])
1
It depends on how you define null. If it's zero then the results above are OK. If null is an empty value then the calculation should ignore them, and the results above are wrong. In the case of mean/std (and many more) it means you can't use size.
Yes indeed, right now null is handled as if it's 0 (except in case of == and !=).
I'm not sure what users would expect in such a case. At least how it works right now is very intuitive, and I consider null values as edge cases. But maybe we should explore the use cases where you have to deal with null values in data a little bit more.
I think ignoring null values in statistical functions will give harder-to-understand behavior from a user point of view. (And besides that, we would need to add more overhead to all statistics functions.)
With the current behavior, you can accomplish both scenarios: if you want to count null values as zero, you can just run the functions as is. If you want to ignore them, you can filter them out first (which you may need to do anyway), like:
var rawData = [1,2,3,null,null,null];
var filteredData = rawData.filter(function (value) {
return value !== null;
});
math.mean(rawData); // returns 1
math.mean(filteredData); // returns 2
What worries me about your example is that you cannot achieve this by using mathjs expressions alone (you can obviously calculate it on your own or even make a custom function, but the point is to provide generic tools).
Let's take your recent dilemma for example: a matrix mean is 10; after resizing it (with the function's default arguments), should the mean change?
What worries me about your example is that you can not achieve this by using mathjs expressions alone
What's not there in the current version can be implemented in a next version of course, if it offers a generic solution :). Adding for example functions sort and filter sounds very generic and powerful.
Lets take your recent dilemma for example, a matrix mean is 10, after resizing it (with the function's default arguments) - Should the mean change ?
Well, that would be logical to me, taking a simple definition like mean(A) = sum(A) / size(A).
You make me more and more curious about your use case. As far as I understand it, you have datasets containing empty (null) entries, and you want to do all kinds of statistical analysis on that data, where you want the statistical functions to ignore the empty entries.
When dealing with data processing, I've so far never used empty values in data sets; it's more the opposite: before doing any data analysis, I normally preprocess the raw data to take out any corrupt entries (like empty data points), to prevent these kinds of issues and other surprises.
Maybe interesting in this regard to see how other applications deal with this:
http://stackoverflow.com/questions/5480694/numpy-calculate-averages-with-nans-removed
http://stackoverflow.com/questions/12646180/execute-formulas-ignoring-zero-elements
http://mathematica.stackexchange.com/questions/17455/ignore-missing-data-while-doing-a-mean
Hey Jos, I've also looked at the way other languages implement this, and most of them implement some sort of NA-removal function prior to processing.
Using R I would write:
> m = c(1,2,3,4,NA)
> mean(m)
[1] NA
> mean(na.omit(m))
[1] 2.5
Which I personally think is a pain, because in Oracle (my mass-processing weapon of choice) nulls are ignored gracefully:
1 with x as (
2 select 1 n from dual union all
3 select 2 n from dual union all
4 select null n from dual
5 )
6 select count(*) n_rows,
7 count(n) counting_non_null_values,
8 avg(n) avg_of_non_null_values
9* from x
SQL> /
N_ROWS COUNTING_NON_NULL_VALUES AVG_OF_NON_NULL_VALUES
---------- ------------------------ ----------------------
3 2 1.5
I find the second option much more appealing and straightforward. I also think that it's currently not possible to implement something such as na.omit in mathjs because it doesn't support unbalanced matrices.
The way you implemented it is fine because in most cases (arithmetic operations except div) no value is equivalent to zero. But when applying statistical functions where the size of the collection is used, the length of the collection is not necessarily the size of the collection.
As I said, I added several utility functions, two of which I found useful for handling most cases: default (equivalent to decode / nvl / coalesce), which accepts a list of values and returns the first that is not none, and default0, which is the same as default but with zero appended to the end of the list. I don't like these functions because they make otherwise simple expressions quite complicated and unreadable (every collection is wrapped with map and a custom function).
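A rough vanilla-JS sketch of those two helpers as I understand them from the description (default is a reserved word in JavaScript, so the function is named def here; the originals are not shown, so the details are assumptions):

```javascript
// Return the first value that is not null/undefined (coalesce-style).
function def(...values) {
  for (const v of values) {
    if (v !== null && v !== undefined) return v;
  }
  return undefined; // nothing usable found
}

// Same, but falls back to zero when all values are missing.
const default0 = (...values) => def(...values, 0);

def(null, undefined, 5); // 5
default0(null);          // 0
```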
I find that the best approach that usually fits all is setting good defaults. The simple user using mathjs as a client-side engine for processing data will most likely not have the knowledge nor the willpower to handle these edge cases. The sophisticated user, on the other hand, can always validate their collections before processing the data, so neither of them gets hurt.
I like this solution of R: mean(na.omit(m)). It would be great to have an easy solution to apply a "mask" to a matrix and pass the masked/filtered matrix to statistics functions.
Can't say that I feel the same, but nonetheless, here is a complete example:
> matrix(1:9,3)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> m = matrix(1:9,3)
> m[1,1] = NA
> m
[,1] [,2] [,3]
[1,] NA 4 7
[2,] 2 5 8
[3,] 3 6 9
> na.omit(m)
[,1] [,2] [,3]
[1,] 2 5 8
[2,] 3 6 9
attr(,"na.action")
[1] 1
attr(,"class")
[1] "omit"
> mean(m)
[1] NA
> mean(na.omit(m))
[1] 5.5
Notice that R removed the entire row. If you don't want this behavior (and I don't believe anyone does), you can easily get into a pickle if you don't support unbalanced matrices:
version
"0.26.0"
m = [[1,2],[3]]
DimensionError: Dimension mismatch (1 != 2)
Yes indeed, the Matrix type does not support unbalanced matrices, though regular Arrays do:
math.config({matrix: 'array'})
var scope = {};
math.eval('A=[[1,2,3],[4,5]]', scope); // Array [[1,2,3],[4,5]]
math.eval('mean(A)', scope); // 3
It's interesting to think about the necessity for Matrix to check for balanced rows. I think in general it makes sense, as most functions cannot deal with unbalanced matrices and this prevents the need for all of them to validate input size.
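The balance check could look roughly like this (a sketch of the idea, not the actual mathjs implementation):

```javascript
// Verify that every row of a 2-D array has the same length, throwing a
// dimension-mismatch error otherwise, so downstream functions can safely
// assume a balanced matrix.
function assertBalanced(rows) {
  const width = rows.length > 0 ? rows[0].length : 0;
  for (const row of rows) {
    if (row.length !== width) {
      throw new Error(`Dimension mismatch (${row.length} != ${width})`);
    }
  }
}

assertBalanced([[1, 2], [3, 4]]);  // ok
// assertBalanced([[1, 2], [3]]); // throws: Dimension mismatch (1 != 2)
```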
I can perfectly understand why R removes the entire row: I think in practice the data on one row is related (say the x, y, z values of an accelerometer). Here the matrix must stay balanced: the column index has a meaning (x, y, z), and the data on a single row represents one measurement. In order to remove invalid entries (containing one or multiple columns with NA), you want to remove the entire row.
How often would you deal with unbalanced matrices in practice? Do you have any use cases for this?
I didn't know about this option {matrix: 'array'}. This might actually solve some other issues for me as well. Am I giving up some functionality by setting this option? (I guess that I won't get a Matrix as a result anymore, but I'm checking that either way because of some inconsistencies.)
I don't have pure use cases for unbalanced matrices; it's mostly wanting to use data structures as a data source for the computation, e.g. math.eval('sum(m[1])', { m : [[1,2],[1]] }).
If you decide to ignore nulls in statistical functions some performance issues might benefit from using a different algorithm. For example, mean can be calculated incrementally.
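For example, a running mean that skips nulls can be computed in one pass, without tracking a full sum and a separate size (a sketch of the idea, not mathjs code):

```javascript
// Incremental mean over non-null entries: the divisor is the count of
// actual values rather than the array length, and the running update
// avoids accumulating a large intermediate sum.
function incrementalMean(values) {
  let mean = 0;
  let count = 0;
  for (const v of values) {
    if (v === null || v === undefined) continue; // ignore missing entries
    count += 1;
    mean += (v - mean) / count;
  }
  return count > 0 ? mean : null;
}

incrementalMean([1, 2, 3, null, null, null]); // 2
```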
Thanks for clarifying your use case, that can make sense. In that regard support for JSON objects (in the expression parser) would be nice too :).
What a Matrix does is:
Under the hood, some math.js functions change a Matrix into an Array before doing their thing, or vice versa (when utility methods of a Matrix are needed, for example).
I think it's best to leave the statistics functions as they are now, not doing any smart or magic things, and aim for making it easy use filters/masks when needed.
Thanks for the answer. I investigated a bit, and it looks like you can't really subset an unbalanced array of arrays:
m = math.create({matrix: 'array'})
m.eval('x[1,:]', { x : [[1,2,3,4],[1,2,3],[1,2],[1]] })
DimensionError: Dimension mismatch (3 != 4)
m.eval('x[1,1]', { x : [[1,2,3,4],[1,2,3],[1,2],[1]] })
DimensionError: Dimension mismatch (3 != 4)
m.eval('x[1]', { x : [[1,2,3,4],[1,2,3],[1,2],[1]] })
DimensionError: Dimension mismatch (3 != 4)
Is that intentional?
I think a better description would be that it's unintentional if any function does work with unbalanced arrays...
The subset function is one of the functions that convert Array input to a Matrix first, to utilize the functions available there. It may be possible to rewrite it such that it does not need Matrix, but that will require some refactoring.
Hey, I saw you did some work on matrices in 0.26 and thought I'd take the chance to ask a question.
When subsetting a matrix, you check if the index is scalar (i.e. should return a single value) and return the unwrapped value (code).
I've encountered some problems when operating on matrices. When I subset a matrix row, I assume I'm getting an array back. When I subset a matrix column, I expect to get a matrix. This is true when subsetting a matrix with more than one row:
But as illustrated above, when the matrix has one row, I'll get an unwrapped integer. If later on in the code I try to subset the result (as I expect a matrix), I get an error.
I was thinking that maybe the index object could know if it's based on a dynamic range, and if it is, return false for isScalar. What do you think?
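The idea could be sketched like this (a hypothetical shape; the real mathjs Index class looks different):

```javascript
// An index that remembers, per dimension, whether it was built from a
// scalar or a range, so isScalar() reflects what the user wrote rather
// than the size of the resulting subset.
class Index2D {
  constructor(rowSpec, colSpec) {
    this.rowIsScalar = typeof rowSpec === 'number';
    this.colIsScalar = typeof colSpec === 'number';
  }
  isScalar() {
    return this.rowIsScalar && this.colIsScalar; // squeeze only fully scalar
  }
}

new Index2D(1, 2).isScalar();   // true  -> m[1,2] unwraps to a value
new Index2D(1, ':').isScalar(); // false -> m[1,:] stays a matrix
```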