josdejong / mathjs

An extensive math library for JavaScript and Node.js
https://mathjs.org
Apache License 2.0

Subsetting a matrix with a dynamic range that returns a single value #205

Closed hakib closed 10 years ago

hakib commented 10 years ago

Hey, I saw you did some work on matrices in 0.26 and thought I'd take the chance to ask a question.

When subsetting a matrix you check if the index is scalar (i.e. should return a single value) and return the unwrapped value.

ones(1,2)[:,1]
1

I've encountered some problems when operating on matrices. When I subset a matrix row, I assume I'm getting an array back. When I subset a matrix column, I expect to get a matrix. This is true when subsetting a matrix with more than one row:

ones(2,2)[:,1]
[[1], [1]]

But as illustrated above, when the matrix has one row I'll get an unwrapped integer. If later on in the code I try to subset the result (as I expect a matrix), I get an error.

I was thinking that maybe the index object could know whether it's based on a dynamic range, and if it is, return false for isScalar. What do you think?

josdejong commented 10 years ago

Good point! Thanks, that will indeed improve consistency.

hakib commented 10 years ago

Sure Jos. Just a little correction to the above - subsetting a single row returns a matrix (as it should) and not an array as I wrote.

m = round(random([2,2]) * 10)
[[3, 2], [5, 1]]
m[1,:] 
[[3, 2]] <-- This is fine
m[:,1]
[[3], [5]]
m[1:1,:]
[[3, 2]]
m[:,1:1]
[[3], [5]]
m[2:,1:1]
5 <-- should be [[5]]

But when the matrix has only one column, subsetting a row will return the unwrapped value.

m = round(random([2,1]) * 10)
[[2], [6]]
m[:,1]
[[2], [6]]
m[1,:]
2 <-- Should be [[2]]
josdejong commented 10 years ago

Yes indeed, we should only squeeze the result when the index contains only scalars, instead of when the result turns out to contain just a scalar.
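For example, under that rule the behavior would be as follows (a minimal illustration in expression syntax, showing the intended result rather than the current one):

ones(2, 2)[1, 1]
1
ones(2, 2)[1:1, 1:1]
[[1]]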

hakib commented 10 years ago

Adding an additional test case for this issue, though I'm not entirely sure it's directly related.

math.version
"0.26.0"

math.eval("m[:,1]",{ m: [[null,1]]}) 
TypeError: Cannot read property 'valueOf' of null

math.eval("m[:,2]",{ m: [[null,1]]}) 
1

math.eval("m[1]",{ m: [null,1]}) 
TypeError: Cannot read property 'valueOf' of null

math.eval("m[2]",{ m: [null,1]}) 
1

math.eval("m[1,1]",{ m: [[null,1]]}) 
TypeError: Cannot read property 'valueOf' of null

math.eval("m[1,1]",{ m: [[math.null,1]]}) 
TypeError: Cannot read property 'valueOf' of null

Looks like it fails when the subset result is null here.

If I pass the data in the expression itself it works fine.

math.eval("m = [[null,1]]; m[:,1]").valueOf()
[null]
josdejong commented 10 years ago

Thanks, this was indeed not related but definitely a bug. I fixed that one already.

hakib commented 10 years ago

Thanks Jos. Why didn't it fail when I defined the data in the expression? I also tried it in the editor on mathjs.org and it worked fine (congratulations on the new responsive site btw).

josdejong commented 10 years ago

It only failed when m was an Array, not when m was a Matrix. By default the expression parser creates matrices (this can be configured though).
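For example (a small sketch; the configuration option is the same one used further down in this thread, and the exact key/values may differ per version):

math.eval('[1, 2; 3, 4]')        // returns a Matrix by default
math.config({matrix: 'array'})   // let the parser produce plain Arrays instead
math.eval('[1, 2; 3, 4]')        // now returns [[1, 2], [3, 4]]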

Thanks, it was quite some work to get all the docs and examples on the website (and there is still some work to do), but I'm really happy with the result!

I'm very close to releasing version 1.0.0 now, trying to get the details of the API consistent, so thanks a lot for your feedback in this regard. If you come across more inconsistencies/oddities please let me know.

One thing where I'm still in doubt is what default value to take when resizing a matrix: right now values are left undefined by default, unless you provide a default value. This is good for performance, but I think in general you would like it to be filled with zeros by default.
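For illustration, a minimal sketch of both options (assuming the signature math.resize(x, size, defaultValue); the fill value used when no default is given has varied between versions):

var a = [1, 2, 3];
math.resize(a, [5]);        // padded entries left undefined (fast, no filling)
math.resize(a, [5], 0);     // [1, 2, 3, 0, 0]     - explicit default value
math.resize(a, [5], null);  // [1, 2, 3, null, null]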

hakib commented 10 years ago

Back to the original issue for a minute. You said:

we should only squeeze the result when the index contains only scalars, instead of when the result turns out to contain just a scalar.

I raised the issue of subsetting a matrix column, but what about rows?

Currently

a = [[1,2],[3,4]]
[[1, 2], [3, 4]]
a[2:,:]
[[3, 4]] <-- Expected
a[2,:]
[[3, 4]] <-- ? 

Following your logic, the index in a[2,:] does not contain only scalars, but the user did request only one row using a scalar index, so they expect an array.

a[2,:]
[3, 4] <-- Expected

What do you think ?

Regarding the default value when resizing, my take on it is that zero is a value, and unless the user explicitly requests it we shouldn't make up values. I think that setting null and allowing the user to change it is the best solution.

If you need any help with the front end, I'm available. I'm a bit afraid to touch your carefully crafted code, but front ends are much harder to mess up ;}.

I have some more issues / requests / ideas pending, so if you're on the verge of a significant release I'll post them soon.

josdejong commented 10 years ago

Thanks, suggestions are very welcome :).

I've been thinking about squeezing a matrix subset to vectors too (besides scalars). I understand the idea, but I don't think this will really work intuitively in practice. I think in practice you mostly work with 2D matrices. After getting a subset from a matrix, you want to do something with it, like adding/subtracting it to some other matrix, multiplying, concatenating, etc. And most of these operations expect the matrices they operate on to have the same number of dimensions (i.e. 2D). So if math.js would often return a 1D matrix instead, you would need to resize it again before being able to use it in your operation. On the other hand, when math.js just returns the subset unsqueezed like it does now, you can easily squeeze it into a vector yourself when needed using the function squeeze.
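For example (a small illustration of that pattern, in expression syntax):

A = [[1, 2], [3, 4]]
A[1, :]
[[1, 2]]
squeeze(A[1, :])
[1, 2]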

As for default values when resizing matrices: right now I leave the values undefined (similar to JS itself, where var a = []; a[3] = 1 results in an Array [undefined, undefined, undefined, 1]). This is very fast as nothing has to be done for the undefined entries. Filling in a default value null may be a neat idea (as long as it remains possible to choose undefined for speed): the user can clearly distinguish "empty" entries, and when I add support for implicit conversion from null to 0 (like JS does), you can immediately do all operations and the null values behave as if they are 0 (i.e. math.eval('A=[]; A[3]=1; B=A+1') would result in A=[null, null, 1] and B=[1, 1, 2]).
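For comparison, the plain JavaScript behavior referred to above:

var a = [];
a[3] = 1;
a;          // [ <3 empty items>, 1 ]  - holes, not zeros
a.length;   // 4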

hakib commented 10 years ago

Regarding the first comment - I believe the logic you defined is kept - i.e., squeeze the result when the index contains only scalars - but applied per dimension. If the user requests a specific row and intends to perform additional computations, it's reasonable to assume they expect an array.
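A rough sketch of that per-dimension idea (a hypothetical helper on nested arrays, not the actual math.js implementation):

function squeezeScalarDims(data, scalarDims, dim) {
  // scalarDims[d] is true when dimension d was addressed with a scalar index
  dim = dim || 0;
  if (!Array.isArray(data)) return data;
  var children = data.map(function (child) {
    return squeezeScalarDims(child, scalarDims, dim + 1);
  });
  // a dimension indexed with a scalar has length 1: unwrap it
  return (scalarDims[dim] && children.length === 1) ? children[0] : children;
}

squeezeScalarDims([[3, 4]], [true, false]);   // [3, 4]    (like a[2, :])
squeezeScalarDims([[3, 4]], [false, false]);  // [[3, 4]]  (like a[2:2, :])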

Regarding the empty values - I personally think that one constant should be used to indicate empty values. From a user's perspective, what is the difference between null and undefined? No difference.

Regarding the second issue, I had a lot of trouble with how strict the expression parser is, especially with empty values. I started out with Angular, and they take a very forgiving approach when evaluating expressions. I also originally come from databases, so I was used to writing things like select count(*), count(*) - count(field) as missing_values, where field is not counted when it's null. Prior to version 0.19, when you implemented aggregate functions, I had my own implementations of sum, mean, etc. where nulls are treated as 0. Once I upgraded I had to add a whole set of functions handling missing values (hence the need for apply / ifElse...), and it made writing expressions a bit more complicated. This might take some work since all the arithmetic operations (add, subtract, ...) should handle empty values. Some aggregation functions will need alteration - mean cannot be implemented as sum(x) / x.length (see the sketch below) - and there will be an impact on performance. But nonetheless, I think it's a great idea and I'm happy to hear you thought about it.
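As an illustration, a null-ignoring mean would have to count the non-null values itself (a sketch, not a math.js function):

function meanIgnoringNull(values) {
  var sum = 0, count = 0;
  values.forEach(function (v) {
    if (v !== null && v !== undefined) {
      sum += v;
      count++;
    }
  });
  return count > 0 ? sum / count : null;
}

meanIgnoringNull([1, 2, 3, null, null, null]); // 2, not 1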

On the same note of compatibility with JS: why not add support for arrays and matrices when evaluating conditions? [] ? true : false is currently not supported. Maybe even expose testCondition as a function isTruthy and enable external (unwrapped) functions to evaluate conditions on custom types (complex, bignumber, ...).

josdejong commented 10 years ago

Sounds like most of your issues boil down to a lack of support for null. Is that correct? I will implement support for null in all functions, so I suppose that will resolve all these issues and the need to write custom versions of functions?

math.js is indeed quite strict in the sense that it only accepts recognized types. JavaScript is very forgiving, sometimes a little bit too forgiving, resulting in quite some nasty pitfalls. A few examples:

'2' + 2 //  returns '22', not 4
[5, 10, 1].sort() // returns [1, 10, 5] 

On the other hand, math.js accepts a lot of data types (number, bignumber, unit, etc.), with more to be added (like support for null :D ). This makes it very easy to work with mixed data types in a safe way. No silent errors; you will get clear error messages if you pass unsupported arguments or combinations of arguments.

As for the conditions: this indeed only supports booleans. Doing a=[]; a ? true : false is a little ambiguous: what do you want to check? Whether a contains something, or whether a is an empty array? It is quite easy to write an explicit check for what you exactly mean: something like either a != null or size(a)==[0] ? true : false.

hakib commented 10 years ago

Yes, a lot of my problems are indeed related to missing values. Once this issue is resolved I will need a lot fewer custom functions. I've added other custom functions such as rank, sort, reverse, filter by boolean index, cumulative, roll/row/col apply, lead/lag and a bunch of lodash functions. One might say I need a data frame and not a matrix ;)

About the condition: you're right, it is ambiguous. It's even more confusing when you look at other languages.

JavaScript

[] ? true : false
true

Python

In [1]: a = []

In [2]: if a:
   ...:     print(True)
   ...: else:
   ...:     print(False)
   ...:
False

R

> ifelse(data.frame())
logical(0)

I can go either way. The current state is logically correct.

josdejong commented 10 years ago

Ok thanks for your feedback. Let's leave the conditional operator as it is right now, if lots of people mark this as a big inconvenience we can rethink it again.

I will implement support for null asap.

hakib commented 10 years ago

Hey Jos, I just went over your two commits. First of all - thanks, this is an amazing job. Second, from the development branch:

math.std([1,1,1])
0
math.std([1,1,1,null,null,null])
0.5477225575051661
math.mean([1,2,3])
2
math.mean([1,2,3,null,null,null])
1

It depends on how you define null. If it's zero then the results above are OK. If null is an empty value then the calculation should ignore them, and the results above are wrong. In the case of mean/std (and many more) it means you can't use size.

josdejong commented 10 years ago

Yes indeed, right now null is handled as if it's 0 (except in case of == and !=).

I'm not sure what users would expect in such a case. At least how it works right now is very intuitive, and I consider null values as edge cases. But maybe we should explore the use cases where you have to deal with null values in data a little bit more.

I think ignoring null values in statistical functions will give harder-to-understand behavior from a user's point of view. (And besides that, we would need to add more overhead to all statistics functions.)

With the current behavior, you can accomplish both scenarios: if you want to count null values as zero, you can just run the functions as is. If you want to ignore them, you could filter them out first (which you may need to do anyway), like

var rawData = [1,2,3,null,null,null];
var filteredData = rawData.filter(function (value) {
  return value !== null;
});
math.mean(rawData);      // returns 1
math.mean(filteredData); // returns 2
hakib commented 10 years ago

What worries me about your example is that you cannot achieve this using mathjs expressions alone (you can obviously calculate it on your own or even make a custom function, but the point is to provide generic tools).

Let's take your recent dilemma for example: a matrix's mean is 10; after resizing it (with the function's default arguments), should the mean change?

josdejong commented 10 years ago

What worries me about your example is that you can not achieve this by using mathjs expressions alone

What's not there in the current version can be implemented in a next version of course, if it offers a generic solution :). Adding, for example, functions sort and filter sounds very generic and powerful.
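A hypothetical sketch of how such a filter could be used from the expression parser (filter did not exist in the parser at the time of this thread):

math.eval('mean(filter(A, f))', {
  A: [1, 2, 3, null, null, null],
  f: function (x) { return x !== null; }
});
// would return 2 instead of 1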

Lets take your recent dilemma for example, a matrix mean is 10, after resizing it (with the function's default arguments) - Should the mean change ?

Well, that would be logical to me, taking a simple definition like mean(A) = sum(A) / size(A).

You make me more and more curious about your use case. As far as I understand it, you have datasets containing empty (null) entries, and you want to do all kinds of statistical analysis on that data, where you want the statistical functions to ignore the empty entries. When dealing with data processing, I've so far never used empty values in data sets; it's more the opposite: before doing any data analysis, I normally preprocess the raw data to take out any corrupt entries (like empty data points), to prevent this kind of issue and other surprises.

josdejong commented 10 years ago

Maybe interesting in this regard to see how other applications deal with this:

http://stackoverflow.com/questions/5480694/numpy-calculate-averages-with-nans-removed
http://stackoverflow.com/questions/12646180/execute-formulas-ignoring-zero-elements
http://mathematica.stackexchange.com/questions/17455/ignore-missing-data-while-doing-a-mean

hakib commented 10 years ago

Hey Jos, I've also looked at the way other languages implement this, and most of them implement some sort of NA removal function prior to processing.

Using R I would write

> m = c(1,2,3,4,NA)
> mean(m)
[1] NA
> mean(na.omit(m))
[1] 2.5

Which I personally think is a pain, because in Oracle (my mass-processing weapon of choice) nulls are ignored gracefully:

with x as (
  select 1 n from dual union all
  select 2 n from dual union all
  select null n from dual
)
select count(*) n_rows,
       count(n) counting_non_null_values,
       avg(n)   avg_of_non_null_values
from   x;

    N_ROWS COUNTING_NON_NULL_VALUES AVG_OF_NON_NULL_VALUES
---------- ------------------------ ----------------------
         3                        2                    1.5

I find the second option much more appealing and straightforward. I also think that it's currently not possible to implement something like na.omit in mathjs, because it doesn't support unbalanced matrices.

The way you implemented it is fine because in most cases (arithmetic operations, except div) no value is equivalent to zero. But when applying statistical functions where the size of the collection is used, the length of the collection is not necessarily the number of actual values.

As I said, I added several utility functions, two of which I found useful for handling most cases: default (equivalent to decode / nvl / coalesce), which accepts a list of values and returns the first that is not null, and default0, which is the same as default but with zero appended to the end of the list. I don't like these functions because they make otherwise simple expressions quite complicated and unreadable (every collection is wrapped with map and a custom function).
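For reference, such a helper could be registered roughly like this (a sketch using math.import; the name coalesce stands in for the commenter's own default helper and is not a math.js built-in):

math.import({
  coalesce: function () {
    // return the first argument that is not null/undefined
    for (var i = 0; i < arguments.length; i++) {
      if (arguments[i] !== null && arguments[i] !== undefined) {
        return arguments[i];
      }
    }
    return null;
  }
});

math.eval('coalesce(null, 2, 3)'); // 2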

I find that the best approach that usually fits all is setting good defaults. The simple user, using mathjs as a client-side engine for processing data, will most likely not have the knowledge nor the willpower to handle these edge cases. The sophisticated user, on the other hand, can always validate their collections before processing the data, hence neither of the above gets hurt.

josdejong commented 10 years ago

I like this solution of R, mean(na.omit(m)). It would be great to have an easy way to apply a "mask" to a matrix and pass the masked/filtered matrix to statistics functions.

hakib commented 10 years ago

Can't say that I feel the same, but nonetheless, here is a complete example:

> matrix(1:9,3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> m = matrix(1:9,3)
> m[1,1] = NA
> m
     [,1] [,2] [,3]
[1,]   NA    4    7
[2,]    2    5    8
[3,]    3    6    9
> na.omit(m)
     [,1] [,2] [,3]
[1,]    2    5    8
[2,]    3    6    9
attr(,"na.action")
[1] 1
attr(,"class")
[1] "omit"
> mean(m)
[1] NA
> mean(na.omit(m))
[1] 5.5

Notice that R removed the entire row. If you don't want this behavior (and I don't believe anyone does), you can easily get into a pickle if you don't support unbalanced matrices:

version
"0.26.0"
m = [[1,2],[3]]
DimensionError: Dimension mismatch (1 != 2)
josdejong commented 10 years ago

Yes indeed, the Matrix type does not support unbalanced matrices, though regular Arrays do:

math.config({matrix: 'array'})

var scope = {};
math.eval('A=[[1,2,3],[4,5]]', scope); // Array [[1,2,3],[4,5]]
math.eval('mean(A)', scope); // 3

It's interesting to think about the necessity for Matrix to check for balanced rows. I think in general it makes sense, as most functions cannot deal with unbalanced matrices and this prevents the need for all of them to validate input size.

I can perfectly understand why R removes the entire row: I think in practice the data on one row is related (say the x, y, z values of an accelerometer). Here the matrix must stay balanced: the column index has a meaning (x, y, z), and the data on a single row represents one measurement. In order to remove invalid entries (containing one or multiple columns with NA), you want to remove the entire row.

How often would you deal with unbalanced matrices in practice? Do you have any use cases for this?

hakib commented 10 years ago

I didn't know about this option {matrix: 'array'}. This might actually solve some other issues for me as well. Am I giving up any functionality by setting this option? (I guess that I won't get a Matrix type as a result any more, but I'm checking for that either way because of some inconsistencies.)

I don't have pure use cases for unbalanced matrices; it's mostly about wanting to use data structures as a data source for the computation, e.g. math.eval('sum(m[1])', { m : [[1,2],[1]] }).

If you decide to ignore nulls in statistical functions, some of the performance concerns might be addressed by using a different algorithm. For example, the mean can be calculated incrementally.
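For example, an incremental (running) mean can skip nulls in a single pass, without needing the collection size up front (a sketch, not a math.js function):

function incrementalMean(values) {
  var mean = 0, count = 0;
  values.forEach(function (v) {
    if (v !== null && v !== undefined) {
      count++;
      mean += (v - mean) / count; // running mean update
    }
  });
  return count > 0 ? mean : null;
}

incrementalMean([1, 2, 3, null, null, null]); // 2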

josdejong commented 10 years ago

Thanks for clarifying your use case, that can make sense. In that regard support for JSON objects (in the expression parser) would be nice too :).

What a Matrix does is wrap an Array and keep it validated as a regular (balanced) multi-dimensional structure, keeping track of its size.

Under the hood, some math.js functions convert a Matrix into an Array before doing their thing, or vice versa (when utility methods of a Matrix are needed, for example).

I think it's best to leave the statistics functions as they are now, not doing any smart or magic things, and aim for making it easy to use filters/masks when needed.

hakib commented 10 years ago

Thanks for the answer. I investigated a bit and it looks like you can't really subset an unbalanced array of arrays.

m = math.create({matrix: 'array'})
m.eval('x[1,:]', { x : [[1,2,3,4],[1,2,3],[1,2],[1]] })
DimensionError: Dimension mismatch (3 != 4)
m.eval('x[1,1]', { x : [[1,2,3,4],[1,2,3],[1,2],[1]] })
DimensionError: Dimension mismatch (3 != 4)
m.eval('x[1]', { x : [[1,2,3,4],[1,2,3],[1,2],[1]] })
DimensionError: Dimension mismatch (3 != 4)

Is that intentional?

josdejong commented 10 years ago

I think a better description would be that it's unintentional if any function does work with unbalanced arrays...

The subset function is one of those functions that convert Array input to a Matrix first, to utilize the functions available there. It may be possible to rewrite it such that it does not need Matrix, but that will require some refactoring.