tBuLi commented 6 years ago

Introduction

A very common fitting problem is to find a vector x which solves the matrix equation A x = y. Currently such problems cannot be solved using symfit, unless one explicitly writes out all the components to the system. For example, if A is a 2x2 matrix, then in symfit we have to write:

A_mat = np.array([[2, 3],
                  [4, 1]])
y_dat = np.array([2, 2])

x1, x2 = parameters('x1, x2')
y1, y2 = variables('y1, y2')
a11, a12, a21, a22 = variables('a11, a12, a21, a22')

model = {
    y1: a11 * x1 + a12 * x2,
    y2: a21 * x1 + a22 * x2,
}

fit = Fit(model, y1=y[0], y2=y_dat[1], a11=A_mat[0, 0], ...)

Apart from seeming unnecessarily verbose, this also becomes computationally expensive very quickly. Compare this to the following syntax which uses sympy's beautiful Indexed objects:

A, x, y = symbols('A, x, y', cls=IndexedBase)
i, j = symbols('i, j', cls=Idx)

model = {
    y[i]: A[i, j] * x[j]  # or y[i]: Sum(A[i, j] * x[j], j) for the mathematicians
} 

fit = Fit(model, A=A_mat, y=y_dat)

This syntax is infinitely more readable, and it instructs symfit on how to interpret the data so the arrays we feed to Fit do not have to be dissected, giving a huge performance gain.

The only downside to this beautiful code is that it doesn't work. Firstly, for symfit to be happy the distinction between parameters and variables will have to be reintroduced. Secondly, an IndexedParameterBase and IndexedParameter object have to be added. Thirdly, Variable's will have to inherit from IndexedBase instead of Symbol. With these changes, the example will end up looking like this:

A, y = variables('A, y')
x = symbols('x', cls=IndexedParameterBase)
i, j = symbols('i, j', cls=Idx)

model = {
    y[i]: A[i, j] * x[j]
}

fit = Fit(model, A=A_mat, y=y_dat)

New Features

Changing to this syntax will open symfit up to a whole range of exciting and powerful features.

These are just some of the examples:

Global fitting problems will be a walk in the park. For example, a model with a global intersect but a local slope per condition to be fitted, could be written as

A, y = variables('A, y')
x_loc = symbols('x_loc', cls=IndexedParameterBase)
x_glob = parameters('x_glob')

i, j = symbols('i, j', cls=Idx)

model = {
    y[i]: A[i, j] * x_loc[j] + x_glob
}

fit = Fit(model, A=A_mat, y=y_dat)

Equality constraints could be added to any model in the form of Lagrange-multipliers. For example, suppose we want to do a linear fit y = a * x + b, subject to a + b == 1. Using this new syntax, one would write:
```
x, y = parameters('x, y')
l, = parameters('l') # Lagrange multiplier
i = symbols('i', cls=Idx)

model = {y[i]: a * x[i] + b}
fit = Fit(model, constraints={l: Eq(a + b, 1)})
```
Internally, this can then be wrapped as such:
```
chi2 = Sum(y[i]**2 - (a * x[i] + b)**2, i)
L = chi2 + l * (a + b - 1)
```
The system of equations to be solved is then the jacobian of the Lagrangian L. Issue #148 can therefore not be solved before this is introduced.
Constraint which need data. So far, constraints can only be given on relations between Parameter's, but not with regards to data. Soon one could do something like (artificial example)
```
a, b = parameters('a, b')
x, y = variables('x, y')
i = symbols('i', cls=Idx)

model = {y[i]: a * x[i] + b}
constraints = [
    Eq(a * Sum(x[i], (i, 0, len(xdata)), 1)
]
```
I do not necessarily know use-cases for this, but perhaps you do!

Possible Issues

A point of discussion is wether to write sums explicitly, or to go for the Einstein-summation convention as used above. As a reminder, in the Einstein convention we assume that repeated indices are summed over. Hence, instead of Sum(A[i, j] * x[j], j), we write simply A[i, j] * x[j], where the sum is implicit. However, this leads to problems when taking derivatives. For example, deriving this sum with respect to x[k], one would expect to get A[i, k] after summation. But if no sum is performed, the answer is KroneckerDelta(i, k)*A[i, j].

Despite sympy's documentation saying that summation is implicit with these objects, the answer it returns when deriving A[i, j] * x[j] is KroneckerDelta(i, k)*A[i, j], so clearly no sum has been performed. This might cause problems in our determination of Jacobians, and therefore a sum should be performed at some point.

A solution might be to require the users to write the sum explicitely, or to see if there is some way to infer which indices have been repeated, and to add a sum over those.
Another danger is to infer the range of the indices from the data, which is needed to perform such sums. The danger here is that if we assign this range to the Idx provided by the user, we end up modifying the object they thought they were dealing with. Definitely not gentlemanly.

Possible solutions are to work with a Dummy-copy of the original Idx, or to not asign the range to the Idx objects directly but to give it to Sum directly when needed, e.g. Sum(x[i], (i, 0, m -1)) where m is len(x). However, this would require more bookkeeping on our part.
Should Variable inherit from IndexedBase, or do we need a new type IndexedVariable to keep the destinction between the two types? In my experience so far, if you create an instance of IndexedBase but you don't provide it with Idx, you can still infer its name etc., so at this time I do not see a reason why changing to Variable(IndexedBase) should not be possible without any loss of functionality. See second commend down

tBuLi commented 6 years ago

As an afterthought, it might be nice to support the following syntax for the creation of IndexedVariables and IndexedParameters:

x_ij, y_ij = variables('x[i, j], y[i, j]')
a, b_i = parameters('a, b[i]')

or

x, (i, j), y, (i, j) = variables('x[i, j], y[i, j]')  # Would this tuple unpacking work?
a, b, (i,) = parameters('a, b[i]')
# x_ij = x[i, j], but explicitly using x[i, j] is probably more readable.

This syntax is particularly useful for parameters, since there we will define two distinct types. Regular old Parameter's, and indexed ones.

tBuLi commented 6 years ago

After playing around some more, I have found out that simply making Variable's inheret from IndexedBase is not a viable solution for dealing with ODEModel's, as D(y, t) rightfully throws ValueError: Can't calculate 1st derivative wrt t.

Solutions would be to force everyone to change their code to D(y[i], t[i]), bad idea and less readable, or to just have two types of Variable: the existing one, and a new indexed version.

# Existing objects
class Argument(Expr):
    pass

class Variable(Argument, Symbol):
    pass

# New indexed counterparts
class IndexedArgument(Indexed):
    pass

class IndexedVariableBase(Argument, IndexedBase):
    """Counterpart of `Variable`, e.g. `x`"""

class IndexedVariable(IndexedArgument):
    """e.g. `x[i, j]`"""

The same should happen for Parameters.

Apart from the syntax proposed above, we can also add a keyword argument indexed to the convenience methods:

x, y = variables('x, y', indexed=True)
i, j = indices('i, j')
x_ij = x[i, j]

tBuLi commented 6 years ago

I'm very close to pushing a first branch for review, as all the features that don't involve indexed parameters have now been implemented.

However, the issue now becomes how to deal with index ranges. Because in order to fully benefit from sympy's symbolic capabilities, indices need to know their range. This range can be infered from the data but, as stated earlier, this has to be done explicitly, not implicit. For example, in

x, y = variables('x, y', indexed=True)
i, = indices('i')

model = {y[i]: a * x[i] + b}
fit = Fit(model, x=xdata, y=ydata)

we could have Fit look at the data, and set the range of i based on that. However, that would mean that i.range now has a value, whereas I as the user never set it at all. And as the zen of python dictates: explicit is better than implicit.

In the example above, one can already do the following: i, = indices('i', range=len(xdata)). This is not a bad solution, since it forces the user to think about his shapes, and whether they are sensible.

A convenience method I kinda like, might be to make this work:

x, (i, j), y, (i, j) = variables('x[i, j], y[i, j]', x=xdata.shape, y=ydata.shape)

This would do all the range setting in one line, and would therefore allow to check if the two shapes do indeed match.

Edit: perhaps it is also suffecient to just do

fit = Fit(model, x=xdata, y=ydata, infer_range=True)

such that the user gives permission to change the range internally from the data's shape.

tBuLi commented 5 years ago

Something I've noticed regarding indexed parameters is that this pretty much already works out of the box, since IndexedBase objects accept other symbols as a label. This means we can say e.g.:

x = IndexedBase(Parameter('x', value=[1, 2, 3]))
i = Idx('i', (0, 2))
f = Sum(x[i], i)

where x will be correctly seen as a parameter and will end up in the minimizer as such. I think this syntax could be the way to go about this, as it stays very close to pure sympy and will involve only minimal adaptations of symfit.

Jhsmit commented 5 years ago

Consider me the first beta tester. I need this!

Jhsmit commented 4 years ago

Is it the future yet? Its 2020 we should have hoverboards and Index Variables!

tBuLi / symfit

Index Variables are the future #175

Introduction

New Features

Possible Issues