The problems with automatic simplification

rikardn commented 4 days ago

TL;DR: I think we shouldn't do any automatic simplification.

I have a big question about the automatic simplifications that symengine does. Because in the cases where I am using symengine it would be better to not do any automatic simplifications I started to think and investigate the topic. I found the two very interesting blog posts by @oscarbenjamin at sympy (https://oscarbenjamin.github.io/blog/czi/post1.html) which claims that the ideal symbolic engine would not do any automatic simplifications at all.

My question is: would it be better to let symengine not do any automatic simplifications and instead move this functionality to refine? (I currently believe it would)

The pros of doing this that I can think of (mostly as argued in the blog post):

Any expression can be encoded by symengine. This is useful for reading and printing expressions as they are. I might want to pretty print for example sin(0) as it is. The original expression might be input by an end user and automatic simplification might be confusing in the context of that user.
Should be faster in many cases when the simplification is not needed. A trivial case is if I have * 0. It is unnecessary to simplify the complex expression since it should end up being cancelled out in the end anyway.
If moved to refine we can do better simplifications since we have access to assumptions. In the auto simplifications we have to assume that symbols are complex and that we are simplifying in the complex domain.
The time complexity of some autosimplifications is quite high. This will make creating an expression tree to take a long time at seemingly random places in the code. For example symengine.zeta(100000) goes on for a while on my computer.
There is no good opt-out of the automatic simplification. There is UnevaluatedExpr, but in my experience it doesn't work well, since it needs to be applied to all levels of an expression tree.
Relying on a certain canonical form of an expression in higher level code becomes problematic if the canonicalization is changed. This could be seen as hidden coupling of the symbolic tree code and higher levels of symengine.
symengine could become a better fit to use in a future sympy

Cons that I can think of:

Being able to rely on (stable) canonical expressions can speed some things up further down the line.
We might create trees that will directly be simplified anyway. For example Float(2.0) * Float(14.5) will be created as 3 nodes in the tree without autosimplification, but with autosimplification directly as 1 without having to go through refine. If we think this is important it would be possible to have both constructors with and without autosimplification.

What are your thoughts on the future of automatic simplification in symengine?

oscarbenjamin commented 4 days ago

I think that there are advantages to automatic simplification in many cases but the problem with the design of both SymPy and SymEngine is the fact that there isn't a good way to opt-out of them. There are basically three kinds of symbolic expression implementations that you can do computer algebra with:

Inert expressions (no simplifications whatsoever).
Semi-canonical expressions (like SymPy/SymEngine).
Canonical expressions (SymPy's polys module or python-flint).

The conclusion that I have come to is that semi-canonical expressions are mostly nice for simple interactive use but usually not ideal for library use which applies both to internal library use (e.g. the internals of SymPy) and also downstream library use such as when PyTorch uses SymPy as a library. Also many users will hit up against the limitations imposed by semi-canonical expressions such as when using SymPy for printing support and they really want to print something like $\frac{1}{\sqrt{2}}$ but they can't.

The internals of SymPy for symbolic maths like say the integration code, solve, matrices, polynomials etc should generally use canonical expressions. The internals for other things like code generation should use inert expressions e.g. because you need to distinguish between (x + y) + z and x + (y + z) if generating floating point code.

It can be easier at the start to build computer algebra with semi-canonical expressions but anything built on them eventually starts to fall apart in more complicated situations. Even if you do want to build something complicated out of semi-canonical expressions (like RUBI) then you need an extensible system of evaluation rules (like Mathematica's) rather than the hard-coded evaluation rules implemented in SymEngine and SymPy.

None of this means that SymPy or SymEngine should not provide users with semi-canonical expressions as a default interface though. Much of what people want to use these for is the kind of simple interactive work that semi-canonical expressions are designed for. We just shouldn't assume that we are going to build other things like codegen, solve etc on top of semi-canonical expressions.

It is already possible in SymPy to switch between semi-canonical and canonical expressions e.g.:

In [1]: e = (x + 1)**2

In [2]: e
Out[2]: 
       2
(x + 1) 

In [3]: e.as_poly()
Out[3]: Poly(x**2 + 2*x + 1, x, domain='ZZ')

In [4]: e.as_poly().as_expr()
Out[4]: 
 2          
x  + 2⋅x + 1

What is not provided though is any representation of inert expressions. Here it is a problem that expression heads are represented as classes. You have to use the sin class to represent sin(x) because no other class can represent it. However you can't create the expression sin(x) without triggering the evaluation code in sin.__new__. Alternatives like evaluate=False and UnevaluateExpr don't work properly: you need to start from a foundation of inert expressions if you want to have any control over what is happening.

There is a discussion for what an interface could look like for evaluation control in sympy in https://github.com/sympy/sympy/issues/25620.

rikardn commented 4 days ago

Thanks for your insights @oscarbenjamin. I particularly like the example of being able to print $\frac{1}{\sqrt{2}}$ and the code generation example, because these are problems that I face myself. Since symengine hasn't yet gotten as entagled in the automatic system as sympy I think it should still be possible to rework it without as much effort. It should also be possible to do this one class at a time.

Transition between inert and semi-canonicalized expressions could be done with the refine function.

I don't see where the transition to the more specialized structures, like for polynomials, should happen though. Lets say we would like to evaluate an expression that happens to be a monomial at some value of x. Then we need to detect that it is indeed a monomial and "lower" it to the corresponding flint structure and let flint do the evaluation. The option being that the expression has already been lowered. In any case there needs to be a system that can detect which lowering would be best to use.

oscarbenjamin commented 4 days ago

Transition between inert and semi-canonicalized expressions could be done with the refine function.

I'm not sure what you mean by this. Fundamentally what happens if I use symengine's Python wrapper and I do:

from symengine import sin

e = sin(0)

What would be the interface for getting the expression with/without evaluation?

Are you suggesting that there would be no implicit evaluation and that a user would always have to use refine(sin(0)) to get it to simplify to 0?

oscarbenjamin commented 4 days ago

In any case there needs to be a system that can detect which lowering would be best to use.

See SymPy's polys module and functions like construct_domain, primitive_element, Domain.unify etc. Also the user needs options to control which domain is used. There also needs to be an extensive system of conversions between different types, unification of domains and so on.

I'm not sure what makes most sense for SymEngine because I don't know what canonical representations it has, or what it would make sense to have. The SymEngine Python wrapper only exposes a semi-canonical representation as far as I can tell.

rikardn commented 4 days ago

I guess I was thinking to always make any simplification explicit (In particular on the C++ side):

>>> from symengine import sin, refine

>>> e = sin(0)
>>> e
sin(0)
>>> refine(e)
0

but an option could be to also have (or the opposite way to be backwards compatible)

>>>from symengine.autorefine import sin

>>> sin(0)
0

or for fans of global state

>>> from symengine import sin

>>> sin(0)
sin(0)
>>> symengine.config['autorefine'] = True
>>>  sin(0)
0

isuruf commented 4 days ago

We can't do that for Add, Mul, etc since we keep a dictionary for performance reasons.

For eg: 1 + x + x becomes 1, {x: 2} 2 + y + 3 + 2*y becomes 5, {y: 3}

oscarbenjamin commented 4 days ago

I suspect that SymEngine would need different types or data structures for inert expressions. I think that things would be designed very differently in that case. It can be possible to have both but I expect that mostly new code would need to be written for inert expressions using new types along with options to convert back and forth. The SymPy/SymEngine design is very much organised around the general idea of semi-canonical expressions even if the exact rules of semi-canonicalisation are different.

rikardn commented 4 days ago

That data structure could be used for the refined version and there could be another for the inert version. But I agree that this solution could then become a source of slowdown.

One of the problems with the current situation for Mul is that the order is not always kept and to a casual user it is seemingly random. Perhaps this could be solved still using the current datastructure.

>>> 2*y + x
x + 2*y

>>> 2*y**2 + x
x + 2*y**2

>>> 2*y + x**2
2*y + x**2

isuruf commented 4 days ago

One of the problems with the current situation for Mul is that the order is not always kept and to a casual user it is seemingly random. Perhaps this could be solved still using the current datastructure.

There's no order in the dictionary, but the args are sorted when printing.

certik commented 4 days ago

@rikardn this is a long standing design decision that one has to make when creating a computer algebra system. The choice that you make has performance implications. The performance of the library depends on the use cases that you have.

SymEngine has been designed with the highest possible performance for the use cases of symbolic manipulation that often happens in physics, robotics, mechanics, etc. It's hard to define exactly, but it's those use cases when users expect things to be reasonably simplified, for example that x - x is just 0.

@oscarbenjamin and I did some benchmarks against his prototype, and even innocent looking use cases like (x**10*sin(x)).diff(x, 100) require canceling of terms, and you can't do it in refine later, because it is always slower to first construct "x - x" in memory and then simplify to "0" compared to just constructing "0" in the first place.

Consequently, the design of SymEngine is to do these simplifications by default, and so it uses corresponding data structures for Add and Mul that do not even allow representing "x - x".

If we want to represent arbitrary symbolic expressions including "x - x", then we need to create new data structures for Add / Mul. These classes should live along side the faster versions. We'll have to have both. As far as I know, nobody has figured out a way to use these more general data structures that allow to represent "x - x" and be as fast as SymEngine on the use cases I mentioned above.

@rikardn for your use case, you simply want "x - x" to stay as such, not simplified to 0, unless you ask it? If so, we can design such data structures, that should actually be quite simple to do, since you just take the expressions and represent them in memory, no simplifications. The complicated part of SymEngine is to simplify addition and multiplication, but if you don't need that, it's a lot simpler. However, it becomes very hard once you want to simplify later and keep highest performance: one option is to simply convert term by term to the current SymEngine expression, although that will always be slower than using the current data structures directly.

rikardn commented 4 days ago

Thanks @certik for giving your perspective. If a parallel system would be ok, I think that could work well. Then we could have some function to do lowering from the "inert" system to the current system. It could even be done by refine. If an automatic simplification in the current system can be seen to reduce performance (like my symengine.zeta(100000) example) perhaps we could on a case by case basis decide to move it to refine (or a doit or other function) instead of having it automatic.

Your example of x - x is interesting because simplifying to 0 is not always correct in my mind. In floating point numerics x - x is not always 0. If x = Inf or x = NaN then x - x would be NaN for example. So the non-simplifying system could be useful when reading and optimizing numerical code.

certik commented 4 days ago

@rikardn good ideas. I think we can do that.

@oscarbenjamin in your experience, what are the best data structures to use to represent the symbolic tree "as is"?

@rikardn can you explain or point me to why current symengine is slow for symengine.zeta(100000)? I would like to understand this use case better.

oscarbenjamin commented 4 days ago

@oscarbenjamin in your experience, what are the best data structures to use to represent the symbolic tree "as is"?

It depends what you want to do with the tree once you have it. There is a lot more to this than just having the expression. Fully inert expressions are not useful without any way to manipulate them to do something.

rikardn commented 3 days ago

@certik The zeta function can be calculated exactly for the even positive integers so symengine.zeta(100000) will directly do the calculation. This can take a long time. This might be completely unnecessary in expressions like symengine.zeta(100000)*0 or symengine.zeta(100000)/symengine.zeta(100000) and I think it should be deferred to a later point with an explicit call to refine, doit or similar, strictly for performance reasons.

oscarbenjamin commented 3 days ago

Note that you don't need anything as exotic as the zeta function to run into performance issues with automatic evaluation:

In [3]: import symengine

In [4]: x = symengine.symbols('x')

In [5]: e = x**x**x

In [6]: e
Out[6]: x**(x**x)

In [7]: e.subs(x, 10)
^C^C^C^C^\Quit (core dumped)

There should be some sort of threshold on this. At the same time there should be some way to just create the expression without running any risk of unbounded computation. It might be that what you really want to do is evaluate the expression numerically or perhaps you just want to print it and so on.

In [1]: i = Integer(10)

In [2]: e = Pow(i, Pow(i, i, evaluate=False), evaluate=False)

In [3]: e
Out[3]: 
  ⎛  10⎞
  ⎝10  ⎠
10      

In [4]: e.evalf()
Out[4]: 1.00000000000000e+10000000000

In general though evaluate=False is not a good option not least because other operations with the expression will trigger evaluation. There should be a clear inert expression type.

rikardn commented 3 days ago

Thanks @oscarbenjamin for finding a better example than mine. I am trying to expand a bit on @certik's suggestion to figure out if the systems could live side by side. That would mean that we could both have a class Pow with the current autosimplification and an InertPow that wasn't simplified. Could we expect both of these to be able to live together in the same system?


In [3]: import symengine

In [4]: x = symengine.inertsymbol('x')   # Need different class here to dispatch the correct ** operator

In [5]: e = x**x**x

In [6]: e   # I is now of InertPow type
Out[6]: x**(x**x)

In [7]: e.subs(x, 10)
  ⎛  10⎞
  ⎝10  ⎠
10

If classes from both systems are mixed in the same expression the lower one could take precedence or they could be kept mixed until, e.g. refine.

oscarbenjamin commented 3 days ago

I think that they can live together but designing it so that it is understandable for users needs careful thought. I would probably not allow the two types to be mixed together in the same expression tree because it would get confusing if there is a mixed tree.

symengine / symengine

The problems with automatic simplification #2056