sympy / sympy

A computer algebra system written in pure Python
https://sympy.org/

Snapshot testing of `simplify` #25610


sylee957 commented 1 year ago

I think there is some controversy about how simplify should be tested in circumstances like

https://github.com/sympy/sympy/pull/25605/files#discussion_r1311918902

where the fix made simplify fail some test cases.

I have held the opinion that the output of simplify should be treated as invariant in its precise 'literal output': you should just copy-paste the actual output into the test, no matter how weakly or strongly it simplifies, for example:

In [3]: simplify(p)
Out[3]: 
⎧1      for x < 2    
⎪                    
⎨2  for x ≤ 3 ∧ x > 1
⎪                    
⎩3      otherwise 

Other approaches, like @XFAIL or commenting tests out, may violate the principles of unit testing, or simply prioritize developer convenience over making the tests more conservative.
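
As a minimal, made-up sketch of what such a literal-output test looks like (the expression here is invented, not the p from the linked PR), the expected value is pasted verbatim from whatever simplify currently returns, rather than chosen as the 'best' form:

from sympy import simplify
from sympy.abc import x

def test_simplify_literal_output():
    result = simplify((x**2 - 1)/(x - 1))
    # pinned exact form copied from the current output; any change,
    # even to an equivalent expression, should surface as a failure
    assert result == x + 1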

oscarbenjamin commented 1 year ago

I like the idea of snapshot testing. Manually updating tests when making significant changes is extremely tedious.

Can you spell out a bit more what it might look like for SymPy to use snapshot testing by showing some example code?

There are many cases other than simplify where different forms of equivalent output can be expected after a change e.g. solve, integrate, many printers etc. Being able to test a change by having all of the tests automatically updated and then reviewing the diff would be much more efficient for many of these.

Ideally the snapshot tests would look like doctests that are automatically updated because reviewing the diff with doctests is usually easiest.

I don't think that snapshot testing is suitable for all things e.g. if the outputs have a clear uniquely defined form then we should test for that directly.

Does snapshot testing also provide a way to record what the actual best output would be though in cases where it is known?

For example in the case of simplify(p) above it is unambiguous that the output shown is not the best possible output. How would you record that with snapshot testing?

Something else that is not well captured by the current test suite is the case of things that are known to fail. Ideally every bug report would be turned into a test that could run under something like snapshot testing so that we would always know in future if the output of the code in question changes. Often a change will fix some things but break some others and it would be more useful to see a selection of things that are fixed alongside things that are broken rather than just seeing which of the currently passing tests happen to fail after a change.

sylee957 commented 1 year ago

I tried syrupy; its usage looks like this:

sympy/matrices/tests/test_eigen.py

from sympy import Matrix

def test_jordan_snapshot(snapshot):
    # `snapshot` is the fixture provided by the syrupy pytest plugin
    H = Matrix([[3, 0, 0, 0], [0, 1, 2, 0], [0, 2, 2, 0], [0, 0, 0, 4]])
    assert H.jordan_form() == snapshot

sympy/matrices/tests/__snapshots__/test_eigen.ambr

# serializer version: 1
# name: test_jordan_snapshot
  tuple(
    Matrix([
  [1, 0,                      0,                      0],
  [0, 0, -2/(-1/2 + sqrt(17)/2), -2/(-sqrt(17)/2 - 1/2)],
  [0, 0,                      1,                      1],
  [0, 1,                      0,                      0]]),
    Matrix([
  [3, 0,                0,                0],
  [0, 4,                0,                0],
  [0, 0, 3/2 - sqrt(17)/2,                0],
  [0, 0,                0, 3/2 + sqrt(17)/2]]),
  )
# ---
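
For context: if I remember syrupy's CLI correctly, the .ambr file above is generated, and later refreshed, by running the tests with pytest --snapshot-update; a plain pytest run then compares the results against the stored snapshots.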

I think there are some other snapshot testing libraries for Python, but this one seems to be maintained at the moment.

sylee957 commented 1 year ago

Does snapshot testing also provide a way to record what the actual best output would be though in cases where it is known? For example in the case of simplify(p) above it is unambiguous that the output shown is not the best possible output. How would you record that with snapshot testing?

Unfortunately, I don't think there is an easy solution for this. Snapshot tests are usually very lightweight: they help users generate test cases for indefinite and complicated outputs (like symbolic computation, images, or API responses) easily, and they are unopinionated about how you use them in tests.

So we can combine them with existing mechanisms like @XFAIL to cover needs like that.

You may also notice that the idea is technically complicated, and there are not many testing idioms that cover it. For example, how do you decide which result is 'best'? You would need some fuzzy metric like a 'score', or a handcrafted example of the best result, or a history of results stored on some server to decide which is best, and these things are not lightweight or free of cost.
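
As a rough illustration of what such a 'score' could be, here is a minimal sketch using SymPy's count_ops, which is one readily available complexity measure (and the one simplify accepts through its measure argument):

from sympy import cos, count_ops, sin, simplify
from sympy.abc import x

expr = sin(x)**2 + cos(x)**2

# count_ops gives a crude complexity score: lower means "simpler"
print(count_ops(expr))            # a handful of operations
print(count_ops(simplify(expr)))  # 0, since simplify returns 1 here

# simplify can be told which score to minimize
print(simplify(expr, measure=count_ops))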

oscarbenjamin commented 1 year ago

how to decide which one is 'best'?

Sometimes this is hard, but sometimes it is not. The current testing process can easily handle the cases where it is not hard, e.g. if the expression should simplify to zero.

sylee957 commented 1 year ago

e.g. if the expression should simplify to zero.

I think we have relied on handcrafted examples for tests like that; to test simplify(expr) == 0, the snapshot for such a test case is simply recorded as zero.

I don't think snapshot tests are any more limited than that, because they just fill in the expected test output for you instead of you writing it yourself.
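
A minimal sketch of both styles side by side (the expression is invented, and the second test assumes syrupy's snapshot fixture as in the earlier example):

from sympy import cos, simplify, sin
from sympy.abc import x

def test_simplify_zero_handcrafted():
    # handcrafted expectation: the answer is required to be exactly 0
    assert simplify(sin(x)**2 + cos(x)**2 - 1) == 0

def test_simplify_zero_snapshot(snapshot):
    # snapshot style: the stored snapshot would simply record 0
    assert simplify(sin(x)**2 + cos(x)**2 - 1) == snapshot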

sylee957 commented 1 year ago

Ideally the snapshot tests would look like doctests

I think they look different, though. Doctests are much more annoying to correct by hand, especially when the expressions are large and when pretty printing is involved.

So I'd rather replace some doctest outputs with an ellipsis for complicated results, once we are confident that they are tested some other way.

oscarbenjamin commented 1 year ago

Doctests were much more annoying to correct by hand, especially if the expressions are large, and especially if the pretty printing is involved.

That is why I want them to be corrected automatically. Note that I am not talking here about the actual doctests in the documentation, but rather about having separate test files for snapshot testing where the tests are all formatted as doctests. Correcting doctests by hand is tedious, but I think that reviewing the diff when doctests are corrected is the clearest way to see what has changed. The idea is that if snapshot testing does the correction then we don't need to worry about that part and can instead focus on reviewing, so we want the format that is best for reviewing.
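
As a rough sketch (these files and this tooling do not exist yet; the examples are invented), such a doctest-formatted snapshot file might contain entries like:

>>> from sympy import exp, integrate, oo, solve
>>> from sympy.abc import x
>>> solve(x**2 - 4, x)
[-2, 2]
>>> integrate(exp(-x**2), (x, -oo, oo))
sqrt(pi)

and the hypothetical snapshot tool would rewrite the output lines in place whenever they change, leaving the reviewer to read the resulting diff.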

asmeurer commented 1 year ago

Not sure about this. This sort of thing tends to lead to do-nothing tests that only test that a function outputs what it outputs, so that any output (other than error) would pass the test. If a function output changes, even to a mathematically equivalent form, it's annoying to update, but it really should be checked carefully.

If you really believe that checking against a diff is easier than checking while updating tests manually, then you can use some tool to update things automatically; such tools already exist. But that doesn't mean we need to integrate them into the testing workflow for everyone. Most people wouldn't check a diff that carefully when it is created for them automatically, so the extra manual step is useful.

We already have too many of this style of test, in my opinion, which is why I want to move more towards tests that actually check the mathematical content of functions.

oscarbenjamin commented 1 year ago

If a function output changes, even to a mathematically equivalent form, it's annoying to update, but it really should be checked carefully.

The idea is to spend your time on that checking rather than on manually typing out the code that updates all of the tests.

This sort of thing tends to lead to do-nothing tests that only test that a function outputs what it outputs, so that any output (other than error) would pass the test.

I would actually envisage using this sort of testing instead of XFAIL tests as well. Then the test can show exactly how the XFAIL test fails:

>>> solve([x**2 - exp(x/10) - 1], [x])
Traceback (most recent call last):
...
NotImplementedError: could not solve x**2 - exp(x/10) - 1

Now if this ever changes you can consider whether or not that change is a good thing. Sometimes a change in one place would cause some things to fail but would make other things work. Our current XFAIL testing does not reveal this easily.