Some of the tests rely on hard-coded values, like the ones that failed in #127 due to scikit-learn's iris dataset update. They could fail again if, for instance, we used another initialization or another optimizer for some algorithm, even though the algorithm would still be valid. These tests could still be useful as benchmarking tasks, to ensure we keep reaching a good score on some basic tasks, but we should probably rely more on toy examples and on testing properties of the solution rather than hard-coded values, so that the tests pass no matter the initialization or the optimization procedure.
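For illustration, here is a minimal sketch of what such a property-based test could look like, assuming an estimator with a scikit-learn-style `fit(X, y)`, a `transform(X)`, and a `get_mahalanobis_matrix()` accessor (these names are placeholders, not a reference to the actual test suite): check that the learned matrix is symmetric positive semidefinite and that k-NN accuracy on a well-separated toy problem clears a loose threshold, instead of asserting exact values.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score


def check_solution_properties(estimator, random_state=42):
    # Well-separated toy problem: any reasonable metric learner should do well
    # here, regardless of initialization or optimizer.
    X, y = make_blobs(n_samples=100, centers=3, cluster_std=1.0,
                      random_state=random_state)
    estimator.fit(X, y)

    # Property 1: the learned Mahalanobis matrix is symmetric PSD
    # (placeholder accessor, assumed to exist on the estimator).
    M = estimator.get_mahalanobis_matrix()
    assert np.allclose(M, M.T), "matrix should be symmetric"
    assert np.all(np.linalg.eigvalsh(M) >= -1e-10), "matrix should be PSD"

    # Property 2: a loose performance bound on the toy task
    # instead of a hard-coded score.
    X_t = estimator.transform(X)
    score = cross_val_score(KNeighborsClassifier(n_neighbors=3),
                            X_t, y, cv=3).mean()
    assert score > 0.9, "should classify a trivially separable problem well"
```

Exact values could then be kept aside as benchmarks rather than as assertions that break on unrelated upstream changes.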
Yes, I've wanted to fix this issue with the test suite since gh-51. It'll be difficult to get the right test coverage without over-specifying results, but any progress on this is welcome.