Make tests robust & independent

glatterf42 commented 8 months ago

Closes #512.

Analogous to https://github.com/iiasa/message_ix/pull/784, this PR aims at making the ixmp tests less flaky. For now, this is largely done by giving Scenarios and Platforms test-specific names.

This PR reverts the mitigation from #510. If all goes surprisingly well, this won't be a problem. Either way, it should not be a problem in the end if the intended changes are complete.

How to review

Read the diff and note that the CI checks all pass, ideally without manual re-runs.

PR checklist

[x] Continuous integration checks all ✅
[x] Add or expand tests; coverage checks both ✅
~[ ] Add, expand, or update documentation.~ Just CI tests
~[ ] Update release notes.~ Just CI tests

glatterf42 commented 8 months ago

Some notes here:

[x] I've grouped the tutorial tests so that they run on the same worker. This seems like a cheat considering we want to use xdist, but the *_scenario tutorials seem to rely on the * tutorials running first. However, this grouping might not solve the issue of these required tutorials failing in the first place.
[x] I don't understand why test_export_timeseries_data is failing. It almost seems like occasionally, test_mp.export_timeseries_data() doesn't find the data it's supposed to export. Is this test relying on other tests filling the fixture test_mp with data?
[x] We should still search the tests for FLAKY markers and see if we can resolve more of them.

glatterf42 commented 8 months ago

The fixes in test_integration seem to work locally, but I'm not sure they are optimal yet. They seem to hardcode how many rows in the scenario column of the expected dfs need to be changed. I'm not sure we want to do that, but maybe it's okay to expect these particular tests to yield the same results.

glatterf42 commented 8 months ago

On my system and on the windows-py3.12 platform here, there is this test failure in test_base.py:

FAILED ixmp/tests/backend/test_base.py::TestCachingBackend::test_del_ts - AssertionError: assert 0 == 1
E       AssertionError: assert 0 == 1
E        +  where 1 = len({(133743699349232, 'par', 'd'):            i         j  value unit\n0    seattle  new-york    2.5   km\n1    seattle   c...a    1.8   km\n3  san-diego  new-york    2.5   km\n4  san-diego   chicago    1.8   km\n5  san-diego    topeka    1.4   km})
E        +    where {(133743699349232, 'par', 'd'):            i         j  value unit\n0    seattle  new-york    2.5   km\n1    seattle   c...a    1.8   km\n3  san-diego  new-york    2.5   km\n4  san-diego   chicago    1.8   km\n5  san-diego    topeka    1.4   km} = <ixmp.backend.jdbc.JDBCBackend object at 0x79a3a1141520>._cache

ixmp/tests/backend/test_base.py:161: AssertionError

In other words, these lines don't seem to garbage collect the cached value correctly. This seems related to #463 and #504, and the issue is resolved by running s.__del__() for all python versions. @khaeru, if there's no specific reason to use del s, I'd just switch to s.__del__().

My system is running Python 3.12 with pandas 2.2.0 and JPype1 1.5.0.

codecov[bot] commented 8 months ago

Codecov Report

Attention: Patch coverage is 99.15254% with 1 line in your changes missing coverage. Please review.

Project coverage is 98.8%. Comparing base (90cfe46) to head (12464ed). Report is 28 commits behind head on main.

Files with missing lines	Patch %	Lines
ixmp/tests/test_tutorials.py	94.4%	1 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #521 +/- ## ===================================== Coverage 98.8% 98.8% ===================================== Files 44 44 Lines 4813 4826 +13 ===================================== + Hits 4756 4769 +13 Misses 57 57 ``` | [Files with missing lines](https://app.codecov.io/gh/iiasa/ixmp/pull/521?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=iiasa) | Coverage Δ | | |---|---|---| | [ixmp/backend/jdbc.py](https://app.codecov.io/gh/iiasa/ixmp/pull/521?src=pr&el=tree&filepath=ixmp%2Fbackend%2Fjdbc.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=iiasa#diff-aXhtcC9iYWNrZW5kL2pkYmMucHk=) | `97.4% <100.0%> (+<0.1%)` | :arrow_up: | | [ixmp/core/platform.py](https://app.codecov.io/gh/iiasa/ixmp/pull/521?src=pr&el=tree&filepath=ixmp%2Fcore%2Fplatform.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=iiasa#diff-aXhtcC9jb3JlL3BsYXRmb3JtLnB5) | `98.9% <ø> (ø)` | | | [ixmp/testing/data.py](https://app.codecov.io/gh/iiasa/ixmp/pull/521?src=pr&el=tree&filepath=ixmp%2Ftesting%2Fdata.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=iiasa#diff-aXhtcC90ZXN0aW5nL2RhdGEucHk=) | `97.4% <100.0%> (+0.1%)` | :arrow_up: | | [ixmp/tests/backend/test\_base.py](https://app.codecov.io/gh/iiasa/ixmp/pull/521?src=pr&el=tree&filepath=ixmp%2Ftests%2Fbackend%2Ftest_base.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=iiasa#diff-aXhtcC90ZXN0cy9iYWNrZW5kL3Rlc3RfYmFzZS5weQ==) | `98.9% <100.0%> (ø)` | | | [ixmp/tests/backend/test\_jdbc.py](https://app.codecov.io/gh/iiasa/ixmp/pull/521?src=pr&el=tree&filepath=ixmp%2Ftests%2Fbackend%2Ftest_jdbc.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=iiasa#diff-aXhtcC90ZXN0cy9iYWNrZW5kL3Rlc3RfamRiYy5weQ==) | `100.0% <100.0%> (ø)` | | | [ixmp/tests/core/test\_scenario.py](https://app.codecov.io/gh/iiasa/ixmp/pull/521?src=pr&el=tree&filepath=ixmp%2Ftests%2Fcore%2Ftest_scenario.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=iiasa#diff-aXhtcC90ZXN0cy9jb3JlL3Rlc3Rfc2NlbmFyaW8ucHk=) | `100.0% <100.0%> (ø)` | | | [ixmp/tests/core/test\_timeseries.py](https://app.codecov.io/gh/iiasa/ixmp/pull/521?src=pr&el=tree&filepath=ixmp%2Ftests%2Fcore%2Ftest_timeseries.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=iiasa#diff-aXhtcC90ZXN0cy9jb3JlL3Rlc3RfdGltZXNlcmllcy5weQ==) | `100.0% <100.0%> (ø)` | | | [ixmp/tests/report/test\_operator.py](https://app.codecov.io/gh/iiasa/ixmp/pull/521?src=pr&el=tree&filepath=ixmp%2Ftests%2Freport%2Ftest_operator.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=iiasa#diff-aXhtcC90ZXN0cy9yZXBvcnQvdGVzdF9vcGVyYXRvci5weQ==) | `100.0% <100.0%> (ø)` | | | [ixmp/tests/report/test\_reporter.py](https://app.codecov.io/gh/iiasa/ixmp/pull/521?src=pr&el=tree&filepath=ixmp%2Ftests%2Freport%2Ftest_reporter.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=iiasa#diff-aXhtcC90ZXN0cy9yZXBvcnQvdGVzdF9yZXBvcnRlci5weQ==) | `100.0% <100.0%> (ø)` | | | [ixmp/tests/test\_access.py](https://app.codecov.io/gh/iiasa/ixmp/pull/521?src=pr&el=tree&filepath=ixmp%2Ftests%2Ftest_access.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=iiasa#diff-aXhtcC90ZXN0cy90ZXN0X2FjY2Vzcy5weQ==) | `100.0% <100.0%> (ø)` | | | ... and [5 more](https://app.codecov.io/gh/iiasa/ixmp/pull/521?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=iiasa) | | ... and [1 file with indirect coverage changes](https://app.codecov.io/gh/iiasa/ixmp/pull/521/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=iiasa)

khaeru commented 8 months ago

@khaeru, if there's no specific reason to use del s, I'd just switch to s.__del__().

We can answer this by first thinking about the intended behaviour of the subject/tested code:

At the end of a scope where a Scenario object (like s) is used, its reference count should decrease, maybe to zero.
This should result in Python's internal garbage collector choosing to call Scenario.__del__ (https://docs.python.org/3/reference/datamodel.html#object.__del__ —see the "Note:" in the grey box.)
That, in turn will free up memory in the Java/JDBC backend.
In other words, it's not intended that the user should call either del s or s.__del__() manually; rather, whenever they give up all references to the Scenario, the clean-up will happen.

The reason del s appears in the test function is to trigger Python's GC earlier, before the end of the test function, so that the following assertion can be made about the post clean-up state.

The comment points out that in certain cases, the intended behaviour doesn't work (we haven't diagnosed why). So s.__del__() forces the GC to occur and the test to pass. But because it is forced, the test is no longer checking that it will happen autonomously. We might expect that garbage collection does not work at all for a user with that combination of JPype and Python.

So to extend: if s.__del__() were called in the test in every case, then we no longer will test that Python's built-in reference-counting/GC can or will trigger Scenario.__del__—including with the up-to-date Python, JPype, and pandas you mention—and can't claim that this feature is working.

glatterf42 commented 8 months ago

Thanks for the explanation, so I'll revert/drop that commit and try to figure out how del s can call Scenario.__del__ again.

One other note: I've searched for other flaky markers and think I resolved another one, but there's one for test_jvm_warn where I don't know what to do. The original flaky error is described in https://github.com/iiasa/ixmp/issues/489 and it just seems that this test sometimes occurs simultaneously as another test reading a big file, which results in a warning. Maybe we can suppress this specific warning, but I'm not sure what else we could do.

glatterf42 commented 8 months ago

I'm not sure I completely agree with your analysis. The flaky function is called test_del_ts and its docstring is """Test CachingBackend.del_ts().""". However, we don't explicitly call CachingBackend.del_ts() in the test at the moment. del_ts is a guaranteed part of Scenario.__del__, but as the Note box in the docs you linked above points out:

del x doesn’t directly call x.__del__() — the former decrements the reference count for x by one, and the latter is only called when x’s reference count reaches zero.

In particular, the docs also say:

CPython implementation detail: It is possible for a reference cycle to prevent the reference count of an object from going to zero. In this case, the cycle will be later detected and deleted by the cyclic garbage collector. A common cause of reference cycles is when an exception has been caught in a local variable. The frame’s locals then reference the exception, which references its own traceback, which references the locals of all frames caught in the traceback.

Due to the precarious circumstances under which del() methods are invoked, exceptions that occur during their execution are ignored, and a warning is printed to sys.stderr instead.

So the current flaky behaviour we see only reflects exactly that: del s doesn't always call Scenario.__del__. With that, I'm wondering: should it? And do we want to test it here? Especially to the latter, the answer seems to be "No" to me. With this test, according to its name and docstring, we only want to ensure that CachingBackend.del_ts works as intended, which it does (locally at least, the flakiness is resolved by always calling backend.del_ts(s) explicitly).

I realize that this theoretically leaves part of our code untested, the part where we call Scenario.__del__. In practice, however, I expect this finalizer to be called multiple times, e.g. as part of this very function (just for a limited number of python versions).

In addition, the note that "all exceptions during the execution of del() methods are ignored" makes me wonder if we even need the try-clause in our Scenario.__del__. Apparently, these exceptions would only result in warnings anyway. Or was it a conscious choice to suppress these warnings?

glatterf42 commented 8 months ago

I'm not sure why build couldn't be installed for macos-py3.8, but I don't think that's on us. The error in windows-py3.12 is indeed still flaky and appears to be almost exactly the same as the one above: the test calls del s for a number of Scenarios, but maybe calling del_ts explicitly would be more accurate.

glatterf42 commented 8 months ago

The tests seem to be doing okay this time around, but I'm not sure we want to keep the solution. In essence, the problem was that the JDBCBackend has a .jindex attribute, which is a WeakKeyDictionary. This means that keys will be discarded automatically once no strong reference to the keys exists at runtime. But when trying to simply call backend.del_ts() in the test_jdbc::test_del_ts, this didn't always work. My guess is that just as before, a strong reference to some key could be stored as part of a reference cycle. That way, del s will not lead to s.__del__() being called (which would call backend.del_ts()), but calling backend.del_ts() is still not enough since we rely on all strong references vanishing to adapt backend.jindex and we check that len(backend.jindex) changes in our tests. So I've now adapted backend.del_ts to remove the ts from jindex after garbage collection to make sure it vanishes, even if a strong reference still exists in some reference cycle. However, this would of course remove ts from jindex irrespective of the reason a strong reference still exists.

khaeru commented 8 months ago

However, this would of course remove ts from jindex irrespective of the reason a strong reference still exists.

Hm—in theory, backend.delts is not going to get called unless (a) there are no strong references anywhere, so TimeSeries.\_del__ is triggered, or (b) the user calls it explicitly.

(But since .jindex is WeakKeyDictionary, the presence (or absence) of ts as a key in that should not be making any difference, right? Or maybe I have not fully thought through the implications of keys vs. values vs. both being weak in such collections. If the key has some dangling strong ref, then that prevents its removal from the dict and thus the value (the JPype/Java/ixmp_source object) also continues to sit there.)

In any case I think this is a safe change to make. Unless the user does (b), this is not going to throw away objects they are still using. I would put the added line in the reverse order, though:

self.jindex.pop(ts, None)
self.gc()

khaeru commented 8 months ago

Working on #524, I noticed that the Jupyter notebook tests of macos-latest-py3.12 failed several times before eventually passing. They also failed on the first run on main after merging the PR.

glatterf42 commented 2 months ago

This PR was superseded (and the work started here completed) in #543.

iiasa / ixmp