Adjust for JPype1 v1.4.1

With the release of JPype1 version 1.4.1, CI jobs have started to fail, e.g. here, with messages like:

=================================== FAILURES ===================================
________________________ TestCachingBackend.test_del_ts ________________________

self = <ixmp.tests.backend.test_base.TestCachingBackend object at 0x7f06715dc0d0>
test_mp = <ixmp.core.platform.Platform object at 0x7f05ffaf7cd0>

    def test_del_ts(self, test_mp):
        """Test CachingBackend.del_ts()."""
        # Since CachingBackend is an abstract class, test it via JDBCBackend
        backend = test_mp._backend
        cache_size_pre = len(backend._cache)

        # Load data, thereby adding to the cache
        s = make_dantzig(test_mp)
        s.par("d")

        # Cache size has increased
        assert cache_size_pre + 1 == len(backend._cache)

        # Delete the object; associated cache is freed
        del s

        # Objects were invalidated/removed from cache
>       assert cache_size_pre == len(backend._cache)
E       AssertionError: assert 0 == 1
E        +  where 1 = len({(139663741246368, 'par', 'd'):            i         j  value unit\n0    seattle  new-york    2.5   km\n1    seattle   chicago    1.7   km\n2    seattle    topeka    1.8   km\n3  san-diego  new-york    2.5   km\n4  san-diego   chicago    1.8   km\n5  san-diego    topeka    1.4   km})
E        +    where {(139663741246368, 'par', 'd'):            i         j  value unit\n0    seattle  new-york    2.5   km\n1    seattle   chicago    1.7   km\n2    seattle    topeka    1.8   km\n3  san-diego  new-york    2.5   km\n4  san-diego   chicago    1.8   km\n5  san-diego    topeka    1.4   km} = <ixmp.backend.jdbc.JDBCBackend object at 0x7f05ffaf6890>._cache

ixmp/tests/backend/test_base.py:148: AssertionError
_________________________________ test_del_ts __________________________________

    def test_del_ts():
        mp = ixmp.Platform(
            backend="jdbc",
            driver="hsqldb",
            url="jdbc:hsqldb:mem:test_del_ts",
        )

        # Number of Java objects referenced by the JDBCBackend
        N_obj = len(mp._backend.jindex)

        # Create a list of some Scenario objects
        N = 8
        scenarios = [make_dantzig(mp)]
        for i in range(1, N):
            scenarios.append(scenarios[0].clone(scenario=f"clone {i}"))

        # Number of referenced objects has increased by 8
        assert len(mp._backend.jindex) == N_obj + N

        # Pop and free the objects
        for i in range(N):
            s = scenarios.pop(0)

            message = "\n".join(map(str, gc.get_referrers(s)))

            # The variable 's' is the only reference to this Scenario object
>           assert 1 == getrefcount(s) - 1, (message, gc.garbage)
E           AssertionError: ("<frame at 0x7f05ffa8e700, file '/home/runner/work/ixmp/ixmp/ixmp/backend/jdbc.py', line 694, code check_out>
E             <frame...
E             <frame at 0x7f05ffa14b80, file '/home/runner/work/ixmp/ixmp/ixmp/testing/data.py', line 215, code make_dantzig>", [])
E           assert 1 == (19 - 1)
E            +  where 19 = getrefcount(<ixmp.core.scenario.Scenario object at 0x7f05ffaabb20>)

ixmp/tests/backend/test_jdbc.py:338: AssertionError

Essentially, some change in this JPype release has caused the Python garbage collection mechanisms, on which ixmp.backend.CachingBackend relies, to misfire.
- At the end of various method calls, function-local references to the Scenario/TimeSeries object are not released/GC'd, so that when (for example) the del s line is reached, these persisting references / non-zero refcount prevent the object itself, and associated cached item values, from being garbage-collected.
The issue did not occur with JPype 1.4.0.
See the diff between JPype1 1.4.0 and 1.4.1. The changes in the file pyjp_value.cpp appear relevant.
In #458 (commits) I encountered similar test failures and made changes such as 0db14db75928cf64ae66c2b5be4079ada5f5dc8d to adjust.
- Those errors began to appear with the release of pandas 1.5.0, but did not occur with pandas 1.4.4; in both cases with JPype1 1.4.0. See here.
- In other words, a change in pandas with no change in JPype1 caused the errors to begin to appear.
- From this, I guess this may have something to do with C/C++-level handling of memory and references; the way these are overridden or adjusted by C/C++ code in pandas and in JPype; and perhaps interactions of the two.

To mitigate

Avoid using ixmp with JPype1 1.4.1 if the CachingBackend behaviour is needed.
[x] Adjust test suite to exclude this version → #464

To fix

[ ] Inquire upstream with JPype; maybe produce a minimum working example to reproduce; maybe wait to see if others file about a similar root issue and a fix is issued.
[ ] Make any necessary adjustments in ixmp per se; then remove the above exclusion.

There was no intended change of behavior in JPype. But we had to change mechanisms. If the new mechanism is triggering a referencing problem, please try to replicate.

Totally understood, and thank you for the comment.

As the description says, we can try to condense a MWE from our somewhat complicated code base—only, no bandwidth to do this right now. This issue is just to put a pin in the matter until we manage to find time.

Can you give me a rough description of your test?

For example, something like this:

1) Create a Java object instance 2) Check the ref count is 1 3) Set up a reference loop 4) Delete the original reference 5) Call garbage collector 6) Check if hash of item appears in GC list

(which is incidentally one of the patterns that I check).

Most important details....

What is the type of objects that was created (class, object, primitive, proxy, string, exception)?
How was it structurally in relationships to others? (we created a class, then an object, then a proxy, then placed the proxy in a container, etc)
How was it determined if the object was leaking?
Does the leak checker fail on one version or multiple? (We were forced to change patterns for Py 3.11, but the older ones could use a different pattern?
Last, was an exception part of the process? (We changed both the value creation and the exception frame generator. Both are suspects as the gc could be triggering on a Frame creation rather than a JPype object create.)

We have built in leak checkers for objects, strings, and primitives (brute force create a lot and make sure they go away), but that can miss structural defects in which the order, timing, or connections of an object causes an issue.

With the release of JPype 1.5.0, this has become relevant again since all our workflows on Pyhon < 3.11 are breaking. This is because the temporary workaround specifies pip install "JPype1 != 1.4.1", which is also satisfied by installing 1.5.0. Interestingly, 1.5.0 also works for Python 3.11. For the sake of completeness, there are also other other minor version changes from the last working runs to the first failed ones: ipykernel went from 6.27.1 to 6.28.0, jupyter-core from 5.5.1 to 5.6.0, and jupyter-server-terminals from 0.5.0 to 0.5.1. However, they all seem unrelated, especially seeing that the error messages are exactly the same again as described above here. So a path to mitigation would be clear: instead of temporarily enforcing JPype1 != 1.4.1, enforce JPype1 < 1.4.1 or even JPype1 == 1.4.0, both of which would not automatically test if potential newer versions than 1.5.0 might resolve this issue. So a proper fix would be desirable and since you've taken an interest before, @Thrameos, maybe you could take another look? I'm not familiar with Java, unfortunately, but I'll do my best to break down our test to answer your questions. Any help is much appreciated :)

TL;DR: this has gotten and taken much longer than anticipated. For my money, the issue can likely be explained by understanding the changes you made for Python 3.11. Jpype1 1.5.0 (and 1.4.1, for that matter) work like expected for Python 3.11, it's only failing on older versions on our side.

The first failing test is located in ixmp/tests/backend/test_base.py:

    def test_del_ts(self, test_mp):
        """Test CachingBackend.del_ts()."""
        # Since CachingBackend is an abstract class, test it via JDBCBackend
        backend = test_mp._backend
        cache_size_pre = len(backend._cache)

        # Load data, thereby adding to the cache
        s = make_dantzig(test_mp)
        s.par("d")

        # Cache size has increased
        assert cache_size_pre + 1 == len(backend._cache)

        # Delete the object; associated cache is freed
        del s

        # Objects were invalidated/removed from cache
        assert cache_size_pre == len(backend._cache)

self is an instance of the test class and should not matter, but test_mp is an ixmp.core.platform.Platform object. This is a python class in our modelling framework that connects models to a data-storing backend. The backend, in turn, is defined in the test_mp fixture as an instance of the JDBC backend. This is another python class (inheriting from CachingBackend) that uses JPype/JDBC to connect to an Oracle or HyperSQL database. In our test case, if that's relevant, we are using a local HyperSQL database. backend._cache is just a dictionary, then. Next, the function make_dantzig creates a scenario containing the data of the Dantzig transportation problem. The line s.par("d") accesses the data of the distance parameter d, which looks (in python) something like this:

i            j           value unit
seattle      new-york    2.5   km
seattle      chicago     1.7   km
seattle      topeka      1.8   km
san-diego    new-york    2.5   km
san-diego    chicago     1.8   km
san-diego    topeka      1.4   km

(Just noticing: these distances are supposed to represent thousands of miles rather than km, so our units are wrong here, but that hardly matters for the error, I think.) On the python side, this is stored in the backend._cache dict such that one value is added, which is confirmed by the first successful assert line. Then, we are deleting this whole scenario s again and expect this action to free the backend._cache, too, which is not happening for JPype1 >= 1.4.1. The del function has been overwritten for the TimeSeries class, which the Scenario class inherits from, so del s will call the JDBCBackend del_ts function, which first calls the general CachingBackend.del_ts(), which just calls its own cache_invalidate(). This, finally, should remove all values from the backend._cache dict (but might return None if some key could not be found). Back in JDBCBackend.del_ts(), the last step is to call JDBCBackend.gc(), which shows some explicit mention of jpype:

    def gc(cls):
        if _GC_AGGRESSIVE:
            # log.debug('Collect garbage')
            try:
                java.System.gc()
            except jpype.JVMNotRunning:
                pass
            gc.collect()

Lastly, the overwriting of the del function only calls this whole thing as part of a try clause, excepting (AttributeError, ReferenceError) and moving on since we assume the object has already been garbage-collected. So if one of these is raised without backend._cache being emptied, this would probably result in exactly the error we are seeing. (Not sure this can interfere here, but platform.__getattr__ has been overwritten and could raise an AttributeError.)

On the java side, I can't tell you much about what's happening, I'm afraid. When the Scenario object is initialized, it also initializes the JDBCBackend, which does things like start_jvm(), and

        try:
            self.jobj = java.Platform("Python", properties)
        except java.NoClassDefFoundError as e:  # pragma: no cover
            raise NameError(
                f"{e}\nCheck that dependencies of ixmp.jar are "
                f"included in {Path(__file__).parents[2] / 'lib'}"
            )
        except jpype.JException as e:  # pragma: no cover
            # Handle Java exceptions
            jclass = e.__class__.__name__
            if jclass.endswith("HikariPool.PoolInitializationException"):
                redacted = copy(kwargs)
                redacted.update(user="(HIDDEN)", password="(HIDDEN)")
                msg = f"unable to connect to database:\n{repr(redacted)}"
            elif jclass.endswith("FlywayException"):
                msg = "when initializing database:"
                if "applied migration" in e.args[0]:
                    msg += (
                        "\n\nThe schema of the database does not match the schema of "
                        "this version of ixmp. To resolve, either install the version "
                        "of ixmp used to create the database, or delete it and retry."
                    )
            else:
                _raise_jexception(e)
            raise RuntimeError(f"{msg}\n(Java: {jclass})")

, so it is checking for some exceptions. The next step, then, finally gets deep into java territory: s.par("d") calls s._backend("item_get_elements", "par", name, filters) (with name="d" and filters=None), which calls s.platform._backend(s, method, *args, **kwargs) (with method="item_get_elements" and the rest passed along as *args). s.platform._backend is an instance of (through inheritance) the abstract Backend class, which does specify a __call__(self, obj, method, *args, **kwargs) function to return getattr(self, method)(obj, *args, **kwargs). With self=JDBCBackend, this means the function in question is this:

# type="par", name="d"
    def item_get_elements(self, s, type, name, filters=None):  # noqa: C901
        if filters:
            # Convert filter elements to strings
            filters = {dim: as_str_list(ele) for dim, ele in filters.items()}

        try:
            # Retrieve the cached value with this exact set of filters
            return self.cache_get(s, type, name, filters)
        except KeyError:
            pass  # Cache miss

        try:
            # Retrieve a cached, unfiltered value of the same item
            unfiltered = self.cache_get(s, type, name, None)
        except KeyError:
            pass  # Cache miss
        else:
            # Success; filter and return
            return filtered(unfiltered, filters)

        # Failed to load item from cache

        # Retrieve the item
        item = self._get_item(s, type, name, load=True)
        idx_names = list(item.getIdxNames())
        idx_sets = list(item.getIdxSets())

        # Get list of elements, using filters if provided
        if filters is not None:
            jFilter = java.HashMap()

            for idx_name, values in filters.items():
                # Retrieve the elements of the index set as a list
                idx_set = idx_sets[idx_names.index(idx_name)]
                elements = self.item_get_elements(s, "set", idx_set).tolist()

                # Filter for only included values and store
                filtered_elements = filter(lambda e: e in values, elements)
                jFilter.put(idx_name, to_jlist(filtered_elements))

            jList = item.getElements(jFilter)
        else:
            jList = item.getElements()

        if item.getDim() > 0:
            # Mapping set or multi-dimensional equation, parameter, or variable
            columns = copy(idx_names)

            # Prepare dtypes for index columns
            dtypes = {}
            for idx_name, idx_set in zip(columns, idx_sets):
                # NB using categoricals could be more memory-efficient, but requires
                #    adjustment of tests/documentation. See
                #    https://github.com/iiasa/ixmp/issues/228
                # dtypes[idx_name] = CategoricalDtype(
                #     self.item_get_elements(s, 'set', idx_set))
                dtypes[idx_name] = str

            # Prepare dtypes for additional columns
            if type == "par":
                columns.extend(["value", "unit"])
                dtypes.update(value=float, unit=str)
                # Same as above
                # dtypes['unit'] = CategoricalDtype(self.jobj.getUnitList())
            elif type in ("equ", "var"):
                columns.extend(["lvl", "mrg"])
                dtypes.update(lvl=float, mrg=float)

            # Copy vectors from Java into pd.Series to form DataFrame columns
            columns = []

            def _get(method, name, *args):
                columns.append(
                    pd.Series(
                        # NB [:] causes JPype to use a faster code path
                        getattr(item, f"get{method}")(*args, jList)[:],
                        dtype=dtypes[name],
                        name=name,
                    )
                )

            # Index columns
            for i, idx_name in enumerate(idx_names):
                _get("Col", idx_name, i)

            # Data columns
            if type == "par":
                _get("Values", "value")
                _get("Units", "unit")
            elif type in ("equ", "var"):
                _get("Levels", "lvl")
                _get("Marginals", "mrg")

            result = pd.concat(columns, axis=1, copy=False)
        elif type == "set":
            # Index sets
            # dtype=object is to silence a warning in pandas 1.0
            result = pd.Series(item.getCol(0, jList)[:], dtype=object)
        elif type == "par":
            # Scalar parameter
            result = dict(
                value=float(item.getScalarValue().floatValue()),
                unit=str(item.getScalarUnit()),
            )
        elif type in ("equ", "var"):
            # Scalar equation or variable
            result = dict(
                lvl=float(item.getScalarLevel().floatValue()),
                mrg=float(item.getScalarMarginal().floatValue()),
            )

        # Store cache
        self.cache(s, type, name, filters, result)

        return result

From the ixmp java source code (which is private for some reason, I believe), I would say that Parameter is a class: public class Parameter extends Item {, but I can't really tell you without reading up much more on java if we set up a reference loop or some such, sorry. In the test, we then call the del construction as described above, which does not mention java or jpype a lot.

@glatterf42 thanks for digging in—I hope that was instructive!

The fault probably lies in code that I originally added in #213, several years ago. The basic purpose of this code is to aggressively free up references to Python objects (like Scenario, TimeSeries, etc.) so that they and the associated ixmp_source Java objects (via JPype) can be garbage-collected, freeing memory. This was deemed necessary for larger applications, like running many instances of the MESSAGEix-GLOBIOM global model in older workflows (using the scenario_runner/runscript_main pattern).

I will open a PR to remove the workaround for the current issue and adjust/update the tests and/or caching/memory-management code. Then we can discuss whatever the solution turns out to be.

I spent some further time investigating this, including using refcycle. After del s here: https://github.com/iiasa/ixmp/blob/d92cfcdb5b0156ecf769a2920ec644c58e520755/ixmp/tests/backend/test_base.py#L148-L149 …I inserted code like:

import gc
gc.collect()
gc.collect()
gc.collect()

import sys
import traceback

import refcycle

import ixmp

snapshot = refcycle.snapshot()
scenarios = snapshot.find_by(lambda obj: isinstance(obj, ixmp.Scenario))
s2 = scenarios[0]
print(f"{s2 = }")
snapshot.ancestors(s2).export_image("refcycle-main.svg")

tbs = list(
    snapshot.ancestors(s2).find_by(
        lambda obj: type(obj).__name__ == "traceback"
    )
)
print(f"{tbs = }")
del snapshot
gc.collect()

for i, obj in enumerate(tbs):
    print(f"{i} {obj}")
    print(f"{obj.tb_frame = }")
    traceback.print_tb(obj, file=sys.stdout)

This produced output like the following: refcycle-main

In short:

At that point in the test, there are six builtins.traceback objects that exist.
These appear to originate in the Java code, rather than in the Python code (first frame object referenced by each traceback object).
Each of those frame stacks ultimately contain 1 or more references to the s Scenario instance.
These are what make the reference count non-zero, and prevent s from being garbage-collected.
Again, this occurs only with JPype ≥1.4.1 (including 1.5.0) and Python ≤ 3.10. It does not (see the CI runs on #504) occur with Python 3.11 or 3.12. I don't know why this is.
I also spent several hours fiddling with the exception handling. For instance, because tracebacks are associated with exceptions, I tried aggressively deleting and GC'ing jpype.JException instances and returning only Python exceptions in jdbc.py. Nothing seemed to make those traceback objects disappear.

iiasa / ixmp

Adjust for JPype1 v1.4.1 #463

To mitigate

To fix