Closed: khaeru closed this issue 10 months ago.
There was no intended change of behavior in JPype. But we had to change mechanisms. If the new mechanism is triggering a referencing problem, please try to replicate.
Totally understood, and thank you for the comment.
As the description says, we can try to condense a MWE from our somewhat complicated code base—only, no bandwidth to do this right now. This issue is just to put a pin in the matter until we manage to find time.
Can you give me a rough description of your test?
For example, something like this:
1) Create a Java object instance
2) Check the ref count is 1
3) Set up a reference loop
4) Delete the original reference
5) Call garbage collector
6) Check if hash of item appears in GC list
(which is incidentally one of the patterns that I check).
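For concreteness, a rough sketch of such a pattern might look like the following. This is my own illustration, not the actual JPype test suite; it assumes a JVM can be located and uses java.util.ArrayList as an arbitrary Java class:
import gc
import sys

import jpype

jpype.startJVM()  # assumes a default JVM can be found

# 1) Create a Java object instance (wrapped on the Python side)
obj = jpype.JClass("java.util.ArrayList")()

# 2) Check the reference count (getrefcount itself adds a temporary reference)
print(sys.getrefcount(obj))

# 3) Set up a reference loop through a plain Python container
holder = {"java_obj": obj}
holder["self"] = holder
obj_id = id(obj)  # remember the wrapper's identity before dropping it

# 4) Delete the original references
del obj
del holder

# 5) Call the garbage collector, saving everything it collects
gc.set_debug(gc.DEBUG_SAVEALL)
gc.collect()

# 6) Check whether the item appears in the collector's list
print("wrapper collected:", obj_id in {id(o) for o in gc.garbage})
gc.set_debug(0)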
Most important details: we have built-in leak checkers for objects, strings, and primitives (brute force: create a lot and make sure they go away), but those can miss structural defects in which the order, timing, or connections of an object cause an issue.
With the release of JPype 1.5.0, this has become relevant again since all our workflows on Python < 3.11 are breaking. This is because the temporary workaround specifies pip install "JPype1 != 1.4.1", which is also satisfied by installing 1.5.0. Interestingly, 1.5.0 also works for Python 3.11. For the sake of completeness, there are also other minor version changes from the last working runs to the first failed ones: ipykernel went from 6.27.1 to 6.28.0, jupyter-core from 5.5.1 to 5.6.0, and jupyter-server-terminals from 0.5.0 to 0.5.1. However, they all seem unrelated, especially since the error messages are exactly the same as described above here.
So a path to mitigation would be clear: instead of temporarily enforcing JPype1 != 1.4.1, enforce JPype1 < 1.4.1 or even JPype1 == 1.4.0, both of which would mean we never automatically test whether versions newer than 1.5.0 might resolve this issue. So a proper fix would be desirable, and since you've taken an interest before, @Thrameos, maybe you could take another look? I'm not familiar with Java, unfortunately, but I'll do my best to break down our test to answer your questions. Any help is much appreciated :)
TL;DR: this has taken much longer than anticipated. For my money, the issue can likely be explained by understanding the changes you made for Python 3.11. JPype1 1.5.0 (and 1.4.1, for that matter) works as expected on Python 3.11; it's only failing on older versions on our side.
The first failing test is located in ixmp/tests/backend/test_base.py:
def test_del_ts(self, test_mp):
    """Test CachingBackend.del_ts()."""
    # Since CachingBackend is an abstract class, test it via JDBCBackend
    backend = test_mp._backend
    cache_size_pre = len(backend._cache)

    # Load data, thereby adding to the cache
    s = make_dantzig(test_mp)
    s.par("d")

    # Cache size has increased
    assert cache_size_pre + 1 == len(backend._cache)

    # Delete the object; associated cache is freed
    del s

    # Objects were invalidated/removed from cache
    assert cache_size_pre == len(backend._cache)
self is an instance of the test class and should not matter, but test_mp is an ixmp.core.platform.Platform object. This is a Python class in our modelling framework that connects models to a data-storing backend. The backend, in turn, is defined in the test_mp fixture as an instance of the JDBC backend. This is another Python class (inheriting from CachingBackend) that uses JPype/JDBC to connect to an Oracle or HyperSQL database. In our test case, if that's relevant, we are using a local HyperSQL database. backend._cache is just a dictionary, then.
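For intuition, here is a minimal sketch of how such a dict cache keyed on the scenario and item could behave. This is my illustration only; the real CachingBackend key format may differ:
# Illustration only: the real CachingBackend key format may differ.
_cache = {}

def cache(s, type, name, filters, value):
    # One entry per (scenario, item type, item name, filters) combination
    _cache[(id(s), type, name, repr(filters))] = value

def cache_invalidate(s):
    # Drop every entry belonging to scenario `s`
    for key in [k for k in _cache if k[0] == id(s)]:
        del _cache[key]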
Next, the function make_dantzig creates a scenario containing the data of the Dantzig transportation problem. The line s.par("d") accesses the data of the distance parameter d, which looks (in Python) something like this:
i          j         value  unit
seattle    new-york  2.5    km
seattle    chicago   1.7    km
seattle    topeka    1.8    km
san-diego  new-york  2.5    km
san-diego  chicago   1.8    km
san-diego  topeka    1.4    km
(Just noticing: these distances are supposed to represent thousands of miles rather than km, so our units are wrong here, but that hardly matters for the error, I think.)
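For reference, the same data can be rebuilt by hand as a pandas DataFrame (purely illustrative; the real object is produced by item_get_elements, shown further below):
import pandas as pd

# The distance parameter "d" as returned (roughly) by s.par("d")
d = pd.DataFrame(
    [
        ["seattle", "new-york", 2.5, "km"],
        ["seattle", "chicago", 1.7, "km"],
        ["seattle", "topeka", 1.8, "km"],
        ["san-diego", "new-york", 2.5, "km"],
        ["san-diego", "chicago", 1.8, "km"],
        ["san-diego", "topeka", 1.4, "km"],
    ],
    columns=["i", "j", "value", "unit"],
)
print(d)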
On the Python side, this is stored in the backend._cache dict such that one value is added, which is confirmed by the first successful assert line. Then, we are deleting this whole scenario s again and expect this action to free the backend._cache, too, which is not happening for JPype1 >= 1.4.1.
The del method has been overridden for the TimeSeries class, which the Scenario class inherits from, so del s will call the JDBCBackend del_ts function. That first calls the general CachingBackend.del_ts(), which just calls its own cache_invalidate(). This, finally, should remove all values from the backend._cache dict (but might return None if some key could not be found). Back in JDBCBackend.del_ts(), the last step is to call JDBCBackend.gc(), which shows some explicit mention of jpype:
def gc(cls):
    if _GC_AGGRESSIVE:
        # log.debug('Collect garbage')
        try:
            java.System.gc()
        except jpype.JVMNotRunning:
            pass
        gc.collect()
Lastly, the overridden del method only calls this whole chain inside a try clause, excepting (AttributeError, ReferenceError) and moving on, since we assume the object has already been garbage-collected. So if one of these is raised without backend._cache being emptied, this would probably result in exactly the error we are seeing.

(Not sure this can interfere here, but platform.__getattr__ has been overridden and could raise an AttributeError.)
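Putting the pieces together, the deletion chain described above looks roughly like this. This is a condensed sketch with simplified attribute names and bodies, not the literal ixmp code:
import gc as _gc


class CachingBackend:
    def __init__(self):
        self._cache = {}  # the dict discussed above

    def cache_invalidate(self, ts):
        # Simplified: the real code removes only the entries belonging to `ts`
        self._cache.clear()

    def del_ts(self, ts):
        self.cache_invalidate(ts)


class JDBCBackend(CachingBackend):
    def gc(self):
        # Stand-in for the gc() method shown above (java.System.gc() + gc.collect())
        _gc.collect()

    def del_ts(self, ts):
        super().del_ts(ts)  # CachingBackend.del_ts() -> cache_invalidate()
        self.gc()


class TimeSeries:
    # Scenario inherits from TimeSeries, so `del s` ends up here
    def __init__(self, backend):
        self._backend = backend

    def __del__(self):
        try:
            self._backend.del_ts(self)
        except (AttributeError, ReferenceError):
            # Assume everything has already been garbage-collected
            pass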
On the Java side, I can't tell you much about what's happening, I'm afraid. When the Scenario object is initialized, it also initializes the JDBCBackend, which does things like start_jvm(), and:
try:
    self.jobj = java.Platform("Python", properties)
except java.NoClassDefFoundError as e:  # pragma: no cover
    raise NameError(
        f"{e}\nCheck that dependencies of ixmp.jar are "
        f"included in {Path(__file__).parents[2] / 'lib'}"
    )
except jpype.JException as e:  # pragma: no cover
    # Handle Java exceptions
    jclass = e.__class__.__name__
    if jclass.endswith("HikariPool.PoolInitializationException"):
        redacted = copy(kwargs)
        redacted.update(user="(HIDDEN)", password="(HIDDEN)")
        msg = f"unable to connect to database:\n{repr(redacted)}"
    elif jclass.endswith("FlywayException"):
        msg = "when initializing database:"
        if "applied migration" in e.args[0]:
            msg += (
                "\n\nThe schema of the database does not match the schema of "
                "this version of ixmp. To resolve, either install the version "
                "of ixmp used to create the database, or delete it and retry."
            )
    else:
        _raise_jexception(e)
    raise RuntimeError(f"{msg}\n(Java: {jclass})")
So it is checking for some exceptions. The next step, then, finally gets deep into Java territory: s.par("d") calls s._backend("item_get_elements", "par", name, filters) (with name="d" and filters=None), which calls s.platform._backend(s, method, *args, **kwargs) (with method="item_get_elements" and the rest passed along as *args). s.platform._backend is an instance of (through inheritance) the abstract Backend class, which specifies a __call__(self, obj, method, *args, **kwargs) function that returns getattr(self, method)(obj, *args, **kwargs); a condensed sketch of this dispatch follows the listing below. With self=JDBCBackend, this means the function in question is this:
# type="par", name="d"
def item_get_elements(self, s, type, name, filters=None):  # noqa: C901
    if filters:
        # Convert filter elements to strings
        filters = {dim: as_str_list(ele) for dim, ele in filters.items()}

    try:
        # Retrieve the cached value with this exact set of filters
        return self.cache_get(s, type, name, filters)
    except KeyError:
        pass  # Cache miss

    try:
        # Retrieve a cached, unfiltered value of the same item
        unfiltered = self.cache_get(s, type, name, None)
    except KeyError:
        pass  # Cache miss
    else:
        # Success; filter and return
        return filtered(unfiltered, filters)

    # Failed to load item from cache

    # Retrieve the item
    item = self._get_item(s, type, name, load=True)
    idx_names = list(item.getIdxNames())
    idx_sets = list(item.getIdxSets())

    # Get list of elements, using filters if provided
    if filters is not None:
        jFilter = java.HashMap()

        for idx_name, values in filters.items():
            # Retrieve the elements of the index set as a list
            idx_set = idx_sets[idx_names.index(idx_name)]
            elements = self.item_get_elements(s, "set", idx_set).tolist()

            # Filter for only included values and store
            filtered_elements = filter(lambda e: e in values, elements)
            jFilter.put(idx_name, to_jlist(filtered_elements))

        jList = item.getElements(jFilter)
    else:
        jList = item.getElements()

    if item.getDim() > 0:
        # Mapping set or multi-dimensional equation, parameter, or variable
        columns = copy(idx_names)

        # Prepare dtypes for index columns
        dtypes = {}
        for idx_name, idx_set in zip(columns, idx_sets):
            # NB using categoricals could be more memory-efficient, but requires
            #    adjustment of tests/documentation. See
            #    https://github.com/iiasa/ixmp/issues/228
            # dtypes[idx_name] = CategoricalDtype(
            #     self.item_get_elements(s, 'set', idx_set))
            dtypes[idx_name] = str

        # Prepare dtypes for additional columns
        if type == "par":
            columns.extend(["value", "unit"])
            dtypes.update(value=float, unit=str)
            # Same as above
            # dtypes['unit'] = CategoricalDtype(self.jobj.getUnitList())
        elif type in ("equ", "var"):
            columns.extend(["lvl", "mrg"])
            dtypes.update(lvl=float, mrg=float)

        # Copy vectors from Java into pd.Series to form DataFrame columns
        columns = []

        def _get(method, name, *args):
            columns.append(
                pd.Series(
                    # NB [:] causes JPype to use a faster code path
                    getattr(item, f"get{method}")(*args, jList)[:],
                    dtype=dtypes[name],
                    name=name,
                )
            )

        # Index columns
        for i, idx_name in enumerate(idx_names):
            _get("Col", idx_name, i)

        # Data columns
        if type == "par":
            _get("Values", "value")
            _get("Units", "unit")
        elif type in ("equ", "var"):
            _get("Levels", "lvl")
            _get("Marginals", "mrg")

        result = pd.concat(columns, axis=1, copy=False)
    elif type == "set":
        # Index sets
        # dtype=object is to silence a warning in pandas 1.0
        result = pd.Series(item.getCol(0, jList)[:], dtype=object)
    elif type == "par":
        # Scalar parameter
        result = dict(
            value=float(item.getScalarValue().floatValue()),
            unit=str(item.getScalarUnit()),
        )
    elif type in ("equ", "var"):
        # Scalar equation or variable
        result = dict(
            lvl=float(item.getScalarLevel().floatValue()),
            mrg=float(item.getScalarMarginal().floatValue()),
        )

    # Store cache
    self.cache(s, type, name, filters, result)

    return result
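To recap the dispatch described just before this listing, here is a condensed sketch with simplified signatures; these are not the actual ixmp classes:
# Condensed sketch of the Backend.__call__ dispatch; not the actual ixmp classes.
class Backend:
    def __call__(self, obj, method, *args, **kwargs):
        # e.g. method="item_get_elements" resolves to self.item_get_elements
        return getattr(self, method)(obj, *args, **kwargs)


class JDBCBackend(Backend):
    def item_get_elements(self, s, type, name, filters=None):
        # the method shown in the listing above
        ...


# s.par("d") ends up doing something equivalent to:
#   backend = s.platform._backend          # a JDBCBackend instance
#   backend(s, "item_get_elements", "par", "d", None)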
From the ixmp Java source code (which is private for some reason, I believe), I would say that Parameter is a class (public class Parameter extends Item {), but I can't really tell you, without reading up much more on Java, whether we set up a reference loop or some such, sorry.
In the test, we then call the del construction as described above, which does not mention java or jpype a lot.
@glatterf42 thanks for digging in—I hope that was instructive!
The fault probably lies in code that I originally added in #213, several years ago. The basic purpose of this code is to aggressively free up references to Python objects (like Scenario, TimeSeries, etc.) so that they and the associated ixmp_source Java objects (via JPype) can be garbage-collected, freeing memory. This was deemed necessary for larger applications, like running many instances of the MESSAGEix-GLOBIOM global model in older workflows (using the scenario_runner/runscript_main pattern).
I will open a PR to remove the workaround for the current issue and adjust/update the tests and/or caching/memory-management code. Then we can discuss whatever the solution turns out to be.
I spent some further time investigating this, including using refcycle. After the del s here (https://github.com/iiasa/ixmp/blob/d92cfcdb5b0156ecf769a2920ec644c58e520755/ixmp/tests/backend/test_base.py#L148-L149), I inserted code like:
import gc

gc.collect()
gc.collect()
gc.collect()

import sys
import traceback

import refcycle

import ixmp

snapshot = refcycle.snapshot()
scenarios = snapshot.find_by(lambda obj: isinstance(obj, ixmp.Scenario))
s2 = scenarios[0]
print(f"{s2 = }")
snapshot.ancestors(s2).export_image("refcycle-main.svg")

tbs = list(
    snapshot.ancestors(s2).find_by(
        lambda obj: type(obj).__name__ == "traceback"
    )
)
print(f"{tbs = }")

del snapshot
gc.collect()

for i, obj in enumerate(tbs):
    print(f"{i} {obj}")
    print(f"{obj.tb_frame = }")
    traceback.print_tb(obj, file=sys.stdout)
This produced output like the following:
In short: even after del s, other objects still hold references to s, the Scenario instance (the snippet above looks for traceback objects among its ancestors), and these references prevent s from being garbage-collected.
With the release of JPype1 version 1.4.1, CI jobs have started to fail, e.g. here, with messages like:

When the del s line is reached, these persisting references / non-zero refcount prevent the object itself, and associated cached item values, from being garbage-collected.

To mitigate

To fix