CTracer fails to deallocate bound methods references

dvarrazzo commented 2 years ago

Describe the bug The CTracer object leaks some objects. This affects tests which try to detect objects leaks in the code they test.

To Reproduce

In a clean virtualenv, on a Linux system with the libpq5 package installed:

git clone git@github.com:psycopg/psycopg.git
cd psycopg
git checkout 4b6b21fe362f92117f231bdfcdf8fa1a33d56add
pip install -e ./psycopg/[test]  # installs pytest-cov 3.0.0
# a connection string to an existing database which will be used for testing
export PSYCOPG_TEST_DSN="dbname=psycopg3_test port=54314"
pytest "tests/test_copy.py::test_copy_from_leaks[Format.BINARY-True]" --cov

How can we reproduce the problem? Please be specific. Don't just link to a failing CI job. Answer the questions below:

What version of Python are you using? Python 3.8
What version of coverage.py shows the problem? The output of coverage debug sys is helpful. Coverage 6.1.2

$ coverage debug sys

-- sys -------------------------------------------------------
               coverage_version: 6.1.2
                coverage_module: /home/piro/dev/tmp/env/lib/python3.8/site-packages/coverage/__init__.py
                         tracer: -none-
                        CTracer: available
           plugins.file_tracers: -none-
            plugins.configurers: -none-
      plugins.context_switchers: -none-
              configs_attempted: .coveragerc
                                 setup.cfg
                                 tox.ini
                                 pyproject.toml
                   configs_read: /home/piro/dev/tmp/psycopg/tox.ini
                                 /home/piro/dev/tmp/psycopg/pyproject.toml
                    config_file: /home/piro/dev/tmp/psycopg/pyproject.toml
                config_contents: b'[build-system]\nrequires = ["setuptools>=49.2.0", "wheel>=0.34.2"]\nbuild-backend = "setuptools.build_meta"\n\n[tool.black]\nline-length = 79\n\n[tool.pytest.ini_options]\nfilterwarnings = [\n    "error",\n    # The zoneinfo warning ignore is only required on Python 3.6\n    "ignore::DeprecationWarning:backports.zoneinfo._common",\n]\ntestpaths=[\n    "tests",\n]\n# Note: On Travis they these options seem to leak objects\n# log_format = "%(asctime)s.%(msecs)03d %(levelname)-8s %(name)s:%(filename)s:%(lineno)d %(message)s"\n# log_level = "DEBUG"\n\n[tool.coverage.run]\nsource = [\n    "psycopg/psycopg",\n    "psycopg_pool/psycopg_pool",\n]\n[tool.coverage.report]\nexclude_lines = [\n    "if TYPE_CHECKING:",\n    \'\\.\\.\\.$\',\n]\n\n[tool.mypy]\nfiles = [\n    "psycopg/psycopg",\n    "psycopg_pool/psycopg_pool",\n    "psycopg_c/psycopg_c",\n    "tests",\n]\nwarn_unused_ignores = true\nshow_error_codes = true\nstrict = true\n\n[[tool.mypy.overrides]]\nmodule = [\n    "shapely.*",\n]\nignore_missing_imports = true\n\n[[tool.mypy.overrides]]\nmodule = "tenacity.*"\nimplicit_reexport = true\n\n[[tool.mypy.overrides]]\nmodule = "uvloop"\nignore_missing_imports = true\n\n[[tool.mypy.overrides]]\nmodule = "tests.*"\ncheck_untyped_defs = true\ndisallow_untyped_defs = false\ndisallow_untyped_calls = false\n'
                      data_file: -none-
                         python: 3.8.10 (default, Sep 28 2021, 16:10:42) [GCC 9.3.0]
                       platform: Linux-5.11.0-38-generic-x86_64-with-glibc2.29
                 implementation: CPython
                     executable: /home/piro/dev/tmp/env/bin/python
                   def_encoding: utf-8
                    fs_encoding: utf-8
                            pid: 338650
                            cwd: /home/piro/dev/tmp/psycopg
                           path: /home/piro/dev/tmp/env/bin
                                 /usr/lib/python38.zip
                                 /usr/lib/python3.8
                                 /usr/lib/python3.8/lib-dynload
                                 /home/piro/dev/tmp/env/lib/python3.8/site-packages
                                 /home/piro/dev/tmp/psycopg/psycopg
                    environment: HOME = /home/piro
                   command_line: /home/piro/dev/tmp/env/bin/coverage debug sys
                sqlite3_version: 2.6.0
         sqlite3_sqlite_version: 3.31.1
             sqlite3_temp_store: 0
        sqlite3_compile_options: COMPILER=gcc-9.3.0; ENABLE_COLUMN_METADATA; ENABLE_DBSTAT_VTAB
                                 ENABLE_FTS3; ENABLE_FTS3_PARENTHESIS; ENABLE_FTS3_TOKENIZER
                                 ENABLE_FTS4; ENABLE_FTS5; ENABLE_JSON1
                                 ENABLE_LOAD_EXTENSION; ENABLE_PREUPDATE_HOOK; ENABLE_RTREE
                                 ENABLE_SESSION; ENABLE_STMTVTAB; ENABLE_UNLOCK_NOTIFY
                                 ENABLE_UPDATE_DELETE_LIMIT; HAVE_ISNAN; LIKE_DOESNT_MATCH_BLOBS
                                 MAX_SCHEMA_RETRY=25; MAX_VARIABLE_NUMBER=250000; OMIT_LOOKASIDE
                                 SECURE_DELETE; SOUNDEX; TEMP_STORE=1
                                 THREADSAFE=1; USE_URI

Other info in the repro.

Expected behavior The tests shouldn't be altered by coverage. That specific test (as well as the other two in the same parametrize set) fails with:

E               AssertionError: objects leaked: 2, 2
E               assert 48177 == 48179

Which means that running a certain function under test 3 times there are two leaked objects between the first and second run and two leaked objects between second and third.

Applying this patch on the psycopg test will reveal the objects leaked:

patch

```diff diff --git a/tests/test_copy.py b/tests/test_copy.py index 05448f4a..6b1b60f8 100644 --- a/tests/test_copy.py +++ b/tests/test_copy.py @@ -638,11 +638,19 @@ def test_copy_from_leaks(dsn, faker, fmt, set_types, retries): for retry in retries: with retry: n = [] + meths = [{} for _ in range(3)] for i in range(3): work() gc_collect() - n.append(len(gc.get_objects())) - + objs = gc.get_objects() + n.append(len(objs)) + for obj in objs: + if str(type(obj)) == "": + meths[i][id(obj)] = obj + del objs + + print("diff 0-1", [meths[1][k] for k in set(meths[1]) - set(meths[0])]) + print("diff 1-2", [meths[2][k] for k in set(meths[2]) - set(meths[1])]) assert ( n[0] == n[1] == n[2] ), f"objects leaked: {n[1] - n[0]}, {n[2] - n[1]}" ```

e.g.

diff 0-1 [<bound method Collector.disable_plugin of <Collector at 0x7fdb868c3d30: CTracer>>, <bound method Collector.switch_context of <Collector at 0x7fdb868c3d30: CTracer>>]
diff 1-2 [<bound method Collector.switch_context of <Collector at 0x7fdb868c3d30: CTracer>>, <bound method Collector.disable_plugin of <Collector at 0x7fdb868c3d30: CTracer>>]

No other test seems affected (which is curious, as there are other tests using the same technique to detect leaks).

Additional context Ref. https://github.com/psycopg/psycopg/pull/159

dvarrazzo commented 2 years ago

I apologise, I forgot a bit of repro. You will need a postgres database somewhere and the connection string exposed in the PSYCOPG_TEST_DSN env var before running pytest:

echo $PSYCOPG_TEST_DSN
dbname=psycopg3_test port=54314

nedbat commented 2 years ago

Thanks for the interesting report... I'm trying to create a simpler test to show the problem, and haven't managed it yet. The two objects you list are created when the collector is started. But the 3-times loop in your test shouldn't start new collectors, so even if starting one leaks two objects, I don't see how it could be leaking more than two.

Is there something in your tests that would be starting and stopping the coverage measurement?

dvarrazzo commented 2 years ago

Nothing that explicitly touches coverage, no. What I have in mind is that copy_from uses a worker thread, whereas copy_to doesn't, and the issue only happens on copy_from.

dvarrazzo commented 2 years ago

I think it's related to threads, yes, because applying the following patch to psycopg (which disables the use of the worker thread) makes the issue disappear.

diff --git a/psycopg/psycopg/copy.py b/psycopg/psycopg/copy.py
index 8ee90b1b..7d30c746 100644
--- a/psycopg/psycopg/copy.py
+++ b/psycopg/psycopg/copy.py
@@ -273,22 +273,11 @@ class Copy(BaseCopy["Connection[Any]"]):
         if not data:
             return

-        if not self._worker:
-            # warning: reference loop, broken by _write_end
-            self._worker = threading.Thread(target=self.worker)
-            self._worker.daemon = True
-            self._worker.start()
-
-        self._queue.put(data)
+        self.connection.wait(copy_to(self._pgconn, data))

     def _write_end(self) -> None:
         data = self.formatter.end()
         self._write(data)
-        self._queue.put(None)
-
-        if self._worker:
-            self._worker.join()
-            self._worker = None  # break the loop

 class AsyncCopy(BaseCopy["AsyncConnection[Any]"]):

nedbat commented 2 years ago

OK, I can reproduce what you are seeing, but I'm not sure what about my code is making it happen. Those two attributes are managed very simply. When I add another attribute, and assign object() to it, it doesn't leak. Then when I add a bound method to it, it appears in the leak count. Something is different about the bound methods, and I'm not sure I can do anything about it.

Also, my test leaks slightly differently than yours: one object plus two bound methods per thread, and only between 1 and 2, not between 0 and 1. That is, if I start 3 threads, I get, objects leaked: 0, 10. How did you analyze the leaks to know to only consider methods in the updated code?

nedbat commented 2 years ago

My experiments are here: https://github.com/nedbat/coveragepy/tree/nedbat/bug1283

Oddly, if I comment out the lines that deallocate the tracer attributes, it doesn't change the reported leaks at all....

dvarrazzo commented 2 years ago

I think what is happening there is that the bound methods constitute a reference loop, so you need to enable GC on your class by implementing tp_clear.

nedbat commented 2 years ago

I've tried adding the HAVE_GC flag, and implementing tp_traverse and tp_clear, but it isn't changing the behavior: https://github.com/nedbat/coveragepy/tree/nedbat/bug1283

dvarrazzo commented 2 years ago

I don't know your code, how are switch_context and disable_plugin assigned to the instance. But maybe it would be better to have them as regular Python functions, or unbound members, not as bound members (which creates a closure object), and call them passing self explicitly?

nedbat commented 2 years ago

I thought about that, but then the tracer would need to hold a reference to the Collector (self). So either the bound method references the collector, or I keep an explicit reference to it. Either way, there's a reference chain from the CTracer to the Collector, and also from the Collector back to the CTracer.

dvarrazzo commented 2 years ago

That's easier to break using the GC though, or you could use weakrefs instead. I have surely done both these things in psycopg2, which is a hand-writte C extension, and definitely has a few complex objects and creates and manages a few objects loops; conversely, even in its dynamic nature, in psycopg2 I've never needed to create/use bound methods in C.

Alternatively, I don't know if there is a way to "unbind a method" by setting some of its attribute to None and which you could do in tp_traverse... but I think we are getting in the realm of the implementation details here.

nedbat / coveragepy

CTracer fails to deallocate bound methods references #1283