vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Refcount leak to underlying array when deleting dataframe #2323

Open schwingkopf opened 1 year ago

schwingkopf commented 1 year ago

I'm trying to use vaex with numpy arrays that reference shared memory, and I run into problems when trying to unlink the shared memory. Here is a minimal reproducing example:

import numpy as np
from multiprocessing import shared_memory
import time
import vaex

shm = shared_memory.SharedMemory(create=True, size=8)
arr = np.frombuffer(shm.buf, dtype="uint8", count=8)
df = vaex.from_dict(dict(x=arr))

del arr
del df
time.sleep(2)

shm.close()
shm.unlink()

Execution throws the following exception:

Traceback (most recent call last):
  File "<...>\memory_test.py", line 15, in <module>
    shm.close()
  File "<...>\.pyenv\pyenv-win\versions\3.9.10\lib\multiprocessing\shared_memory.py", line 227, in close
    self._mmap.close()
BufferError: cannot close exported pointers exist

It works fine when not creating the dataframe object.

It seems like vaex is still keeping a reference to the array/shm block after the dataframe object is deleted. Is that a bug, or is there a recommended way to delete all references?
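
One way to inspect what is still holding on to the array is gc.get_referrers from the standard library. A rough diagnostic sketch (not vaex-specific; it only lists referrers that the garbage collector tracks, so references held purely from C code may not show up):

import gc
import numpy as np
import vaex

arr = np.arange(8, dtype="uint8")
df = vaex.from_dict(dict(x=arr))

# Print the types of all gc-tracked objects that directly reference the array.
# The current frame's locals also appear here, so expect some noise.
for ref in gc.get_referrers(arr):
    print(type(ref))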

Software information

schwingkopf commented 1 year ago

Just realized the problem from my initial post is much simpler to explain: there seems to be a refcounting leak in the vaex dataframe:

import numpy as np
import vaex
import sys

arr = np.arange(10)
print(f"Refcount after array creation: {sys.getrefcount(arr)}")

df = vaex.from_dict(dict(x=arr))
print(f"Refcount after df creation: {sys.getrefcount(arr)}")

del df
print(f"Refcount after df deletion: {sys.getrefcount(arr)}")

prints:

Refcount after array creation: 2
Refcount after df creation: 3
Refcount after df deletion: 3

So dataframe deletion is not cleaning up its reference to the array. Is that a bug, or is there another recommended way to release the array?

schwingkopf commented 1 year ago

OK, I think this is not a bug in vaex but related to delayed garbage collection in Python. Manually triggering garbage collection with gc.collect() after del df fixes the issue in both examples above.
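
Applied to the shared-memory example from my first post, the workaround looks like this (same code as above, just with the explicit gc.collect() added):

import gc
import numpy as np
from multiprocessing import shared_memory
import vaex

shm = shared_memory.SharedMemory(create=True, size=8)
arr = np.frombuffer(shm.buf, dtype="uint8", count=8)
df = vaex.from_dict(dict(x=arr))

del arr
del df
gc.collect()  # explicitly collect the garbage still holding on to the buffer

shm.close()   # no BufferError anymore
shm.unlink()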

Although I do not understand why garbage collection is delayed after the array has gone through a vaex dataframe, I will close the issue as most likely not related to vaex internals.

schwingkopf commented 1 year ago

After digging deeper into Python garbage collection internals, I think I closed this one too early...

Using tricks from https://rushter.com/blog/python-garbage-collector/ I can see that the dataframe object still has a non-zero refcount after del df:

import numpy as np
import vaex
import ctypes
import gc

class PyObject(ctypes.Structure):
    # mirrors the start of CPython's object header so the refcount can be read directly
    _fields_ = [("refcnt", ctypes.c_long)]

def array_vaex_leak():
    N = int(0.5e9)
    arr = np.arange(N)
    df = vaex.from_dict(dict(x=arr))
    df_addr = id(df)
    print(f"Refcount before delete: {PyObject.from_address(df_addr).refcnt}")
    del df
    print(f"Refcount after delete: {PyObject.from_address(df_addr).refcnt}")
    gc.collect()
    print(f"Refcount after gc collect: {PyObject.from_address(df_addr).refcnt}")

array_vaex_leak()

Outputs:

Refcount before delete: 3
Refcount after delete: 2
Refcount after gc collect: 0

The fact that the object only gets removed when gc.collect() is called is a strong hint that a cyclic reference involving it exists, preventing its immediate removal on del df. This needs fixing in the vaex code, either by removing the cyclic referencing or by using weakrefs!
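
To illustrate the mechanism (a toy example, nothing to do with vaex internals): an object that is part of a reference cycle is not freed by del alone, only by the cyclic garbage collector:

import ctypes
import gc

class PyObject(ctypes.Structure):
    # mirrors the start of CPython's object header; ob_refcnt is the first field
    _fields_ = [("refcnt", ctypes.c_ssize_t)]

class Node:
    pass

a = Node()
a.self_ref = a   # create a reference cycle
addr = id(a)

del a            # the name is gone, but the cycle keeps the object alive
print(PyObject.from_address(addr).refcnt)  # still non-zero

gc.collect()     # the cycle detector frees the object
print(PyObject.from_address(addr).refcnt)  # reads freed memory; typically shows 0 now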

Effectively this behaves like a memory leak until the Python interpreter decides to run garbage collection, or until user code triggers it explicitly via gc.collect() (which is a relatively costly operation, ~30 ms in my example). This becomes severe when working with large arrays. In the following example the automatic garbage collection does not run often enough, so the system runs out of memory (at least on my machine):

import numpy as np
import time
import vaex
import os
import psutil

def array_vaex_leak():
    N=int(0.5e9)
    arr = np.arange(N)
    df = vaex.from_dict(dict(x=arr))

for i in range(1000):
    array_vaex_leak()
    time.sleep(0.5)
    print(i)
    print(f"{round(psutil.Process(os.getpid()).memory_info().rss / (1024.**3), 3)} Gbyte") 
0
1.983 Gbyte
1
3.846 Gbyte
2
1.983 Gbyte
3
3.846 Gbyte
4
5.709 Gbyte
5
7.571 Gbyte
6
9.434 Gbyte
7
11.296 Gbyte
8
0.259 Gbyte
Traceback (most recent call last):
  File "<...>\gc_play.py", line 30, in <module>
    array_vaex_leak()
  File "<...>\gc_play.py", line 26, in array_vaex_leak
    arr = np.arange(N)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 1.86 GiB for an array with shape (500000000,) and data type int32
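
As a workaround, memory stays bounded if I collect explicitly inside the loop, at the cost of the ~30 ms per gc.collect() call mentioned above. A sketch of the modified loop:

import gc

for i in range(1000):
    array_vaex_leak()
    gc.collect()  # force collection of the cyclic garbage left behind by the deleted df
    time.sleep(0.5)
    print(i)
    print(f"{round(psutil.Process(os.getpid()).memory_info().rss / (1024.**3), 3)} Gbyte")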

I tried to debug and locate the cyclic reference using objgraph, but have not succeeded yet.

Maybe someone more skilled or with more knowledge of vaex internals could help here?
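
For anyone who wants to pick this up, a minimal starting point with objgraph could look like this (needs objgraph and graphviz installed; max_depth and the output filename are arbitrary choices):

import numpy as np
import objgraph
import vaex

arr = np.arange(10)
df = vaex.from_dict(dict(x=arr))

# Render everything that (transitively) refers back to the dataframe.
# If vaex holds a cycle, some vaex-internal object should show up pointing back at df.
objgraph.show_backrefs([df], max_depth=4, filename="df_backrefs.png")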

maartenbreddels commented 1 year ago

It's a difficult topic for sure! I have experimented with this in https://github.com/vaexio/vaex/pull/1824 but I'm not sure why it failed. Maybe this is food for thought? Let me rebase that PR to see what the failure was.

anthonycorletti commented 1 year ago

I'm running into a similar error to the one described in a previous issue, https://github.com/vaexio/vaex/issues/2062. @schwingkopf I'm curious whether downgrading numpy lets your code run successfully?

anthonycorletti commented 1 year ago

numpy 1.23 had a lot of changes, so if you're using 1.23+ there might be something in there that is related: https://github.com/numpy/numpy/releases/tag/v1.23.0

schwingkopf commented 1 year ago

@anthonycorletti thanks for the hint. I just tried the example from my first post:

Interesting... any idea what that means? For the problem to appear, it still requires interaction with a vaex df.

anthonycorletti commented 1 year ago

> Interesting... any idea what that means? For the problem to appear, it still requires interaction with a vaex df.

Happy to hear this at least got something working for you. I'm not exactly sure what this means, unfortunately. I know that 1.22.4 has problems with mmap, which might be due to this change in numpy: https://github.com/numpy/numpy/pull/21446