wangyiqiu / dbscan-python

Theoretically Efficient and Practical Parallel DBSCAN
https://sites.google.com/view/yiqiuwang/dbscan
MIT License
70 stars 15 forks source link

DBSCAN do not free memory #18

Open ShJacub opened 11 months ago

ShJacub commented 11 months ago

Thank for your fast DBSCAN realization. I have a problem. Calling dbscan.DBSCAN(x) consums additional memory. If I call dbscan.DBSCAN(x) n time consums n*V memory, where V is memory for one dbscan.DBSCAN(x) calling.

anivegesana commented 11 months ago

It is possible that didn't handle the reference counts in the Python wrapper correctly. Can you share a small example that I can use to reproduce the issue?


This seems to reproduce it:

import dbscan
import sys
f = np.random.rand(5000000, 5)
for i in range(100):
  q = dbscan.DBSCAN(f)

sys.getrefcount(q)
sys.getrefcount(q[0])

If the object passed in is a NULL pointer, it is assumed that this was caused because the call producing the argument found an error and set an exception. https://docs.python.org/3/c-api/arg.html#c.Py_BuildValue

Does the following build fix the issue for you?

pip install git+https://github.com/anivegesana/dbscan-python@memleak
ShJacub commented 11 months ago

Good day, I'm sorry for late response. I ran your code and I have the same memory leak. Outputs of sys.getrefcount(q), sys.getrefcount(q[0]) are 2 and 3. I also tried pip install git+https://github.com/anivegesana/dbscan-python@memleak and have the same memory leak.

anivegesana commented 11 months ago

Hey @ShJacub,

I think I need some help reproducing the memory leak on my new branch. This is code snippet that I am currently running and its output.

>>> import numpy as np
>>> import dbscan
>>> dbscan.__version__
'0.0.12.dev1+gc993316.d20230420'
>>> x = np.random.rand(5000000, 5)
>>> r = dbscan.DBSCAN(x)
>>> sys.getrefcount(r)
2
>>> sys.getrefcount(r[0])
2
>>> sys.getrefcount(r[1])
2
>>> import weakref
>>> del r
>>> g = weakref.ref(r[0])
>>> g()
array([0, 0, 0, ..., 0, 0, 0], dtype=int32)
>>> gc.collect()
480
>>> g()

Thank you for pointing out the memory leak. I need to be a little bit more careful when reading Python C API documentation since information about borrowed and owned references isn't always in an obvious place. 😅

ShJacub commented 11 months ago

Is it possible to solve this problem?

anivegesana commented 11 months ago

Yes, it is. Just need some more information. Can you run the code that I shared in the previous comment and share the output? Also, can you share the OS, Python version, and NumPy version that you are using?

ShJacub commented 11 months ago

Ubuntu 20.04.6 LTS python 3.8.10 numpy 1.24.4

I ran code placed above. These are outputs:

import numpy as np import dbscan dbscan.version '0.0.12' x = np.random.rand(5000000, 5) r = dbscan.DBSCAN(x) sys.getrefcount(r) Traceback (most recent call last): File "", line 1, in NameError: name 'sys' is not defined import sys sys.getrefcount(r) 2 sys.getrefcount(r[0]) 3 sys.getrefcount(r[1]) 3 import weakref del r g = weakref.ref(r[0]) Traceback (most recent call last): File "", line 1, in NameError: name 'r' is not defined g() Traceback (most recent call last): File "", line 1, in NameError: name 'g' is not defined gc.collect() Traceback (most recent call last): File "", line 1, in NameError: name 'gc' is not defined g() Traceback (most recent call last): File "", line 1, in NameError: name 'g' is not defined

anivegesana commented 11 months ago

Sorry about the delay. Was a little bit busy at work for a couple of days and didn't have a chance to take a look at this. It seems like the version of the dbscan-python library doesn't match up. You have the production version (0.0.12) and I have the version with the fix (0.0.12.dev1+gc993316.d20230420.) Perhaps it failed to compile on your machine? I will build you a wheel tonight for you to try it out.

For some reason, the version name on the wheel is messed up, but it should mostly work fine.

curl 'https://drive.google.com/uc?export=download&id=1Wrglr9Xo9dyDiD9ngPLcPjI9sFCiYLIJ' -o "dbscan-0.1.dev90+g6a8f3e3-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
pip install "dbscan-0.1.dev90+g6a8f3e3-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl" --force-reinstall --no-deps
alwansm commented 9 months ago

Hello,

Please, I have encountered a memory problem but in C++. When I call the DBSCAN function inside a for loop, the memory usage increases and keeps increasing until it crashes

#include <iostream>

#include "dbscan/capi.h"
#include "dbscan/point.h"
#include "dbscan/geometryIO.h"
#include "dbscan/pbbs/parallel.h"
#include "dbscan/pbbs/parseCommandLine.h"

int main(int argc, char *argv[])
{

  commandLine P(argc, argv, "[-o <outFile>] [-eps <p_epsilon>] [-minpts <p_minpts>] <inFile>");
  char *iFile = P.getArgument(0);
  char *oFile = P.getOptionValue("-o");
  size_t rounds = P.getOptionIntValue("-r", 1);
  double p_epsilon = P.getOptionDoubleValue("-eps", 1);
  size_t p_minpts = P.getOptionIntValue("-minpts", 1);
  double p_rho = P.getOptionDoubleValue("-rho", -1);

  for (int i = 0; i < 10000; i++)
  {
    parlay::internal::start_scheduler();
    int dim = readHeader(iFile);
    _seq<double> PIn = readDoubleFromFile(iFile, dim);

    bool *coreFlag = new bool[PIn.n / dim];
    int *cluster = new int[PIn.n / dim];
    double *data = PIn.A;

    if (DBSCAN(dim, PIn.n / dim, data, p_epsilon, p_minpts, coreFlag, cluster))
      cout << "Error: dimension >20 is not supported." << endl;

    // if (oFile != NULL)
    // {
    //   writeArrayToFile("cluster-id", cluster, PIn.n / dim, oFile);
    // }

    PIn.del();
    delete[] coreFlag;
    delete[] cluster;
    parlay::internal::stop_scheduler();
  }
  return 0;
}
william-silversmith commented 8 months ago

I ran into a similar problem in some code I wrote a while back (not in DBSCAN) and here was its solution: https://github.com/seung-lab/mapbuffer/commit/3746bd873d6403080ff1ee854c151e3fd037722c

I suspect the issue has something to do with not releasing Xobj or X. You may want to call: Py_DECREF(X); to decrement the reference count on the python object before returning stuff.

anivegesana commented 8 months ago

Oh yes. You are absolutely right! PyArg_ParseTupleAndKeywords creates a strong reference to objects. The C API has a lot of sharp corners. Will try it out. I don't think this solves @alwansm 's problem, but Valgrind should help find that.

GunjanKholapure commented 6 months ago

Any updates on this? It increasingly takes up more memory in python when run continuously

GunjanKholapure commented 5 months ago

Update: This is what worked for me:

Since I just needed labels and not the core mask I have changed the return statement and I haven't seen any memory leak after this for my use case:

Instead of returning: return PyTuple_Pack(2, labels, core_samples);

I am returning: Py_DECREF(X); Py_DECREF(core_samples); return (PyObject*) labels;

After making these changes I did pip install -e . in the repo I had to comment these lines for it to work: py_limited_api=True ('Py_LIMITED_API', '0x03020000')

GunjanKholapure commented 4 months ago

Hey Yuki Thanks a lot for this. Is it possible to publish a new pip package with the fixes? That would be very useful.

Gunjan

On Wed, 19 Jun 2024 at 10:36, yuki-inaho @.***> wrote:

JFYI, I have addressed the memory leak issue by implementing the following changes. You can see the commits that were made to resolve the issue in the forked repository:

Commit 1 https://github.com/yuki-inaho/dbscan-python/commit/1bb13f36f2205b7535be5aaadbed4987fd7239b8 Commit 2 https://github.com/yuki-inaho/dbscan-python/commit/748040c67a182405523ae62a2647fc0cc257e7ff We confirmed the improvement by running tests outlined in the following notebook: Memory Leak Check Notebook https://github.com/yuki-inaho/dbscan_comparison/blob/main/check_memory_leaks.ipynb

— Reply to this email directly, view it on GitHub https://github.com/wangyiqiu/dbscan-python/issues/18#issuecomment-2177752568, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACOFE36LNUKB4X5DLHXQPODZIEGUPAVCNFSM6AAAAABAJ7AECGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNZXG42TENJWHA . You are receiving this because you commented.Message ID: @.***>