scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

C++ Memory error on Linux/Docker #25

Closed · dustinstansbury closed this issue 8 years ago

dustinstansbury commented 8 years ago

First off, thanks for your work on the implementation of the algorithm; it's excellent. I do most of my development on OSX, and HDBSCAN has worked like a charm.

However, I've recently been trying to deploy some models that require the package on EC2 instances and keep getting segmentation faults from glibc's munmap_chunk(). As an example, below is the output of the same test script run first on my development machine and then on the Linux box:

Test Script

import numpy as np
import cython as ct
import scipy as sp
import sklearn as sk

import subprocess

# get machine info
proc_cpu = subprocess.Popen(['/bin/bash'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
cpu_info = proc_cpu.communicate('lscpu')[0]
proc_os = subprocess.Popen(['/bin/bash'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
os_info = proc_os.communicate('uname -a')[0]

# custom logger
from crunch.logger import Logger 
logger = Logger('test hdbscan')

# report info on current machine
logger.log.info('Current hardware configuration:\n%s' % cpu_info)
logger.log.info('Current os configuration:\n%s' % os_info)

# display requirements and current python distros
requirements = """
cython>=0.22
numpy>=1.9
scipy >= 0.9
scikit-learn>=0.16
"""
logger.log.info('Requirements for HDBSCAN:\n%s' % requirements)

current_config = """
cython: %s
numpy: %s
scipy: %s
scikit-learn: %s
""" % (ct.__version__, np.__version__, sp.__version__, sk.__version__)
logger.log.info('Current Python configuration:\n%s' % current_config)

# generate cluster data
np.random.seed(4711)  # for repeatability 
sample_size = 50
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[sample_size,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[sample_size,])
c = np.random.multivariate_normal([40, 40], [[20, 1], [1, 30]], size=[sample_size,])
d = np.random.multivariate_normal([80, 80], [[30, 1], [1, 30]], size=[sample_size,])
e = np.random.multivariate_normal([0, 100], [[100, 1], [1, 100]], size=[sample_size,])
f = 300. * (np.random.rand(sample_size,2) - .5)

# cast data as different dtypes (doesn't matter)
X = np.asarray(np.concatenate((a, b, c, d, e, f),), dtype=np.float32) 

# attempt to run clustering
logger.log.info('Running clustering...')
from hdbscan import HDBSCAN
cl = HDBSCAN(min_cluster_size=10)
classes = cl.fit_predict(X)
logger.log.info('Success!')

logger.log.info('Results:')
print classes

The output when run on OSX

/bin/bash: line 1: lscpu: command not found
INFO   | test hdbscan | Current hardware configuration:

INFO   | test hdbscan | Current os configuration:
Darwin crunch.local 14.3.0 Darwin Kernel Version 14.3.0: Mon Mar 23 11:59:05 PDT 2015; root:xnu-2782.20.48~5/RELEASE_X86_64 x86_64

INFO   | test hdbscan | Requirements for HDBSCAN:
cython>=0.22
numpy>=1.9
scipy >= 0.9
scikit-learn>=0.16

INFO   | test hdbscan | Current Python configuration:
cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.1
scikit-learn: 0.17.1

INFO   | test hdbscan | Running clustering...
INFO   | test hdbscan | Success!
INFO   | test hdbscan | Results:
[ 3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
 -1 -1 -1 -1 -1  1  0 -1 -1 -1 -1  3 -1  1 -1  2 -1 -1 -1 -1 -1 -1 -1 -1  2

Great! Now let's try it on EC2...

The output when run on Linux

INFO   | test hdbscan | Current hardware configuration:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2500.058
BogoMIPS:              5000.11
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-3

INFO   | test hdbscan | Current os configuration:
Linux 44f4d4de5575 3.13.0-71-generic #114-Ubuntu SMP Tue Dec 1 02:34:22 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

INFO   | test hdbscan | Requirements for HDBSCAN:
cython>=0.22
numpy>=1.9
scipy >= 0.9
scikit-learn>=0.16

INFO   | test hdbscan | Current Python configuration:
cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.1
scikit-learn: 0.17

INFO   | test hdbscan | Running clustering...
*** Error in `python': munmap_chunk(): invalid pointer: 0x00000000030cef80 ***

This likely isn't a problem with the HDBSCAN package per se, but rather with how Linux vs. OSX allocates/deallocates memory. Unfortunately, I don't have direct access to the Linux boxes/Docker containers to run Valgrind or the like to track down the error; they're deployed by a third-party service, so all I get back is the *** Error in `python': ... *** line above.

One possible solution would be to add a compilation flag that enables memory/bounds checking, though this may affect performance. Would love to hear your thoughts...
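
A minimal sketch of what such a debug build could look like: rebuilding the Cython extensions with runtime checks enabled, so out-of-range accesses raise exceptions instead of silently corrupting the heap. The module name, path, and setup.py below are illustrative assumptions, not the package's actual build configuration:

# hypothetical debug setup.py: rebuild the Cython extensions with runtime
# checks turned on (slower, but makes memory errors fail loudly)
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

extensions = [
    Extension(
        "hdbscan._hdbscan_linkage",            # illustrative module name
        ["hdbscan/_hdbscan_linkage.pyx"],      # path assumed from the repo layout
        include_dirs=[np.get_include()],
    ),
]

setup(
    name="hdbscan-debug-build",
    ext_modules=cythonize(
        extensions,
        compiler_directives={
            "boundscheck": True,        # raise IndexError on out-of-range indexing
            "wraparound": True,         # keep negative indexing checked
            "initializedcheck": True,   # catch use of uninitialized memoryviews
        },
        # note: directives set inline in the .pyx files (decorators or header
        # comments) can take precedence over these global settings
    ),
)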

lmcinnes commented 8 years ago

This looks like a nasty issue -- thanks for the detailed report; it will help me track this down much faster. I believe the problem is in the UnionFind for the single linkage clustering, but I'm not actually sure what has caused the issue there (nothing obvious leaps out, at least). I'll try to get some debugging done this evening when I can get access to a linux box.
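
For context, a union-find (disjoint-set) structure of this kind tracks which cluster each point currently belongs to while edges of the minimum spanning tree are merged into the single-linkage hierarchy. A minimal pure-Python sketch of the idea follows; it is illustrative only, and the package's actual Cython implementation differs in detail:

import numpy as np

class UnionFind(object):
    """Illustrative union-find for single-linkage merging (not the
    package's actual Cython implementation)."""

    def __init__(self, n):
        # each of the n points starts as its own component; every merge
        # allocates a fresh label n, n+1, ... as in a dendrogram
        self.parent = np.arange(2 * n - 1, dtype=np.intp)
        self.next_label = n

    def find(self, x):
        # walk up to the root, then compress the path behind us
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        # merge the two components under a freshly allocated label
        self.parent[self.find(x)] = self.next_label
        self.parent[self.find(y)] = self.next_label
        self.next_label += 1

Each edge of the sorted minimum spanning tree triggers one union() call, so an out-of-bounds write in code along these lines is the kind of heap corruption that glibc's munmap_chunk() check can surface.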

felipemoraes commented 8 years ago

I have the same problem. I switched to version 0.6.4 and now it works.
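
For reference, pinning to that release is just the usual pip version pin (assuming 0.6.4 is available for your platform):

# roll back to the last release that works on this Linux setup
pip install hdbscan==0.6.4
# or pin it in the requirements file baked into the Docker image:
# hdbscan==0.6.4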

lmcinnes commented 8 years ago

Thanks, that's actually helpful right now!

lmcinnes commented 8 years ago

I believe this is working now; let me know if you still see any issues.

dustinstansbury commented 8 years ago

@felipemoraes, thanks for the pointer (pun intended) on version 0.6.4; that got things working for me on Linux. @lmcinnes, it seems the same munmap_chunk() issue persists in version 0.7.1.

lmcinnes commented 8 years ago

That's unfortunate; it seemed to fix the equivalent error I was managing to reproduce on the linux system I was using. I'll look into it further. It seems to involve the way Cython is handling pointers, and I may just have to accept that I can't use the approach I was using because it doesn't work well on linux.

lmcinnes commented 8 years ago

@dustinstansbury If you get some time and can test the current head of the repository, I would be keen to know whether that resolves the problem for you. Thanks.
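
For anyone else wanting to test, installing the current development head straight from the repository is something like the following (repository URL assumed from the links above; adjust if it has moved):

# install the current head of the repo directly from GitHub
pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git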

dustinstansbury commented 8 years ago

@lmcinnes, it seems that Cython continues to exhibit some problems with pointer handling; the issue persists even when running from HEAD.

lmcinnes commented 8 years ago

Alright; thanks for that. It seemed to be working for my linux install, but obviously that's far from universal. 0.6.4 seems to be working for people, so I will try to roll the relevant code back to that and call it done.

dustinstansbury commented 8 years ago

Works for me! Thanks for taking the time to look into it.