scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

C++ Memory error on Linux/Docker #25

Closed · dustinstansbury closed this issue 8 years ago

dustinstansbury commented 8 years ago

First off, thanks for your work on the implementation of the algorithm; it's excellent. I do most of my development on OSX, and HDBSCAN has worked like a charm.

However, I've recently been trying to deploy some models that require the package on EC2 instances and keep getting segmentation faults from glibc's munmap_chunk(). As an example, below is the output of the same test script run first on my development machine and then on the Linux box:

Test Script

import numpy as np
import cython as ct
import scipy as sp
import sklearn as sk

import subprocess

# get machine info
proc_cpu = subprocess.Popen(['/bin/bash'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
cpu_info = proc_cpu.communicate('lscpu')[0]
proc_os = subprocess.Popen(['/bin/bash'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
os_info = proc_os.communicate('uname -a')[0]

# custom logger
from crunch.logger import Logger 
logger = Logger('test hdbscan')

# report info on current machine
logger.log.info('Current hardware configuration:\n%s' % cpu_info)
logger.log.info('Current os configuration:\n%s' % os_info)

# display requirements and current python distros
requirements = """
cython>=0.22
numpy>=1.9
scipy >= 0.9
scikit-learn>=0.16
"""
logger.log.info('Requirements for HDBSCAN:\n%s' % requirements)

current_config = """
cython: %s
numpy: %s
scipy: %s
scikit-learn: %s
""" % (ct.__version__, np.__version__, sp.__version__, sk.__version__)
logger.log.info('Current Python configuration:\n%s' % current_config)

# generate cluster data
np.random.seed(4711)  # for repeatability 
sample_size = 50
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[sample_size,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[sample_size,])
c = np.random.multivariate_normal([40, 40], [[20, 1], [1, 30]], size=[sample_size,])
d = np.random.multivariate_normal([80, 80], [[30, 1], [1, 30]], size=[sample_size,])
e = np.random.multivariate_normal([0, 100], [[100, 1], [1, 100]], size=[sample_size,])
f = 300. * (np.random.rand(sample_size,2) - .5)

# cast data as different dtypes (doesn't matter)
X = np.asarray(np.concatenate((a, b, c, d, e, f),), dtype=np.float32) 

# attempt to run clustering
logger.log.info('Running clustering...')
from hdbscan import HDBSCAN
cl = HDBSCAN(min_cluster_size=10)
classes = cl.fit_predict(X)
logger.log.info('Success!')

logger.log.info('Results:')
print classes

The output when run on OSX

/bin/bash: line 1: lscpu: command not found
INFO   | test hdbscan | Current hardware configuration:

INFO   | test hdbscan | Current os configuration:
Darwin crunch.local 14.3.0 Darwin Kernel Version 14.3.0: Mon Mar 23 11:59:05 PDT 2015; root:xnu-2782.20.48~5/RELEASE_X86_64 x86_64

INFO   | test hdbscan | Requirements for HDBSCAN:
cython>=0.22
numpy>=1.9
scipy >= 0.9
scikit-learn>=0.16

INFO   | test hdbscan | Current Python configuration:
cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.1
scikit-learn: 0.17.1

INFO   | test hdbscan | Running clustering...
INFO   | test hdbscan | Success!
INFO   | test hdbscan | Results:
[ 3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
 -1 -1 -1 -1 -1  1  0 -1 -1 -1 -1  3 -1  1 -1  2 -1 -1 -1 -1 -1 -1 -1 -1  2

Great! Now let's try it on EC2...

The output when run on Linux

INFO   | test hdbscan | Current hardware configuration:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2500.058
BogoMIPS:              5000.11
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-3

INFO   | test hdbscan | Current os configuration:
Linux 44f4d4de5575 3.13.0-71-generic #114-Ubuntu SMP Tue Dec 1 02:34:22 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

INFO   | test hdbscan | Requirements for HDBSCAN:
cython>=0.22
numpy>=1.9
scipy >= 0.9
scikit-learn>=0.16

INFO   | test hdbscan | Current Python configuration:
cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.1
scikit-learn: 0.17

INFO   | test hdbscan | Running clustering...
*** Error in `python': munmap_chunk(): invalid pointer: 0x00000000030cef80 ***

This likely isn't a problem with the HDBSCAN package per se, but rather with how Linux vs. OSX allocates/deallocates memory. Unfortunately, I don't have direct access to the Linux boxes/Docker containers to run Valgrind or the like to track down the error; they're deployed by a third-party service, so all I get back is the *** Error in `python': ... *** line above.

One possible solution would be to add a compilation flag that enables memory/bounds checking, though this may affect performance. Would love to hear your thoughts...
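
A minimal sketch of what such a debug build could look like: rebuilding the Cython extensions with runtime checks enabled, so out-of-range accesses raise exceptions instead of silently corrupting the heap. The module name, path, and setup.py below are illustrative assumptions, not the package's actual build configuration:

# hypothetical debug setup.py: rebuild the Cython extensions with runtime
# checks turned on (slower, but makes memory errors fail loudly)
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

extensions = [
    Extension(
        "hdbscan._hdbscan_linkage",            # illustrative module name
        ["hdbscan/_hdbscan_linkage.pyx"],      # path assumed from the repo layout
        include_dirs=[np.get_include()],
    ),
]

setup(
    name="hdbscan-debug-build",
    ext_modules=cythonize(
        extensions,
        compiler_directives={
            "boundscheck": True,        # raise IndexError on out-of-range indexing
            "wraparound": True,         # keep negative indexing checked
            "initializedcheck": True,   # catch use of uninitialized memoryviews
        },
        # note: directives set inline in the .pyx files (decorators or header
        # comments) can take precedence over these global settings
    ),
)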

lmcinnes commented 8 years ago

This looks like a nasty issue -- thanks for the detailed report; it will help me track this down much faster. I believe the problem is in the UnionFind for the single linkage clustering, but I'm not actually sure what has caused the issue there (nothing obvious leaps out, at least). I'll try to get some debugging done this evening when I can get access to a linux box.
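
For context, a union-find (disjoint-set) structure of this kind tracks which cluster each point currently belongs to while edges of the minimum spanning tree are merged into the single-linkage hierarchy. A minimal pure-Python sketch of the idea follows; it is illustrative only, and the package's actual Cython implementation differs in detail:

import numpy as np

class UnionFind(object):
    """Illustrative union-find for single-linkage merging (not the
    package's actual Cython implementation)."""

    def __init__(self, n):
        # each of the n points starts as its own component; every merge
        # allocates a fresh label n, n+1, ... as in a dendrogram
        self.parent = np.arange(2 * n - 1, dtype=np.intp)
        self.next_label = n

    def find(self, x):
        # walk up to the root, then compress the path behind us
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        # merge the two components under a freshly allocated label
        self.parent[self.find(x)] = self.next_label
        self.parent[self.find(y)] = self.next_label
        self.next_label += 1

Each edge of the sorted minimum spanning tree triggers one union() call, so an out-of-bounds write in code along these lines is the kind of heap corruption that glibc's munmap_chunk() check can surface.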

felipemoraes commented 8 years ago

I have the same problem. I switched to version 0.6.4 and now it works.
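
For reference, pinning to that release is just the usual pip version pin (assuming 0.6.4 is available for your platform):

# roll back to the last release that works on this Linux setup
pip install hdbscan==0.6.4
# or pin it in the requirements file baked into the Docker image:
# hdbscan==0.6.4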

lmcinnes commented 8 years ago

Thanks, that's actually helpful right now!

lmcinnes commented 8 years ago

I believe this is working now; let me know if you still see any issues.

dustinstansbury commented 8 years ago

@felipemoraes, thanks for the pointer (pun intended) on version 0.6.4; that got things working for me on Linux. @lmcinnes, it seems the same munmap_chunk() issue persists in version 0.7.1.

lmcinnes commented 8 years ago

That's unfortunate; it seemed to fix the equivalent error I was managing to reproduce on the linux system I was using. I'll look into it further. It seems to involve the way Cython is handling pointers, and I may just have to accept that I can't use the approach I was using because it doesn't work well on linux.

lmcinnes commented 8 years ago

@dustinstansbury If you get some time and can test the current head of the repository, I would be keen to know whether that resolves the problem for you. Thanks.
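
For anyone else wanting to test, installing the current development head straight from the repository is something like the following (repository URL assumed from the links above; adjust if it has moved):

# install the current head of the repo directly from GitHub
pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git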

dustinstansbury commented 8 years ago

@lmcinnes, it seems that Cython continues to exhibit some problems with pointer handling; the issue persists even when running from HEAD.

lmcinnes commented 8 years ago

Alright; thanks for that. It seemed to be working for my linux install, but obviously that's far from universal. 0.6.4 seems to be working for people, so I will try to roll the relevant code back to that and call it done.

dustinstansbury commented 8 years ago

Works for me! Thanks for taking the time to look into it.