piskvorky / bounter

Efficient Counter that uses a limited (bounded) amount of memory regardless of data size.
MIT License
935 stars 47 forks source link

Python uses signed integer to address cms table #34

Closed tjbookreader closed 5 years ago

tjbookreader commented 6 years ago

Description

The underlying C code uses unsigned 32-bit integers for the width and depth indices of the count min sketch table. However, Python is using signed 32-bit integers. This means that if you create a 16GB count min sketch with the default depth = 8 and log_counting = 8, then the width should be 16B / 8 / 1 = 2^32. But doing so causes an OverflowError.

Steps/Code/Corpus to Reproduce

from bounter import bounter
b = bounter(size_mb=16384, need_iteration=False, log_counting=8)

Expected Results

Expect to create the largest bounter possible without error.

Actual Results

The following error occurs.

bounter.py in bounter(size_mb, need_iteration, need_counts, log_counting)
     37         return HashTable(size_mb=size_mb)
     38     else:
---> 39         return CountMinSketch(size_mb=size_mb, log_counting=log_counting)
count_min_sketch.py in __init__(self, size_mb, width, depth, log_counting)
     99 
    100         if log_counting == 8:
--> 101             self.cms = cmsc.CMS_Log8(width=self.width, depth=self.depth)
    102         elif log_counting == 1024:
    103             self.cms = cmsc.CMS_Log1024(width=self.width, depth=self.depth)
OverflowError: signed integer is greater than maximum

Possible Fix

Looking at cbounter/cms_common.c the problem seems to be from Line 65 in the PyArg_ParseTupleAndKeywords(args, kwds, 'ii', kwlist, &w, &self->depth). The 'ii' tells python to use a signed integer for &w and &self->depth. Switching the lowercase 'ii' to uppercase 'II' would tell python to use an unsigned integer.

This formatting of 'ii' also appears in Line 381 Py_BuildValue("(ii)", self->width, self->depth).

Versions

Linux-3.10.0-693.21.1.el7.x86_64-x86_64-with-centos-7.4.1708-Core Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 18:21:58) [GCC 7.2.0] NumPy 1.14.2 bounter 1.0.1

piskvorky commented 6 years ago

@tjbookreader it sounds like you have a good idea of where the bug is and how to fix it. Can you open a PR with the fix?

This repo is hooked to Travis, which will run all tests automatically.

tjbookreader commented 6 years ago

I'm very new to GitHub. I believe that I have created the PR. Let me know if that worked. Thanks.