visualfabriq / bquery

A query and aggregation framework for Bcolz (W2013-01)
https://www.visualfabriq.com
BSD 3-Clause "New" or "Revised" License
56 stars 11 forks source link

Overflow in Int Column Fails Silently #41

Open waylonflinn opened 9 years ago

waylonflinn commented 9 years ago

When using the sum aggregation on a column with an integer datatype, overflows can occur. No warning or error is generated when this happens.

FrancescElies commented 9 years ago

Hi waylonflinn,

overflow-check is per default in cython code deactivated, I guess this might be cause for this issue. If you could provide an example that would be great.

waylonflinn commented 9 years ago

Thanks for building an amazingly fast library on top of bcolz! I'll try to put together a simple example that illustrates the problem.

FrancescElies commented 9 years ago

This library was possible thanks to @CarstVaartjes & @esc. Glad to see you like it :) and looking forward to your example

waylonflinn commented 9 years ago

Okay here's a minimum functional example that reproduces the problem.

The following code creates a table with two columns, one of which is an int64. It adds two rows. The first has the maximum signed integer (9,223,372,036,854,775,807) for it's value. The second, a one (1).

It then sums over these two columns producing the minimum signed integer as a result (−9,223,372,036,854,775,808), demonstrating integer overflow.

# create table with two ints

SIGNED_INT_MAX = 9223372036854775807

dtypes = [('test_category', 'S3'), ('test_sum_column', 'i8')]
tuples = [
('foo', SIGNED_INT_MAX), 
('foo', 1)]

data = np.fromiter(tuples, dtype=dtypes, count=2)
overflow_example_table = bquery.ctable(data)

# sum column of ints
overflow_result = overflow_example_table.groupby(['test_category'], ['test_sum_column'],
                                                 agg_method='sum')

overflow_result[0] looks like this:

(b'foo', -9223372036854775808)
FrancescElies commented 9 years ago

Hi,

Thanks for your example, I get exactly the same results. Setting overflowcheck to True should do the trick, but our actual code base raised no Exception, weird.

It seems cython.overflowcheck does not work for compound statements like j += k For this to work some refactoring would be needed: out_buffer[current_index] += in_buffer[i] as out_buffer[current_index] = out_buffer[current_index] + in_buffer[i].

Setting this feature to True will incur a run-time penalty, in my opinion is something that should be added but it will require some discussion.

waylonflinn commented 9 years ago

Not sure what the magnitude of the performance penalties are, but I have a couple of alternative solutions I'd like to throw into the mix.

I ran into this issue when I was doing sums on an int32 column, rather than an int64. It's overflows on that field size that concern me most. There are two alternate solutions for that problem that I see:

  1. use int64 for all calculations
  2. use float32 or float64 for all calculations

My workaround in the situation I mentioned was to (implicity) use option 2 above. I first created a new column with a float64 type and copied that data into that. Since my eventual goal was to calculate the mean, anyway, this was the natural choice.

Both of these solutions probably have performance penalties of their own. I'm curious about how they compare to one another.

CarstVaartjes commented 9 years ago

i think we should try some performance tests to find out how much it really is affected. the int64 and float variants are more user options for workarounds I think. most important that silents fails are not great :)