waylonflinn opened 9 years ago
Hi waylonflinn,
Overflow checking is deactivated by default in Cython code; I guess this might be the cause of this issue. If you could provide an example, that would be great.
Thanks for building an amazingly fast library on top of bcolz! I'll try to put together a simple example that illustrates the problem.
This library was possible thanks to @CarstVaartjes & @esc. Glad to see you like it :) and looking forward to your example
Okay, here's a minimal working example that reproduces the problem.
The following code creates a table with two columns, one of which is an `int64`. It adds two rows: the first has the maximum signed 64-bit integer (9,223,372,036,854,775,807) for its value, the second a one (1).
It then groups by the category column and sums the integer column over these two rows, producing the minimum signed 64-bit integer (−9,223,372,036,854,775,808) and demonstrating integer overflow.
```python
import numpy as np
import bquery

# create a table with a string category column and an int64 sum column
SIGNED_INT_MAX = 9223372036854775807

dtypes = [('test_category', 'S3'), ('test_sum_column', 'i8')]
tuples = [
    ('foo', SIGNED_INT_MAX),
    ('foo', 1)]

data = np.fromiter(tuples, dtype=dtypes, count=2)
overflow_example_table = bquery.ctable(data)

# sum the integer column, grouped by category
overflow_result = overflow_example_table.groupby(['test_category'], ['test_sum_column'],
                                                 agg_method='sum')
```

`overflow_result[0]` looks like this:

```
(b'foo', -9223372036854775808)
```
Hi,
Thanks for your example, I get exactly the same results.
Setting `overflowcheck` to `True` should do the trick, but our actual code base raised no Exception, which is weird. It seems `cython.overflowcheck` does not work for compound statements like `j += k`. For this to work some refactoring would be needed: rewriting `out_buffer[current_index] += in_buffer[i]` as `out_buffer[current_index] = out_buffer[current_index] + in_buffer[i]`.
Setting this feature to `True` will incur a run-time penalty; in my opinion it is something that should be added, but it will require some discussion.
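For illustration only, here is a minimal sketch of that refactoring in Cython's pure-Python mode; the function and buffer names are made up and the real bquery aggregation kernel is more involved, but it shows how the expanded assignment lets the directive guard the addition once the module is compiled with Cython:

```python
import cython


@cython.overflowcheck(True)
def sum_into(out_buffer: cython.longlong[:], in_buffer: cython.longlong[:],
             current_index: cython.Py_ssize_t) -> None:
    i: cython.Py_ssize_t
    for i in range(in_buffer.shape[0]):
        # expanded form of `out_buffer[current_index] += in_buffer[i]`;
        # with the addition written out explicitly, the overflow check
        # applies and an OverflowError is raised instead of the result
        # silently wrapping around
        out_buffer[current_index] = out_buffer[current_index] + in_buffer[i]
```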
I'm not sure what the magnitude of the performance penalty is, but I have a couple of alternative solutions I'd like to throw into the mix.
I ran into this issue when I was doing sums on an `int32` column, rather than an `int64`. It's overflows at that field size that concern me most. There are two alternative solutions to that problem that I see:

1. Use `int64` for all calculations
2. Use `float32` or `float64` for all calculations

My workaround in the situation I mentioned was to (implicitly) use option 2 above. I first created a new column with a `float64` type and copied the data into it. Since my eventual goal was to calculate the mean anyway, this was the natural choice.
Both of these solutions probably have performance penalties of their own. I'm curious about how they compare to one another.
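A rough sketch of that kind of workaround, reusing `overflow_example_table` from the reproduction above (the new column name `test_sum_column_f8` is made up; `addcol` comes from the underlying bcolz ctable):

```python
import numpy as np

# copy the integer column into a new float64 column so sums accumulate
# in floating point instead of wrapping around at the int64 boundary
float_values = overflow_example_table['test_sum_column'][:].astype(np.float64)
overflow_example_table.addcol(float_values, name='test_sum_column_f8')

# aggregate on the float64 copy instead of the original integer column
float_result = overflow_example_table.groupby(['test_category'], ['test_sum_column_f8'],
                                              agg_method='sum')
```

Note that `float64` only represents integers exactly up to 2^53, so very large sums become approximate rather than wrapping; for computing a mean that is usually acceptable.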
I think we should run some performance tests to find out how much it is really affected. The int64 and float variants are more user-side workaround options, I think. Most importantly, silent failures are not great :)
When using the sum aggregation on a column with an integer datatype, overflows can occur. No warning or error is generated when this happens.