Open giuse opened 6 years ago
This issue is open for discussion. Currently I omit overflow check in pursuit of speed. I prefer to raise exception if overflow is detected.
I would not suggest to add overflow checks. I agree on the purpose of speed.
But if #sum
returns Integer
, I do not expect overflow. If #sum
returns Numo::UInt8
(or anyway, same class as the data), with shape=[1]
, then I expect overflow, and will take care of it.
Current implementation I expect accumulates in a C uint8
and then casts to Integer
only when going back to Ruby.
Here is my proposition: two separate methods.
#sum
instantiates a Ruby Integer
C-type, i.e. the same VALUE
that is used in the C implementation of Ruby's Integer
. Then it accumulates into it. This is less efficient than staying in Numo
, but leverages the C implementation of Integer
to take care of these problems (upcast to Bignum
, etc.). Returns a Ruby Integer
, and behaves as expected in most cases. Use cases: data analysis, checksum, etc.#typed_sum
instantiate a NArray
of the same type as self
, and accumulates the sum into it. This should be more efficient, and mostly correspond to the current implementation. Users would know the correct type to be returned, and consequently expect overflow problems and such. Use cases: size and speed optimization, fast modulus, etc.Note that both implementations can still take an optional argument (e.g. #sum(axis=true)
) such that the result is done across all or a selection axis.
With keepdims
option, #sum
method always returns NArray
.
require 'numo/narray'
ary = Numo::UInt8[255, 1]
# => Numo::UInt8#shape=[2]
# [255, 1]
tot = ary.sum(keepdims:true)
# => Numo::UInt8#shape=[1]
# [0]
I feel returned type and overflow check are separate issues.
I would not check for overflow.
I feel like when the return type is Integer
there should not be any of the problems associated with the original type.
What I wish:
require 'numo/narray'
ary = Numo::UInt8[255, 1] # => Numo::UInt8#shape=[2] [255, 1]
ary.sum # => 256 (Integer)
ary.sum(keepdims:true) # => Numo::UInt8#shape=[1] [0]
Discovering that #sum
has arithmetic overflow issues even though it returns a Ruby Integer
took many hours of debug.
IMO, every NArray method is "typed". If not so, the design of NArray is too complicated. If you prefer to the behavior of Ruby's Integer, you can write:
Numo::RObject[255, 1].sum
# => 256
If it is "typed" it should return a NArray
. Why does it return an Integer
?
This way I would know there is overflow, and I can do the cast to Integer
myself. Actually I just found out that Numo::UInt8#median
returns a Numo::UInt8#shape=[]
, which is a single number and seems perfect for #sum
too.
On methods typing
Before taking a stand on whether "all methods should be typed", please consider whether you want the user to constantly think "I have a Numo::UInt8
" (as if programming in C), or be free to think "I have an 8-bit image". IMO, NArray
is just the data container and library I am using. When I am busy designing my algorithms, I just think of them as properly stored images.
Instead, my "image":
#sum
returns Integer
but has arithmetic overflow#mean
and #stddev
#median
returns a Numo::UInt8#shape=[]
-- a perfectly sensible choice IMO#bincount
returns a Numo::UInt32
-- which I also consider a great choice.Why is #sum
"special"?
With my utmost respect, Tanaka-san, I have three questions for you:
pixels.to_a.reduce :+
. Are you OK with this? Numo::RObject
, incurring in severe performance issues, just so I can have #sum
. Are you OK with this?pixels.cast_to(Numo::DFloat).mean
, casting the whole image just because I want the average value to be a Float
. Are you OK with this?Suggestion: regardless of the specialized NArray
type,
#bincount
can return an unsigned int of proper bitsize#sum
can return a properly-sized upcast of the correct type
Numo::UInt8#sum
can return a Numo::UInt32#shape=[]
, Numo::Int8#sum
a Numo::Int32#shape=[]
, etcNumo::NArray
correctly informs the user that all shenanigans coming from explicit type handling, such as arithmetic overflow, will not be handled#mean
and #stddev
can return a Numo::DFloat#shape=[]
.
Numo::UInt8#mean
:
UInt8
and hence lose precisionDFloat
and hence lose the type.Numo::UInt8
as/if necessaryI believe this design would lead not to more complexity, but to a much simpler code, since many methods can share implementation and provide a consistent interface across data types.
While one can always debate about pure-design choices, I believe these changes would make Numo
a more intuitive and vastly more useful library for real-world use. And it is its usability that made me select it as the foundation of my current work and research.
Thank you for raising issues that should be considered. I am sorry not providing enough document to inform specification.
I have not expected this behavior (i.e. bug):
Numo::UInt8[1,2,4,5].median
# => Numo::UInt8#shape=[]
# 3
This is a zero-dimensional array to express a scalar and is not intended to be exposed to user. In current design, this should be converted to an Integer. If NArray#sum
is changed to return a zero-dimensional array, a[0]
also returns a zero-dimensional array for consistency.
I still feel this issue is not related with overflow problem. Even if UInt8#sum
implementation is replaced with Ruby's Integer method, I expect the performance is same as Array#sum
or RObject#sum
. Although conversion from UInt8
to RObject
or DFloat
takes extra memory, I am not providing inter-type implementation such as UInt8 + DFloat
because it requires N*N
implementation ( N
is the number of NArray types). It is possible that only #sum
has N*N
implementation.
Thank you for taking this issue under consideration. I sincerely care for Numo, I hope my enthusiasm correctly transpires as my fullest and most serious support.
I think using zero-dimensional arrays to express scalars is an elegant solution from a mathematical standpoint: it would make NArray
officially a tensor library.
Scalars are single numbers and are thus 0th-order tensors. https://en.wikipedia.org/wiki/Tensor
With TensorFlow still growing in popularity, I believe adding Numo.const_set('Tensor', Numo::NArray)
could be a good marketing stunt :) just kidding.
Jokes aside, in my opinion it would fit well with the methods returning typed results, delegating the final casting to the user is explicit and nice. I understand it is a bug under the current design, but without knowing the underlying design, from a purely user perspective, I felt #median
behaved as expected and #sum
(with what seemed like an Integer
suffering from overflow`) did not. Of course this is my personal perception, but I would expect it to be closer for most users, having not knowing of the interior design.
Now that you shared the whole design decision, I agree the (github-)issue has a larger scope than an overflow problem; the title is not relevant anymore. What would you suggest is a better one? Meanwhile I will change it to something closer to the heart of my suggestion.
As per the number of implementations: my suggestion would be to
NArray
: #sum
, #mean
and #stddev
UInt
/Int
/Float
/Complex
DFloat
for all types but complex, for which DComplex
is usedNumo::Bit
could use a #remove_method
, but I would leave the methods there to allow maximum flexibility to future users (which may use Numo::Bit
for purposes other than boolean arrays)#median
would already be working correctly in my suggestion, so that's 4 methods availableWhat do you think?
Sorry for talking so much and doing so little, I wish I had the skills to contribute a pull request myself (or the time to learn Ruby C). It still might happen in the future :)
[EDIT]: What I initially thought being an arithmetic overflow problem revealed itself to be a precise design choice (plus missing features/bugs) which led to the following discussion.
Using specialized data types leads to unexpected arithmetic overflow.
Which I would understand, expect and work with if (upon checking) I found
#sum
to return a type subject to such overflow, say aNumo::UInt8
(withshape=[1]
for example). But instead:I would suggest the behavior could be found more consistent if either
#sum
would return aNumo::UInt8
with overflow or anInteger
without.