API and method return types unification by making Numo a tensor library [ex: Arithmetic overflow]

giuse commented 6 years ago

[EDIT]: What I initially thought being an arithmetic overflow problem revealed itself to be a precise design choice (plus missing features/bugs) which led to the following discussion.

Using specialized data types leads to unexpected arithmetic overflow.

require 'numo/narray'
ary = Numo::UInt8[255, 1] # => Numo::UInt8#shape=[2]  [1, 255]
tot = ary.sum             # => 0

Which I would understand, expect and work with if (upon checking) I found #sum to return a type subject to such overflow, say a Numo::UInt8 (with shape=[1] for example). But instead:

tot.class                 # => Integer

I would suggest the behavior could be found more consistent if either #sum would return a Numo::UInt8 with overflow or an Integer without.

masa16 commented 6 years ago

This issue is open for discussion. Currently I omit overflow check in pursuit of speed. I prefer to raise exception if overflow is detected.

giuse commented 6 years ago

I would not suggest to add overflow checks. I agree on the purpose of speed.

But if #sum returns Integer, I do not expect overflow. If #sum returns Numo::UInt8 (or anyway, same class as the data), with shape=[1], then I expect overflow, and will take care of it.
Current implementation I expect accumulates in a C uint8 and then casts to Integer only when going back to Ruby.

Here is my proposition: two separate methods.

#sum instantiates a Ruby Integer C-type, i.e. the same VALUE that is used in the C implementation of Ruby's Integer. Then it accumulates into it. This is less efficient than staying in Numo, but leverages the C implementation of Integer to take care of these problems (upcast to Bignum, etc.). Returns a Ruby Integer, and behaves as expected in most cases. Use cases: data analysis, checksum, etc.
#typed_sum instantiate a NArray of the same type as self, and accumulates the sum into it. This should be more efficient, and mostly correspond to the current implementation. Users would know the correct type to be returned, and consequently expect overflow problems and such. Use cases: size and speed optimization, fast modulus, etc.

Note that both implementations can still take an optional argument (e.g. #sum(axis=true)) such that the result is done across all or a selection axis.

masa16 commented 6 years ago

With keepdims option, #sum method always returns NArray.

require 'numo/narray'

ary = Numo::UInt8[255, 1]
# => Numo::UInt8#shape=[2]
#    [255, 1]

tot = ary.sum(keepdims:true)
# => Numo::UInt8#shape=[1]
#    [0]

I feel returned type and overflow check are separate issues.

giuse commented 6 years ago

I would not check for overflow.
I feel like when the return type is Integer there should not be any of the problems associated with the original type.

What I wish:

require 'numo/narray'
ary = Numo::UInt8[255, 1] # => Numo::UInt8#shape=[2]  [255, 1]
ary.sum # => 256 (Integer)
ary.sum(keepdims:true) # => Numo::UInt8#shape=[1]  [0]

Discovering that #sum has arithmetic overflow issues even though it returns a Ruby Integer took many hours of debug.

masa16 commented 6 years ago

IMO, every NArray method is "typed". If not so, the design of NArray is too complicated. If you prefer to the behavior of Ruby's Integer, you can write:

Numo::RObject[255, 1].sum
# => 256

giuse commented 6 years ago

If it is "typed" it should return a NArray. Why does it return an Integer?

This way I would know there is overflow, and I can do the cast to Integer myself. Actually I just found out that Numo::UInt8#median returns a Numo::UInt8#shape=[], which is a single number and seems perfect for #sum too.

On methods typing

Before taking a stand on whether "all methods should be typed", please consider whether you want the user to constantly think "I have a Numo::UInt8" (as if programming in C), or be free to think "I have an 8-bit image". IMO, NArray is just the data container and library I am using. When I am busy designing my algorithms, I just think of them as properly stored images.

Instead, my "image":

#sum returns Integer but has arithmetic overflow
Does not offer other basic statistics such as #mean and #stddev
#median returns a Numo::UInt8#shape=[] -- a perfectly sensible choice IMO
#bincount returns a Numo::UInt32 -- which I also consider a great choice.

Why is #sum "special"?

With my utmost respect, Tanaka-san, I have three questions for you:

In order to create a checksum of my image I am currently doing pixels.to_a.reduce :+. Are you OK with this?
Alternatively, you suggested I should load my (hundred of thousands of) 8-bit images as Numo::RObject, incurring in severe performance issues, just so I can have #sum. Are you OK with this?
In order to know the average intensity I am currently doing pixels.cast_to(Numo::DFloat).mean, casting the whole image just because I want the average value to be a Float. Are you OK with this?

Suggestion: regardless of the specialized NArray type,

All methods that access the content should be typed to the content type
All methods that provide statistics on the content should be available, and typed to a sensible representation of the statistic, e.g.:
- #bincount can return an unsigned int of proper bitsize
- #sum can return a properly-sized upcast of the correct type
  - e.g. Numo::UInt8#sum can return a Numo::UInt32#shape=[], Numo::Int8#sum a Numo::Int32#shape=[], etc
  - particularly, returning a Numo::NArray correctly informs the user that all shenanigans coming from explicit type handling, such as arithmetic overflow, will not be handled
- #mean and #stddev can return a Numo::DFloat#shape=[].
  - There are three choices to answer a user that calls Numo::UInt8#mean:
    1. not offer the function at all
    2. return a UInt8 and hence lose precision
    3. return a DFloat and hence lose the type.
  - My opinion is that:
    1. is the worst because it limits the functionality, requiring the user to cast (potentially huge) data to a different type just to grab a statistic.
    2. is premature optimization: statistics are by definition few numbers describing large amount of numbers, there is no need for them to be optimized in type/space.
    3. would be immediately obvious to the user that it is the highest casting type, would present high precision, and the user can still cast it back to Numo::UInt8 as/if necessary

I believe this design would lead not to more complexity, but to a much simpler code, since many methods can share implementation and provide a consistent interface across data types.

While one can always debate about pure-design choices, I believe these changes would make Numo a more intuitive and vastly more useful library for real-world use. And it is its usability that made me select it as the foundation of my current work and research.

masa16 commented 6 years ago

Thank you for raising issues that should be considered. I am sorry not providing enough document to inform specification.

I have not expected this behavior (i.e. bug):

Numo::UInt8[1,2,4,5].median
# => Numo::UInt8#shape=[]
# 3

This is a zero-dimensional array to express a scalar and is not intended to be exposed to user. In current design, this should be converted to an Integer. If NArray#sum is changed to return a zero-dimensional array, a[0] also returns a zero-dimensional array for consistency.

I still feel this issue is not related with overflow problem. Even if UInt8#sum implementation is replaced with Ruby's Integer method, I expect the performance is same as Array#sum or RObject#sum. Although conversion from UInt8 to RObject or DFloattakes extra memory, I am not providing inter-type implementation such as UInt8 + DFloat because it requires N*N implementation ( N is the number of NArray types). It is possible that only #sum has N*N implementation.

giuse commented 6 years ago

Thank you for taking this issue under consideration. I sincerely care for Numo, I hope my enthusiasm correctly transpires as my fullest and most serious support.

I think using zero-dimensional arrays to express scalars is an elegant solution from a mathematical standpoint: it would make NArray officially a tensor library.

Scalars are single numbers and are thus 0th-order tensors. https://en.wikipedia.org/wiki/Tensor

With TensorFlow still growing in popularity, I believe adding Numo.const_set('Tensor', Numo::NArray) could be a good marketing stunt :) just kidding.

Jokes aside, in my opinion it would fit well with the methods returning typed results, delegating the final casting to the user is explicit and nice. I understand it is a bug under the current design, but without knowing the underlying design, from a purely user perspective, I felt #median behaved as expected and #sum (with what seemed like an Integer suffering from overflow`) did not. Of course this is my personal perception, but I would expect it to be closer for most users, having not knowing of the interior design.

Now that you shared the whole design decision, I agree the (github-)issue has a larger scope than an overflow problem; the title is not relevant anymore. What would you suggest is a better one? Meanwhile I will change it to something closer to the heart of my suggestion.

As per the number of implementations: my suggestion would be to

Implement three methods in NArray: #sum, #mean and #stddev
Add one pair of constants to each type, i.e. 2n constants for n types
First constant is type for accumulators: would identify the largest available accumulator, depending on the base type being UInt/Int/Float/Complex
Second constant is type for statistics: DFloat for all types but complex, for which DComplex is used
It should work because C has implicit up-cast depending on the type of the accumulator/return
I guess Numo::Bit could use a #remove_method, but I would leave the methods there to allow maximum flexibility to future users (which may use Numo::Bit for purposes other than boolean arrays)
Note #median would already be working correctly in my suggestion, so that's 4 methods available

What do you think?
Sorry for talking so much and doing so little, I wish I had the skills to contribute a pull request myself (or the time to learn Ruby C). It still might happen in the future :)

ruby-numo / numo-narray

API and method return types unification by making Numo a tensor library [ex: Arithmetic overflow] #91