jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0

Add arithmetics of Decimals #27

Closed jorgecarleitao closed 3 years ago

jorgecarleitao commented 3 years ago

E.g. add(Decimal, Decimal).

elferherrera commented 3 years ago

Hi Jorge. I was going through the code thinking about these arithmetics. I would guess the arithmetics have to consider the precision and scale of the numbers, right? For example, if we have a Decimal(5,2) (precision 5 and scale 2), the maximum number that can exist for that primitive is 999.99. Therefore, adding 555.55 + 555.55 should return 999.99. Is that correct? Because right now, if you add those two numbers, you get 1111.1, which violates the selected precision and scale.
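For concreteness, here is a minimal sketch of that overflow, assuming the usual Arrow representation of decimals as scaled integers (555.55 with scale 2 is stored as the integer 55555; `max_for_precision` is a hypothetical helper, not an arrow2 API):

```rust
// Decimals as scaled i128 integers: 555.55 with scale 2 is stored as 55555.
// A Decimal(5, 2) can hold at most 99999, i.e. 999.99.

// Hypothetical helper: largest scaled value a given precision can hold.
fn max_for_precision(precision: u32) -> i128 {
    10i128.pow(precision) - 1
}

fn main() {
    let lhs = 55555i128; // 555.55
    let rhs = 55555i128; // 555.55
    let sum = lhs + rhs; // 111110, i.e. 1111.10
    // the plain sum no longer fits in Decimal(5, 2)
    assert!(sum > max_for_precision(5));
}
```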

jorgecarleitao commented 3 years ago

Hi @elferherrera ! Very good points, to which I do not have a good answer. My initial thought would be to check how postgres does it.

One idea is to add different functions for different cases: e.g. checked_add, which writes a null when it overflows; saturating_add, which saturates; add, which panics; and "adaptive_add", which increases the precision when overflowing.

I was hoping that each of these variations would re-use most of the code.
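As a sketch of how the variants could behave on a single slot (scaled i128 values as above; the function names follow the proposal, but these signatures are hypothetical):

```rust
// Does a scaled value exceed what the precision allows?
fn overflows(v: i128, precision: u32) -> bool {
    v.abs() > 10i128.pow(precision) - 1
}

// checked: signal overflow with None (a null slot in the array).
fn checked_add(l: i128, r: i128, precision: u32) -> Option<i128> {
    let sum = l + r;
    if overflows(sum, precision) { None } else { Some(sum) }
}

// saturating: clamp at the largest representable value.
fn saturating_add(l: i128, r: i128, precision: u32) -> i128 {
    let max = 10i128.pow(precision) - 1;
    (l + r).clamp(-max, max)
}

// adaptive: grow the precision until the sum fits.
fn adaptive_add(l: i128, r: i128, mut precision: u32) -> (i128, u32) {
    let sum = l + r;
    while overflows(sum, precision) {
        precision += 1;
    }
    (sum, precision)
}

fn main() {
    // 555.55 + 555.55 under Decimal(5, 2):
    assert_eq!(checked_add(55555, 55555, 5), None);         // null
    assert_eq!(saturating_add(55555, 55555, 5), 99999);     // 999.99
    assert_eq!(adaptive_add(55555, 55555, 5), (111110, 6)); // 1111.10 at precision 6
}
```

The panicking `add` would simply `panic!` where `checked_add` returns `None`, mirroring how the standard integer operators behave in debug builds.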

elferherrera commented 3 years ago

I've been reading some of the rules for these operations and there are more cases like that. For example, what would happen if we add two arrays with different precisions and scales? Let's say Decimal(4,3) + Decimal(5,2); then the resulting array should be Decimal(6,3) to accommodate all the new numbers and decimals.
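The rule described above can be written down directly; this is a sketch of one possible convention (note that a fully conservative rule would reserve one extra digit for the carry):

```rust
// Result type of Decimal(p1, s1) + Decimal(p2, s2): keep the larger scale
// and enough integer digits for the wider operand.
fn add_result_type(p1: u32, s1: u32, p2: u32, s2: u32) -> (u32, u32) {
    let scale = s1.max(s2);
    let integer_digits = (p1 - s1).max(p2 - s2);
    (integer_digits + scale, scale)
}

fn main() {
    // Decimal(4, 3) + Decimal(5, 2) -> Decimal(6, 3)
    assert_eq!(add_result_type(4, 3, 5, 2), (6, 3));
}
```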

I see what you mean with the different functions, but how will you select when to apply them? I'm wondering if for each number in the array you will have to check for these cases and apply the required one.

jorgecarleitao commented 3 years ago

I think that as long as we are consistent, it should not be a problem. We already panic with other types atm: e.g. if any of the slots of an Int32 is i32::MAX and we add 1 on the other side, it panics.

My thinking would be that arrow2 should not have to take a decision on which type to apply; the problem-specific applications would decide. For example, DataFusion would use the one closest to postgres, Polars could use another variation, etc. A bit like Rust does: it offers checked_add, add, saturating_add, etc., and leaves each user to pick the one that fits their use-case.
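The standard library variants mentioned above behave exactly this way on the integer types:

```rust
fn main() {
    let x = i32::MAX;

    // checked_add: signals overflow with None (the "write a null" case)
    assert_eq!(x.checked_add(1), None);

    // saturating_add: clamps at the boundary
    assert_eq!(x.saturating_add(1), i32::MAX);

    // wrapping_add: wraps around; plain `+` would panic in debug builds
    assert_eq!(x.wrapping_add(1), i32::MIN);
}
```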

The "adaptive" case is a bit tricky in DataFusion atm because it assumes that the logical schema remains constant throughout the application. My hypothesis is that the user would cast the decimal to a larger precision if they feel that the operation would overflow at the current precision.

elferherrera commented 3 years ago

At the moment there is no Decimal in Rust Arrow, right?

jorgecarleitao commented 3 years ago

I think that there is ongoing work to support it here.

elferherrera commented 3 years ago

Coming back with another question about the arithmetics module. The arithmetic function takes references to dyn Array, which would mean that in the future these arithmetics could be performed on a dictionary or a struct. However, that will panic when it is downcast to a PrimitiveArray.

Don't you think it will be better to specify that the arguments for the function should be PrimitiveArray from the beginning?

jorgecarleitao commented 3 years ago

The general design in the crate has been: offer both untyped and typed APIs, so that users can decide which one they prefer. The untyped API is less amenable to optimizations but easier to use; the typed API benefits from help from the compiler with respect to type checking.

The idea would be to extend that function to other logical types. E.g. Primitive + Dictionary makes sense when rhs is a dictionary of primitives with the same logical type. Same for Float32 + Decimal. We can extend this via

```rust
match (lhs.data_type(), rhs.data_type()) {
}
```

instead of requiring both sides to be equal and using match lhs.data_type() {}.

The typed API is arithmetic_primitive (currently not pub, though xD).

The other idea is to offer a second function can_arithmetic() -> bool, that declares the valid combinations. We do this in cast (can_cast).
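A self-contained sketch of what `can_arithmetic` could look like, with a toy `DataType` enum standing in for arrow2's (the variant list is abbreviated and the accepted combinations are illustrative):

```rust
#[derive(Clone, Copy)]
enum DataType {
    Int32,
    Float32,
    Decimal(usize, usize), // precision, scale
}

// Declares which (lhs, rhs) combinations the kernel accepts.
fn can_arithmetic(lhs: DataType, rhs: DataType) -> bool {
    match (lhs, rhs) {
        // same logical type on both sides
        (DataType::Int32, DataType::Int32) => true,
        (DataType::Float32, DataType::Float32) => true,
        (DataType::Decimal(_, _), DataType::Decimal(_, _)) => true,
        // mixed combinations are declared valid case by case
        (DataType::Float32, DataType::Decimal(_, _)) => true,
        _ => false,
    }
}

fn main() {
    assert!(can_arithmetic(DataType::Float32, DataType::Decimal(10, 2)));
    assert!(!can_arithmetic(DataType::Int32, DataType::Float32));
}
```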

elferherrera commented 3 years ago

I see. It makes sense to have both. I was wondering why you had arithmetic and arithmetic_primitive; now you have made that very clear. I guess eventually there will have to be an arithmetic_dictionary or an arithmetic_structs.

However, I'm curious to know how you are planning to find out what type of array the dyn Array is. How will the function know whether it is a primitive array or a dictionary?

jorgecarleitao commented 3 years ago

Ahh, now I understand your question.

The easiest way to think about it is to reduce the trait Array to the following two functions:

```rust
trait Array {
    fn as_any(&self) -> &dyn std::any::Any;

    fn data_type(&self) -> &DataType;
}
```

the general flow of any function (kernel or deserialization to any format), is:

```rust
fn operator(array: &dyn Array) -> Result<(), String> {
    match array.data_type() {
        DataType::Float32 => {
            let array = array.as_any().downcast_ref::<PrimitiveArray<f32>>().unwrap();
            // f32-specific operator
            Ok(())
        }
        DataType::Float64 => {
            let array = array.as_any().downcast_ref::<PrimitiveArray<f64>>().unwrap();
            // f64-specific operator
            Ok(())
        }
        _ => Err("This operator is only valid for floating point.".to_string()),
    }
}
```

i.e. we match array.data_type() and downcast to the corresponding physical representation. There is a one-to-many correspondence between a DataType (e.g. DataType::Utf8) and a physical in-memory representation (e.g. Utf8Array<i32>). I tried to describe the idea in here, which includes the map between DataType and the physical representation.
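The same flow works with nothing but the standard library; here is a runnable stand-in where a minimal `Array` trait, reduced to the two functions above, is implemented for a hypothetical `Float32Array` (not the crate's actual types):

```rust
use std::any::Any;

#[derive(Debug, PartialEq)]
enum DataType {
    Float32,
    Float64,
}

// The trait reduced to its two essential functions.
trait Array {
    fn as_any(&self) -> &dyn Any;
    fn data_type(&self) -> &DataType;
}

// A stand-in physical representation for DataType::Float32.
struct Float32Array(Vec<f32>);

impl Array for Float32Array {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn data_type(&self) -> &DataType {
        &DataType::Float32
    }
}

// Match on the logical type, then downcast to the physical array.
fn sum(array: &dyn Array) -> Result<f64, String> {
    match array.data_type() {
        DataType::Float32 => {
            let array = array.as_any().downcast_ref::<Float32Array>().unwrap();
            Ok(array.0.iter().map(|v| *v as f64).sum())
        }
        _ => Err("This operator is only valid for floating point.".to_string()),
    }
}

fn main() {
    let array = Float32Array(vec![1.0, 2.0, 3.0]);
    assert_eq!(sum(&array), Ok(6.0));
}
```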

elferherrera commented 3 years ago

I think I get it. It will make sense to add that logic in the macro that I added to the arithmetics module. By the way, I have been adding more docs to the functions to help me understand what they do 😀.

Regarding the Array trait: I think it needs another function that can tell you what type of array it is. For example, we could have a primitive array and a dictionary array. Both implement Array, so both can be used as arguments in the arithmetics function. However, the dictionary will panic when you do this:

```rust
let rhs = $rhs.as_any().downcast_ref::<$primitive_array_type>();
```

Matching on the data type won't be enough to know the type of array used for the arithmetic. Or what do you think?

jorgecarleitao commented 3 years ago

It will make sense to add that logic in the macro that I added to the arithmetics module.

Yeap, that looks great! I would be grateful if you would like to contribute that as a PR.

Matching on the data type won't be enough to know the type of array used for the arithmetic. Or what do you think?

Note that the DataType of a dictionary array is not the same as the datatype of a primitive array. One is marked as DataType::Dictionary, the other is simply DataType::Int*. So, we will be able to match the dictionary and downcast it accordingly. Or am I missing something fundamental here?
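A toy version of that disambiguation (the enum here is heavily abbreviated; arrow2's Dictionary variant also carries a key type):

```rust
enum DataType {
    Int32,
    Dictionary(Box<DataType>), // key type omitted for brevity
}

// The match alone tells us which physical array to downcast to.
fn downcast_target(data_type: &DataType) -> &'static str {
    match data_type {
        DataType::Dictionary(_) => "DictionaryArray",
        _ => "PrimitiveArray",
    }
}

fn main() {
    assert_eq!(downcast_target(&DataType::Int32), "PrimitiveArray");
    let dict = DataType::Dictionary(Box::new(DataType::Int32));
    assert_eq!(downcast_target(&dict), "DictionaryArray");
}
```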

elferherrera commented 3 years ago

Note that the DataType of a dictionary array is not the same as the datatype of a primitive array. One is marked as DataType::Dictionary, the other is simply DataType::Int*. So, we will be able to match the dictionary and downcast it accordingly. Or am I missing something fundamental here?

hahaha... my mistake. I just checked the DataType enum and I saw dictionary there. For a moment I thought that DataType only had primitives.

Yeap, that looks great! I would be grateful if you would like to contribute that as a PR.

Sure thing. I was planning to add it in the PR for the decimals arithmetics I'm working on. Can I use the same branch for multiple PRs, or do I have to create a separate branch per PR?

jorgecarleitao commented 3 years ago

GitHub does not allow the same branch to be used for multiple PRs. It is more important to me that the individual commits are well organized so that we can cherry-pick / revert / create a changelog out of them. So, pure docs separated from functionality is good, but they do not have to be different PRs; the same PR with two commits is fine for now.

elferherrera commented 3 years ago

cool... then let me create a branch with these changes to PR the macro

elferherrera commented 3 years ago

A quick favour, do you mind checking this file with the add options for decimal?

It has the template I want to use for the other operations: subtraction, multiplication and division.

For the adaptive function I managed to adjust the scale for the resulting vector but not the precision. The problem I have is that I don't know how to keep track of the precision during the iteration process in the binary function. I'm thinking that the only way I can achieve this is by using a for loop and storing the results in a vector. It will be slower, but we can keep track of the precision in case it changes in one of the operations.
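The loop-and-track idea could be sketched like this (scaled i128 values; `adaptive_add_arrays` and its signature are hypothetical, not the PR's actual code):

```rust
// Number of decimal digits needed to represent |v| (at least 1).
fn digits(mut v: i128) -> u32 {
    v = v.abs();
    let mut d = 1;
    while v >= 10 {
        v /= 10;
        d += 1;
    }
    d
}

// Add slot by slot, widening the tracked precision whenever a sum
// needs more digits than the current precision allows.
fn adaptive_add_arrays(lhs: &[i128], rhs: &[i128], precision: u32) -> (Vec<i128>, u32) {
    let mut result = Vec::with_capacity(lhs.len());
    let mut max_precision = precision;
    for (l, r) in lhs.iter().zip(rhs.iter()) {
        let sum = l + r;
        max_precision = max_precision.max(digits(sum));
        result.push(sum);
    }
    (result, max_precision)
}

fn main() {
    // two Decimal(5, 2) columns; the first sum needs 6 digits
    let (values, precision) = adaptive_add_arrays(&[55555, 100], &[55555, 200], 5);
    assert_eq!(values, vec![111110, 300]);
    assert_eq!(precision, 6);
}
```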

I created a new function called binary_with_change that will change the data_type for the resulting array. But now that I'm writing about it, I think that function has to be in the decimal folder and it should be private.

elferherrera commented 3 years ago

@jorgecarleitao do you mind linking this issue with the Decimals PR?

jorgecarleitao commented 3 years ago

Yes, good idea. You can do that: write Closes #XX in the description. :)