Closed: jorgecarleitao closed this issue 3 years ago
Hi Jorge. I was going through the code thinking about these arithmetics. I would guess the arithmetics have to consider the precision and scale of the numbers, right? For example, if we have a Decimal(5,2) (precision 5 and scale 2) the max number that can exist for that primitive is 999.99. Therefore, adding 555.55 + 555.55 should return 999.99. Is that correct? Because now if you add those two numbers you would get 1111.1 which violates the selected precision and scale
Hi @elferherrera ! Very good points, to which I do not have a good answer. My initial thought would be to check how postgres does it.
One idea is to add different functions for the different cases, e.g. `checked_add`, which writes a null when an operation overflows; `saturating_add`, which saturates; `add`, which panics; and `adaptive_add`, which increases the precision when overflowing. I was hoping that each of these variations would re-use most of the code.
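A sketch of how the `checked` and `saturating` variants could behave for decimals stored as scaled integers (arrow stores decimals as `i128`; the helper names here are hypothetical, not arrow2 API):

```rust
// Decimal values as scaled integers: for Decimal(5, 2), 999.99 is stored
// as 99_999, so the maximum scaled value is 10^precision - 1.
fn max_for_precision(precision: u32) -> i128 {
    10i128.pow(precision) - 1
}

/// `checked`-style add: returns None (i.e. a null slot) on overflow.
fn checked_decimal_add(a: i128, b: i128, precision: u32) -> Option<i128> {
    let sum = a + b;
    if sum.abs() > max_for_precision(precision) {
        None
    } else {
        Some(sum)
    }
}

/// `saturating`-style add: clamps the result to the precision's range.
fn saturating_decimal_add(a: i128, b: i128, precision: u32) -> i128 {
    let max = max_for_precision(precision);
    (a + b).clamp(-max, max)
}

fn main() {
    // 555.55 + 555.55 with Decimal(5, 2): both operands are 55_555 scaled.
    assert_eq!(checked_decimal_add(55_555, 55_555, 5), None); // null slot
    assert_eq!(saturating_decimal_add(55_555, 55_555, 5), 99_999); // 999.99
}
```

The `add` variant would simply panic where `checked_decimal_add` returns `None`, mirroring the split Rust's standard library uses for integers.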
I've been reading some of the rules for these operations and there are more cases like that. For example, what would happen if we add two arrays with different precisions and scales? Say `Decimal(4, 3) + Decimal(5, 2)`: the resulting array should be `Decimal(7, 3)` to accommodate the largest scale, all the integer digits, and a possible carry.
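That combination rule can be sketched with the SQL-style convention (result scale is the max of the scales; result precision is the max number of integer digits plus the result scale plus one carry digit); `add_result_type` is a hypothetical helper, not arrow2 API:

```rust
/// Result (precision, scale) for adding Decimal(p1, s1) + Decimal(p2, s2),
/// following the common SQL rule.
fn add_result_type((p1, s1): (u32, u32), (p2, s2): (u32, u32)) -> (u32, u32) {
    let scale = s1.max(s2);
    // max integer digits of either side, plus one digit for a carry
    let precision = (p1 - s1).max(p2 - s2) + scale + 1;
    (precision, scale)
}

fn main() {
    // Decimal(4, 3) holds up to 9.999; Decimal(5, 2) up to 999.99.
    // Their sum can reach 1009.989, which needs Decimal(7, 3).
    assert_eq!(add_result_type((4, 3), (5, 2)), (7, 3));
}
```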
I see what you mean with the different functions, but how will you select when to apply them? I'm wondering if for each number in the array you will have to check for these cases and apply the required one.
I think that as long as we are consistent, it should not be a problem. We already panic with other types atm: e.g. if any of the slots of an `Int32` is `i32::MAX` and we add 1 on the other side, it panics.
My thinking would be that arrow2 should not have to take a decision on which behavior to apply; the problem-specific applications would decide. For example, DataFusion would use the one closest to postgres, Polars could use another variation, etc. A bit like Rust does: it offers `checked_add`, `add`, `saturating_add`, etc., and leaves each user to pick the one that fits their use-case.
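For reference, Rust's standard library integer variants behave like this:

```rust
fn main() {
    // the caller decides how to handle overflow:
    assert_eq!(i32::MAX.checked_add(1), None);        // signal it with None
    assert_eq!(i32::MAX.saturating_add(1), i32::MAX); // clamp at the bound
    assert_eq!(i32::MAX.wrapping_add(1), i32::MIN);   // wrap around
    // plain `i32::MAX + 1` panics in debug builds (and wraps in release).
}
```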
The "adaptive" case is a bit tricky in datafusion atm because it assumes that the logical schema remains constant throughout the application. My hypothesis is that the user would cast the decimal to a larger precision if they feel that the operation would overflow on the current precision.
At the moment there is no Decimal in Rust Arrow, right?
I think that there is ongoing work to support it here.
Coming back with another question about the arithmetics module. The `arithmetic` function takes references to `dyn Array`, which would mean that in the future these arithmetics could be performed on a dictionary or a struct. However, that will panic when the argument is downcast to a `PrimitiveArray`. Don't you think it would be better to specify that the arguments for the function should be `PrimitiveArray`?
The general design in the crate has been: offer both untyped and typed APIs, so that users can decide which one they prefer. The untyped API is less prone to optimizations but easier to use; the typed one benefits from the compiler's type checking.
The idea would be to extend that function to other logical types. E.g. `Primitive + Dictionary` makes sense when `rhs` is a dictionary of primitives with the same logical type. Same for `Float32 + Decimal`. We can extend this via

```rust
match (lhs.data_type(), rhs.data_type()) {
    // valid (lhs, rhs) combinations go here
}
```

instead of requiring both sides to be equal and using `match lhs.data_type() {}`.
The typed API is `arithmetic_primitive` (currently not `pub`, though xD). The other idea is to offer a second function, `can_arithmetic() -> bool`, that declares the valid combinations. We do this in `cast` (`can_cast`).
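A sketch of what a `can_arithmetic`-style check could look like, using a trimmed-down stand-in for arrow2's `DataType` enum (this is illustrative, not the crate's actual definition):

```rust
// Simplified stand-in for arrow2's DataType enum.
enum DataType {
    Int32,
    Float32,
    Utf8,
}

/// Declares which (lhs, rhs) combinations the arithmetic kernel accepts,
/// mirroring how `can_cast` declares valid casts.
fn can_arithmetic(lhs: &DataType, rhs: &DataType) -> bool {
    match (lhs, rhs) {
        (DataType::Int32, DataType::Int32) => true,
        (DataType::Float32, DataType::Float32) => true,
        // e.g. strings are never valid arithmetic operands
        _ => false,
    }
}

fn main() {
    assert!(can_arithmetic(&DataType::Int32, &DataType::Int32));
    assert!(!can_arithmetic(&DataType::Int32, &DataType::Utf8));
}
```

Callers can use it to validate operands up front instead of relying on a panic inside the kernel.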
I see. It makes sense to have both. I was wondering why you had `arithmetic` and `arithmetic_primitive`; now you have made that very clear. I guess eventually there will have to be an `arithmetic_dictionary` or `arithmetic_structs`.
However, I'm curious to know how you are planning to find out what type of array the `dyn Array` will be. How will the function know whether it is a primitive array or a dictionary?
Ahh, now I understand your question.
The easiest way to think about it is to reduce the trait `Array` to the following two functions:

```rust
trait Array {
    fn as_any(&self) -> &dyn std::any::Any;
    fn data_type(&self) -> &DataType;
}
```
The general flow of any function (kernel or deserialization to any format) is:

```rust
match array.data_type() {
    DataType::Float32 => {
        let array = array.as_any().downcast_ref::<PrimitiveArray<f32>>().unwrap();
        // apply the f32-specific operator
    }
    DataType::Float64 => {
        let array = array.as_any().downcast_ref::<PrimitiveArray<f64>>().unwrap();
        // apply the f64-specific operator
    }
    _ => return Err("This operator is only valid for floating point.".to_string()),
}
```
I.e. we match on `array.data_type()` and downcast to the corresponding physical representation. There is a one-to-many correspondence between a `DataType` (e.g. `DataType::Utf8`) and a physical in-memory representation (e.g. `Utf8Array<i32>`). I tried to describe the idea in here, which includes the map between `DataType` and the physical representation.
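A self-contained sketch of that downcast pattern, with minimal stand-ins for `Array` and `PrimitiveArray` (not arrow2's real definitions):

```rust
use std::any::Any;

// Minimal stand-in for the Array trait, reduced to the downcast entry point.
trait Array {
    fn as_any(&self) -> &dyn Any;
}

// Minimal stand-in for a primitive array.
struct PrimitiveArray<T> {
    values: Vec<T>,
}

impl<T: 'static> Array for PrimitiveArray<T> {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// A kernel recovers the concrete physical type with `Any::downcast_ref`;
/// here the downcast failure is surfaced as None instead of a panic.
fn sum_f32(array: &dyn Array) -> Option<f32> {
    let array = array.as_any().downcast_ref::<PrimitiveArray<f32>>()?;
    Some(array.values.iter().sum())
}

fn main() {
    let floats = PrimitiveArray { values: vec![1.0f32, 2.0, 3.0] };
    assert_eq!(sum_f32(&floats), Some(6.0));

    // A different physical type fails the downcast instead of panicking.
    let ints = PrimitiveArray { values: vec![1i32, 2] };
    assert_eq!(sum_f32(&ints), None);
}
```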
I think I get it. It will make sense to add that logic to the macro that I added to the arithmetics module. By the way, I have been adding more docs to the functions to help me understand what they do 😀.
Regarding the `Array` trait: I think it needs another function that can tell you what type of array it is. For example, we could have a primitive array and a dictionary array. Both implement `Array`, so both can be used as arguments in the `arithmetics` function. However, the dictionary will panic when you do this:

```rust
let rhs = $rhs.as_any().downcast_ref::<$primitive_array_type>();
```

Matching on the data type won't be enough to know the type of array used for the arithmetic. Or what do you think?
> It will make sense to add that logic to the macro that I added to the arithmetics module.
Yeap, that looks great! I would be grateful if you would like to contribute that as a PR.
> Matching on the data type won't be enough to know the type of array used for the arithmetic. Or what do you think?
Note that the `DataType` of a dictionary array is not the same as the datatype of a primitive array. One is marked as `DataType::Dictionary`, the other is simply `DataType::Int*`. So, we will be able to `match` the dictionary and downcast it accordingly. Or am I missing something fundamental here?
> Note that the `DataType` of a dictionary array is not the same as the datatype of a primitive array. One is marked as `DataType::Dictionary`, the other is simply `DataType::Int*`. So, we will be able to `match` the dictionary and downcast it accordingly. Or am I missing something fundamental here?
hahaha... my mistake. I just checked the DataType enum and I saw dictionary there. For a moment I thought that DataType only had primitives.
> Yeap, that looks great! I would be grateful if you would like to contribute that as a PR.
Sure thing. I was planning to add it in the PR for the decimal arithmetics I'm working on. Can I use the same branch for multiple PRs, or do I have to create a separate branch per PR?
GitHub does not allow the same branch to be used for multiple PRs. It is more important to me that the individual commits are well organized, so that we can cherry-pick / revert / create a changelog out of them. So, pure docs separated from functionality is good, but they do not have to be different PRs; the same PR with two commits is fine for now.
cool... then let me create a branch with these changes to PR the macro
A quick favour: do you mind checking this file with the add options for decimal?
It has the template I want to use for the other operations: subtraction, multiplication and division.
For the adaptive function I managed to adjust the scale of the resulting vector, but not the precision. The problem I have is that I don't know how to keep track of the precision during the iteration process in the binary function. I'm thinking that the only way I can achieve this is by using a for loop and storing the results in a vector. It will be slower, but we can keep track of the precision in case it changes in one of the operations.
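One possible shape for that loop, assuming scaled `i128` values that share one scale; `adaptive_add` and `precision_for` are hypothetical names for illustration, not the PR's actual code:

```rust
/// Smallest precision whose range (10^p - 1) can hold `value`.
fn precision_for(value: i128) -> u32 {
    let mut p = 1;
    while 10i128.pow(p) - 1 < value.abs() {
        p += 1;
    }
    p
}

/// Adaptive add over scaled decimal values: computes element-wise sums in a
/// plain loop while tracking the precision the results actually need.
fn adaptive_add(lhs: &[i128], rhs: &[i128], mut precision: u32) -> (Vec<i128>, u32) {
    let mut out = Vec::with_capacity(lhs.len());
    for (a, b) in lhs.iter().zip(rhs) {
        let sum = a + b;
        // grow the precision whenever a result no longer fits
        precision = precision.max(precision_for(sum));
        out.push(sum);
    }
    (out, precision)
}

fn main() {
    // Decimal(5, 2): 555.55 + 555.55 = 1111.10 (scaled 111_110) needs
    // precision 6, so the result type grows to Decimal(6, 2).
    let (values, precision) = adaptive_add(&[55_555], &[55_555], 5);
    assert_eq!(values, vec![111_110]);
    assert_eq!(precision, 6);
}
```

The cost is the per-element bookkeeping; the resulting precision is only known after the loop, which is why the output `DataType` has to be built at the end.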
I created a new function called `binary_with_change` that will change the `data_type` of the resulting array. But now that I'm writing about it, I think that function has to be in the decimal folder and it should be private.
@jorgecarleitao do you mind linking this issue with the Decimals PR?
Yes, good idea. You can do that: write `Closes #XX` in the description. :) E.g. `add(Decimal, Decimal)`.