prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Implement DECIMAL type #2797

Closed (cberner closed this issue 8 years ago)

losipiuk commented 9 years ago

We have been thinking about an initial implementation of the standard DECIMAL data type in Presto. If I understand correctly, Presto currently does not support data types with an explicit length limit, so we would like to start by implementing an unparametrized version, like the current VARCHAR data type in Presto.

As for the implementation, we wanted to do this in a similar way to how the VARCHAR and JSON data types are implemented. If we decide on implementing a bounded 128-bit DECIMAL, we would like to represent it internally as a 17-byte Slice (16 bytes for the digits and 1 byte for the decimal point position). Later we can change it to a more optimized encoding (e.g. the varint encoding used in Protocol Buffers).
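
To make the encoding concrete, a codec along these lines could work. This is only a sketch (assuming Airlift's Slice and a scale that fits in a single byte), not code from any branch:

import io.airlift.slice.Slice;
import io.airlift.slice.Slices;
import java.math.BigDecimal;
import java.math.BigInteger;

final class DecimalSliceCodec
{
    static Slice encode(BigDecimal value)
    {
        Slice slice = Slices.allocate(17);
        // big-endian two's complement digits, right-aligned in the first 16 bytes
        byte[] unscaled = value.unscaledValue().toByteArray();
        byte pad = (byte) (value.signum() < 0 ? 0xFF : 0x00); // sign extension
        for (int i = 0; i < 16 - unscaled.length; i++) {
            slice.setByte(i, pad);
        }
        slice.setBytes(16 - unscaled.length, unscaled);
        slice.setByte(16, value.scale()); // the 17th byte: decimal point position
        return slice;
    }

    static BigDecimal decode(Slice slice)
    {
        return new BigDecimal(new BigInteger(slice.getBytes(0, 16)), slice.getByte(16));
    }
}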

Initially we would like to implement support for DECIMAL in the Hive connector, but of course other connectors can also be extended to return DECIMAL columns.

To start, we would like to implement basic arithmetic operations and casting functions, and add more advanced functions gradually.

For arithmetic we thought of decoding the Slice to either the standard Java BigDecimal or some class based on Hive's Decimal128. Or maybe you have better candidates?

It seems to me that the implementation should be quite straightforward, but it is possible that I have missed some tricky parts. Is there some part of the code I should investigate more deeply?

And what do you think about the above approach in general? Does it make sense?

cc: @pnowojski, @KBP-TDC, @ilfrin

dain commented 9 years ago

I have a pending PR, https://github.com/facebook/presto/pull/2478, which adds support for Types with constant parameters (e.g., VARCHAR(LENGTH)). I suggest you start with that, and implement a DECIMAL type with two physical implementations, 64-bit and 128-bit, chosen based on the precision.

For the 128-bit version, the stack representation would have to be a Slice until @haozhun finishes adding support for arbitrary Object stack types; then you could use a simple struct containing 2 longs. Assuming the Hive Decimal128 libraries are correct and fast, I would prefer them over Java's BigDecimal, which is crazy slow.
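
For illustration, the precision-based choice could be as simple as this (a sketch only, not the PR's API; 18 decimal digits is the most that always fits in a signed 64-bit long):

// Slice is io.airlift.slice.Slice
static Class<?> stackTypeForDecimal(int precision)
{
    return precision <= 18 ? long.class : Slice.class;
}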

losipiuk commented 9 years ago

Great to hear about #2478. Are you planning to merge it any time soon? Is it finished, so we can build on it assuming the code works, or is there still some missing stuff?

As for the two implementations, I would rather start with one, just to get something working, and optimize in the next step.

You suggest that just one long for a 64-bit decimal and two longs for 128-bit would be enough. But we still need some place to store the decimal point location, right? It seems like we can use the type parameters for that, but shouldn't we also support an unparametrized DECIMAL? Then we would need some extra space for the decimal point. Or am I missing something?

The Decimal128 implementation is not very convincing, and it only supports basic operations; e.g. multiplication and division fall back to BigDecimal internally. Still, it is probably faster and has a smaller memory footprint than BigDecimal. As for correctness, I would trust BigDecimal much more.

I am not super convinced that performance should be a major concern for the DECIMAL type. It is typically not the best performing type in other databases either, and users probably expect that. Still, let's try to make it as fast as possible without sacrificing too much readability.

dain commented 9 years ago

I believe it is finished. I am waiting on a code review from @martint, and he has been very busy lately.

As for unparametrized DECIMAL, is that legal in SQL? If not, I don't think it is something we should support, as it would mean that every value could have a different scale.

losipiuk commented 9 years ago

Cool - we will base our work on top of your patch.

As for parametrization, I do not know what the actual standard says. It seems most databases treat the decimal parameters as obligatory, or just assume default parameter values if they are omitted.

So we can probably safely assume that the unparametrized version is not supported.

losipiuk commented 9 years ago

@dain: a question.

Most databases (I think it is dictated by the ANSI standard) treat literals like '123.34' as DECIMAL. In Presto such literals are interpreted as DOUBLE PRECISION.

Ultimately we should probably change the semantics here, so casting is done from DECIMAL to DOUBLE and not vice versa. DOUBLE is inaccurate by design, and you want an accurate value when comparing values from a DECIMAL column with a DECIMAL literal in a query.

Should this work be tracked with this issue, or should we just start by implementing DOUBLE to DECIMAL casting?

Or maybe I should introduce syntax similar to the one used for DATE and TIMESTAMP: if I want a DECIMAL literal, I prefix it with the DECIMAL keyword.

SELECT DECIMAL 123.456 FROM ...

What do you think? I am leaning toward the DECIMAL keyword approach.

cc: @sopel39

losipiuk commented 9 years ago

The work-in-progress branch for this feature is here: https://github.com/TeradataCenterForHadoop/presto/tree/decimal.

The initial implementation is limited to 64 bits, as the value is stored internally as a long. I want to add another (Slice-based) implementation which would support 128-bit numbers.

After a discussion with @sopel39, we think it would be nice to be able to define scalar functions operating on DECIMAL(x,y) arguments via the @Scalar annotation.

The user would then have to define multiple versions of the UDF:

@Scalar
@SqlType("decimal(x,y)")
long fun(@SqlType("decimal(z,v)") long param)

@Scalar
@SqlType("decimal(x,y)")
Slice fun(@SqlType("decimal(z,v)") Slice param)

It seemed to us that we need to introduce some mechanism to tell the function matching code which method should be called, depending on the type's literal parameters. We thought of introducing a @MatchCondition annotation which takes a boolean expression as its parameter.

In the above example the first method would be annotated with @MatchCondition("x <= 19") and the second one with @MatchCondition("x > 19").

Then I thought that maybe we could skip that altogether and just do the matching based on the function's Java types vs. the types returned by Type.getJavaType() for the actual Presto types the function is called with, after semantic analysis.

But that seems too limiting. There are cases where we would need to return a different Java type (Slice vs. long) for the same Java argument types. The decision on the return type needs to be made based on the literal parameters of the actual types of the function arguments.

So it seems like @MatchCondition is needed after all. What do you think?
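
For concreteness, a minimal form of such an annotation might look like this (just a sketch of the proposal):

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
public @interface MatchCondition
{
    // a boolean expression over the type's literal parameters, e.g. "x <= 19"
    String value();
}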

cc: @sopel39

sopel39 commented 9 years ago

We thought further about how to simplify the definition of operators and scalar functions for more advanced types like DECIMAL (which can be represented by either long or Slice, depending on precision). We propose the following convention:

private static final String ADD_RESULT_TYPE = 
   "decimal(min(38, 1 + max(aScale, bScale) + max(aPrecision - aScale, bPrecision - bScale)) as resultPrecision, "
 + "        max(aScale, bScale) as resultScale)";

@SqlType(ADD_RESULT_TYPE)
@Filter("resultPrecision <= 18")
@Filter("aPrecision <= 18")
@Filter("bPrecision <= 18")
long addShortShortShort(@SqlType("decimal(aPrecision, aScale)") long aDecimal, 
                        @SqlType("decimal(bPrecision, bScale)") long bDecimal, 
                        long aRescale, long bRescale)
{
...
}

// define the first add method, which takes long aRescale and long bRescale
functionRegistry.addFunctions(parametrizedMethods(DecimalOperators.class, "addShortShortShort")
    .withExtraParameters((literals, types) ->
        ImmutableList.builder()
            .add((long) Math.pow(10, literals.get("resultScale") - literals.get("aScale")))
            .add((long) Math.pow(10, literals.get("resultScale") - literals.get("bScale")))
            .build())
    .build());

@SqlType(ADD_RESULT_TYPE)
@Filter("resultPrecision > 18")
@Filter("aPrecision <= 18")
@Filter("bPrecision <= 18")
Slice addShortShortLong(@SqlType("decimal(aPrecision, aScale)") long aDecimal, 
                        @SqlType("decimal(bPrecision, bScale)") long bDecimal, 
                        BigInteger aRescale, BigInteger bRescale)
{
...
}

@SqlType(ADD_RESULT_TYPE)
@Filter("aPrecision > 18")
@Filter("bPrecision <= 18")
Slice addLongShortLong(@SqlType("decimal(aPrecision, aScale)") Slice aDecimal, 
                       @SqlType("decimal(bPrecision, bScale)") long bDecimal, 
                       BigInteger aRescale, BigInteger bRescale)
{
...
}

@SqlType(ADD_RESULT_TYPE)
@Filter("aPrecision <= 18")
@Filter("bPrecision > 18")
Slice addShortLongLong(@SqlType("decimal(aPrecision, aScale)") long aDecimal, 
                       @SqlType("decimal(bPrecision, bScale)") Slice bDecimal, 
                       BigInteger aRescale, BigInteger bRescale)
{
...
}

// the other methods take BigInteger aRescale and bRescale, so we can register them together
functionRegistry.addFunctions(parametrizedMethods(DecimalOperators.class, "addShortShortLong", "addLongShortLong", "addShortLongLong")
    .withExtraParameters((literals, types) ->
        ImmutableList.builder()
            .add(TEN.pow(literals.get("resultScale") - literals.get("aScale")))
            .add(TEN.pow(literals.get("resultScale") - literals.get("bScale")))
            .build())
    .build());
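
To illustrate the ADD_RESULT_TYPE expression with concrete numbers (our own example, not from the branch):

// decimal(2,1) + decimal(3,2):
//   resultScale     = max(1, 2) = 2
//   resultPrecision = min(38, 1 + max(1, 2) + max(2 - 1, 3 - 2)) = min(38, 4) = 4
// so the result type is decimal(4,2); e.g. 9.9 + 9.99 = 19.89 needs exactly that.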

By using such a convention we don't have to deal with low-level MethodHandles or create ParametricOperator classes with multiple if branches. In our opinion this makes the definition of operators for DECIMAL, and possibly other types, more concise, since the code is shorter and more readable.

For this convention to work, we would need to add support for filters in Signature and also pass the bound literals to the specialize method of ParametricFunction.

What do you think @dain?

FYI @electrum @losipiuk @ilfrin

cberner commented 9 years ago

I don't think that's the right approach. Currently, Types can only have one native container type, so decimal can't use both long and Slice. If we want to change that, it would be a pretty major change that should be considered separately from implementing the decimal type.

edit: Actually, I take that back. Since decimal(5, 5) and decimal(20, 20) are different types, they can have different native container types. However, instead of adding an @Filter, I think the predicates should go in the @SqlType for the argument.

sopel39 commented 9 years ago

@cberner Could you give an example? Do you mean something like:

@SqlType(type = "decimal(aPrecision, aScale)", predicate = "aPrecision <= 18")

Wouldn't the additional parameters bloat @SqlType? I think in that case we should use another annotation class, maybe @SqlConditionalType?

cberner commented 9 years ago

I haven't thought it through in detail. However, it needs to end up as a TypeSignature, so it should probably just be part of the String. I think we should start with something simpler, like making it possible to write ArrayCardinalityFunction and ArraySubscriptOperator before we add support for literal parameters.

losipiuk commented 9 years ago

It is true that the filter condition, based on which a method is selected, is part of the signature. But IMO it should not be passed as part of the @SqlType annotation.

It is possible (and that is the case even in the aforementioned example of decimal addition) that the condition expression contains variables from the types of multiple arguments.

For example, we can have a function with the following signature:

@SqlType("decimal(x+1, y+1)")
void fun(@SqlType("decimal(x,y)") a, @SqlType("decimal(v,w)") b) {....}

And the condition for picking such a function is:

x + v > 18

I do not see a natural place for such a condition in any single @SqlType annotation. A separate @Filter annotation seems a better place to me. Or maybe @Filter is just a poor name and we should pick something else?

Regarding ArrayCardinalityFunction:

Note that we already pass type parameters to the withExtraParameters method. But we missed that we cannot specify them for a method through the @SqlType annotation.

One solution could be extending the builder with a way to explicitly specify the method signature. Then ArrayCardinalityFunction would look more or less like this:

class ArrayCardinalityFunction {
    public static long cardinality(Slice slice) {
        return (long) readStructuralBlock(slice).getPositionCount();
    }
}

// and registration

functionRegistry.addFunctions(parametrizedMethods(ArrayCardinalityFunction.class, "cardinality")
    .withTypeParameters("E")
    .withSignature("bigint", "array<E>")
    .build());

The runtime signature of the function would be calculated automatically by the specialize method generated by the builder.

cberner commented 9 years ago

I agree that @SqlType isn't the ideal place. Thinking about this more, I'm not sure we need the predicate at all, though. The only reason it's needed is to resolve the method with the right native container type, correct? The registry should be able to do that without an annotation, just by looking at the Java method signature.
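
Something along these lines could do it (a sketch of the idea only, not actual registry code; Type is Presto's SPI type with the getJavaType() accessor mentioned above):

import java.lang.reflect.Method;
import java.util.List;

static Method matchByContainerTypes(List<Method> candidates, List<Type> boundArgumentTypes)
{
    for (Method method : candidates) {
        Class<?>[] parameterTypes = method.getParameterTypes();
        // extra trailing parameters (e.g. rescale factors) are allowed
        boolean matches = parameterTypes.length >= boundArgumentTypes.size();
        for (int i = 0; matches && i < boundArgumentTypes.size(); i++) {
            // compare against the native container type (long, Slice, ...) of the bound type
            matches = parameterTypes[i] == boundArgumentTypes.get(i).getJavaType();
        }
        if (matches) {
            return method;
        }
    }
    throw new IllegalArgumentException("no variant matches the bound container types");
}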

sopel39 commented 9 years ago

Indeed, we thought about removing filters altogether. This would probably work for decimal, where we use long for short decimals and Slice for long decimals. However, for other types, matching by the function's Java signature alone is not enough, since the signature might be the same for multiple variants of the same type. For example, we could add another decimal implementation with even more accuracy (e.g. 256 bits) that would also be represented by a Slice. In this case we wouldn't be able to select the proper operator, since the operator for the 128-bit decimal would have the same signature as the operator for the 256-bit decimal.

Additionally, the Java operator signature is only known after the specialize method is called on the matched ParametricFunction. This is too late; the filtering mechanism should therefore be added to Signature itself, which is used to select the appropriate operator. Signature, however, is independent of Java function signatures, so it seems the filtering mechanism should not introduce such a dependency.

dain commented 9 years ago

For the first one, I'm not that concerned; if it does happen we can implement some kind of filter mechanism then, but for now the Java type matching will make it easy to write these functions.

As for the second one, I'll let @cberner reply.

cberner commented 9 years ago

Yep, I can see the problem there, but I agree with @dain that we should put that off until we actually need it.

As for the second issue, I was thinking there would be a compiler that collects all instances of the decimal add method (the 6? MethodHandles) and then generates a new instance of ParametricFunction with a specialize method that switches on the concrete types to choose the correct MethodHandle.
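
Roughly like this (an illustrative sketch only; the DecimalType accessors and the ADD_* handle constants are placeholders, not actual API):

import java.lang.invoke.MethodHandle;

// ADD_SHORT_SHORT_SHORT etc. stand for the MethodHandles collected from the variants discussed above
static MethodHandle specializeAdd(DecimalType aType, DecimalType bType, DecimalType resultType)
{
    boolean aShort = aType.getJavaType() == long.class;
    boolean bShort = bType.getJavaType() == long.class;
    boolean resultShort = resultType.getJavaType() == long.class;
    if (aShort && bShort && resultShort) {
        return ADD_SHORT_SHORT_SHORT;
    }
    if (aShort && bShort) {
        return ADD_SHORT_SHORT_LONG;
    }
    if (aShort) {
        return ADD_SHORT_LONG_LONG;
    }
    if (bShort) {
        return ADD_LONG_SHORT_LONG;
    }
    return ADD_LONG_LONG_LONG; // both arguments Slice-backed
}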

p.s. for some reason I'm only getting emails for replies to this issue where I'm mentioned, so please tag me in your replies :)

losipiuk commented 9 years ago

That makes sense. If we base the matching on concrete Java types, it will be simpler and should solve our current problems. There is just one issue we have to address, though: different method instances may take different extra parameters.

E.g. for the ADD operator on decimal we have two groups of methods.

The first contains just one method:

public static long addShortShortShort(long a, long b, long aRescale, long bRescale) {
   return a * aRescale + b * bRescale;
}
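
For example (our own numbers):

// 1.5 + 0.25, i.e. decimal(2,1) + decimal(3,2):
//   a = 15 (unscaled), aScale = 1; b = 25 (unscaled), bScale = 2
//   resultScale = max(1, 2) = 2, so aRescale = 10^(2-1) = 10 and bRescale = 10^(2-2) = 1
//   addShortShortShort(15, 25, 10, 1) = 15 * 10 + 25 * 1 = 175, i.e. 1.75 at scale 2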

The other four are in the second group, e.g.:

public static Slice addShortShortLong(long a, long b, BigInteger aRescale, BigInteger bRescale) {
    BigInteger aBigInteger = BigInteger.valueOf(a);
    BigInteger bBigInteger = BigInteger.valueOf(b);
    return internalAddLongLongLong(aBigInteger, bBigInteger, aRescale, bRescale);
}
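
internalAddLongLongLong is not shown here; a plausible version might be (encodeUnscaledValue, packing a BigInteger into the Slice representation, is a hypothetical helper):

private static Slice internalAddLongLongLong(BigInteger a, BigInteger b, BigInteger aRescale, BigInteger bRescale)
{
    // rescale both operands to the result scale, then add
    BigInteger result = a.multiply(aRescale).add(b.multiply(bRescale));
    return encodeUnscaledValue(result); // hypothetical serialization helper
}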

We can still live without filters and handle that by extending the builder DSL a bit. See this gist: https://gist.github.com/losipiuk/c05bdebfcfdfd5be0236

This is not super trivial, but decimal functions are painful. For simpler scenarios it should look nicer.

cc: @cberner, @dain, @sopel39, @ilfrin

losipiuk commented 9 years ago

Kind ping on this one @cberner :)

cberner commented 9 years ago

I'm on vacation for a week, and then have a lot of other code reviews to do, so it'll probably be a bit before I get to this

losipiuk commented 9 years ago

Sure - have a nice time :)

losipiuk commented 9 years ago

@dain, @cberner, @sopel39

Just a quick comment to make sure we are on the same page. I am working on defining multiple versions of the same scalar function for different type representations of the parameters/return type, using annotations instead of the builder.

Take a look at these two snippets and please confirm whether you still believe the annotation version is better (there is some repeated code there).

annotations version: https://gist.github.com/losipiuk/cdf89620ac37ef230118
builder version: https://gist.github.com/losipiuk/e2de3789edf0151f6a4c

To be honest, I personally find the builder version more readable due to its explicitness.

dain commented 9 years ago

One question on the annotation version: how does the return type work, given that both parameters are decimal(p,s)? I would have thought that each parameter would have different argument names, like this: @SqlType("decimal(ap,as)") Slice a, @SqlType("decimal(bp,bs)") Slice b

losipiuk commented 9 years ago

@dain For handling mixed types we would need to pass extra scaling arguments and define more Java functions for each operator, in a similar way to how it is done in the current DECIMAL PR using the builder.

My takeaway from the discussion in Boston was that we want to avoid defining all the possible versions, and instead depend on automatic coercion, which will cast both parameters to their common supertype.

e.g.

add(x, y), where x is DECIMAL(1,0) and y is DECIMAL(1,1), will be executed as add(CAST(x AS DECIMAL(2,1)), CAST(y AS DECIMAL(2,1)))
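
The coercion rule implied by this example can be sketched as follows (the DecimalType accessors and factory are illustrative, not the exact API of the patches):

static DecimalType commonSuperType(DecimalType a, DecimalType b)
{
    int scale = Math.max(a.getScale(), b.getScale());
    int integerDigits = Math.max(a.getPrecision() - a.getScale(), b.getPrecision() - b.getScale());
    // cap at the maximum supported precision of 38
    return DecimalType.createDecimalType(Math.min(38, integerDigits + scale), scale);
}

// e.g. decimal(1,0) and decimal(1,1): scale = 1, integerDigits = 1 -> decimal(2,1)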

Actually, there are work-in-progress patches which do that in https://github.com/Teradata/presto/commits/decimal (https://github.com/Teradata/presto/commit/449d5161d8d176e6eb76ae8dd10ccff1e9ed84d9 and https://github.com/Teradata/presto/commit/79b58266e2073437aec969d16f309247938ee848)

Downchuck commented 8 years ago

I'm having a hard time following these threads - it seems that this work has been implemented in the Teradata fork for some time: https://github.com/Teradata/presto/pull/89

petroav commented 8 years ago

@Downchuck Yes, Teradata has been working on decimal for a while and we're going through code review with the Facebook engineers at the moment. We use Teradata PRs (like the one you linked to) for internal code reviews before making a PR against facebook/presto.

I believe we're at the fourth iteration of decimal at the moment, and you can follow the review process here: https://github.com/facebook/presto/pull/4426.