trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.28k stars 2.96k forks source link

Prevent potential overflow when converting timestamp values #5168

Open findepi opened 4 years ago

findepi commented 4 years ago
$ git grep '* MICROSECONDS_PER_SECOND'
presto-base-jdbc/src/main/java/io/prestosql/plugin/jdbc/StandardColumnMappings.java:        return instant.getEpochSecond() * MICROSECONDS_PER_SECOND + roundDiv(instant.getNano(), NANOSECONDS_PER_MICROSECOND);
presto-main/src/main/java/io/prestosql/operator/scalar/DateTimeFunctions.java:        return Math.round(unixTime * MICROSECONDS_PER_SECOND);
presto-main/src/main/java/io/prestosql/operator/scalar/timestamp/VarcharToTimestampCast.java:        return epochSecond * MICROSECONDS_PER_SECOND + rescale(fractionValue, actualPrecision, 6);
presto-main/src/main/java/io/prestosql/type/DateTimes.java:        return epochSecond * MICROSECONDS_PER_SECOND + roundDiv(picoOfSecond, PICOSECONDS_PER_MICROSECOND);
presto-main/src/main/java/io/prestosql/type/DateTimes.java:                start.getEpochSecond() * MICROSECONDS_PER_SECOND + start.getLong(MICRO_OF_SECOND),
presto-main/src/test/java/io/prestosql/operator/scalar/timestamp/TestTimestamp.java:            long epochMicros = base.toEpochSecond(offset) * MICROSECONDS_PER_SECOND + picoOfSecond / PICOSECONDS_PER_MICROSECOND;
presto-orc/src/main/java/io/prestosql/orc/reader/TimestampColumnReader.java:        long micros = (seconds + baseTimestampInSeconds) * MICROSECONDS_PER_SECOND;
presto-orc/src/main/java/io/prestosql/orc/reader/TimestampColumnReader.java:        long micros = (seconds + baseTimestampInSeconds) * MICROSECONDS_PER_SECOND;
presto-orc/src/test/java/io/prestosql/orc/OrcTester.java:                long micros = timestamp.toEpochSecond() * MICROSECONDS_PER_SECOND;
presto-orc/src/test/java/io/prestosql/orc/OrcTester.java:                long micros = timestamp.toEpochSecond() * MICROSECONDS_PER_SECOND;
$ git grep '* MICROSECONDS_PER_MILLISECOND'
[...]

etc

findepi commented 4 years ago

When converting timestamp values between units (seconds, milliseconds, microseconds) we should multiply with checking for overflow unless the value can be statically proven not to cause overflow.

Maybe it would be possible to make the constants like MICROSECONDS_PER_SECOND, MICROSECONDS_PER_MILLISECOND, etc private and provide some reusable conversion functions instead? This would solve some other problem too -- that it's too easy to write conversion code that works "slightly incorrectly" for values before the epoch.

cc @martint @losipiuk @aalbu

findepi commented 4 years ago

@martint please define the overall approach so that someone can work on this

jirassimok commented 4 years ago

(I only considered picoseconds while writing the first part of this; see the end for thoughts regarding all time units.)

The most common operation we use on picoseconds is unit conversion: I think a straightforward solution would be a Picoseconds (or Picos) utility class with an API similar to TimeUnit.

We'd want toNanos, toMillis, etc, and either the corresponding from functions or convert(long, TimeUnit) (like TimeUnit.convert1). Because we frequently round during these conversions, we'd also want roundToNanos, etc.

We use two other operations on picoseconds: conversions to picos-of-day and extraction of other time fields (e.g. minute-of-hour). These could be implemented in a new class or as a few extra methods on the same class as the conversions.2

1 Personally, I prefer Picos.from(long, TimeUnit).

2 Another possibility would be to implement the picosecond-conversion class as a one-element enum (as if it were an extension of TimeUnit) and leaving the additional methods for dealing with picos-of-day as regular static methods.


Looking at the conversion functions already in DateTimes, it wouldn't be too hard to expand them, adding a variety of scaleXToY/roundXToY/truncateXToY methods. I'd still prefer an API closer to TimeUnit, but in the end, they aren't all that different.

I think we still want a separate class/API for extracting specific time fields, though, because we'll have both getMicrosOfSecond(micros) and getSecondsOfMinute(picos), which could get ambiguous easily. A clean solution to this might involve having a utility class for each time representation (or one enum class for all of them) we use and giving those the appropriate getXOfY methods. Those utility classes could also help avoid ambiguities like toEpochMicros(millis, picos) and toEpochMicros(second, picos).

jirassimok commented 3 years ago

@martint What do you think?

martint commented 3 years ago

I'm looking into this, but here are a couple of thoughts:

jirassimok commented 3 years ago

For dynamic precision, we could keep versions of the methods we already have for that purpose, something with the same basic idea as rescale(value, fromPrecision, toPrecision), but ideally with a more type-specific name (or namespace).

We already have this function in multiple places, each with slightly different semantics. If we renamed it to something like rescaleTimestamp and gave it appropriate rounding semantics, overflow checking, and documentation, it could work well as-is and could replace many of the uses of the conversion constants.

Using this pattern, we could have just one or two functions per operation type per datetime "type" (i.e. timestamp vs subunit-of-unit). We could also provide the integer constants representing each unit's scale to produce slightly more readable code (rescale(n, NANO, PICO) vs rescale(n, 9, 12)).


We also don't currently use fully-consistent rounding behavior for timestamps (or time-of-day). Timestamps.roundDiv rounds much like Math.round, while DateTimes.rescale, which is used in much the same way, uses regular long division, which truncates in different direction above and below zero (presumably a bad choice for epoch-based times).

(The same goes for other datetime values, though I think we want floor rounding rather than half-ceiling rounding for units like time-of-day.)

martint commented 3 years ago

Timestamps.roundDiv rounds much like Math.round, while DateTimes.rescale

rescale is generally used after the value has been rounded to the desired magnitude. E.g.,

if (sourcePrecision > targetPrecision) {
    picoFraction = round(picoFraction, (int) (TimeType.MAX_PRECISION - targetPrecision));
}

long microFraction = rescale(picoFraction, TimeType.MAX_PRECISION, 6);

or

interval = round(interval, 3);
interval = rescale(interval, MAX_SHORT_PRECISION, 3);