trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

JSON decoder does not work when milliseconds-since-epoch is represented in scientific notation #7604

Open PChou opened 3 years ago

PChou commented 3 years ago

Trino version: <= 354. I'm testing the Kafka connector; the source topic data shows as below (using the Kafka console consumer):

{"traceId":"06e52ab0-2b96-4cf1-a92c-0e64ee385d16", "timestamp":1.618540600943E12}

Notice that timestamp represents milliseconds since epoch, but is shown in scientific notation. I then configured the table descriptor:

{
  "message": {
    "dataFormat": "json",
    "fields": [{
      "name": "traceId",
      "type": "VARCHAR",
      "mapping": "traceId"
    },{
      "name": "timestamp",
      "type": "TIMESTAMP",
      "mapping": "timestamp",
      "dataFormat": "milliseconds-since-epoch"
    }]
  }
}

but the query failed:

Query failed: could not parse value '1.618458216014E12' as 'timestamp(3)' for column 'timestamp'

From io/trino/decoder/json/MillisecondsSinceEpochJsonFieldDecoder.java I found the root cause of the exception:

protected long getMillis()
{
    if (value.isIntegralNumber() && !value.isBigInteger()) {
        return value.longValue();
    }
    if (value.isValueNode()) {
        try {
            return parseLong(value.asText());
        }
        catch (NumberFormatException e) {
            throw new TrinoException(
                    DECODER_CONVERSION_NOT_SUPPORTED,
                    format("could not parse value '%s' as '%s' for column '%s'", value.asText(), columnHandle.getType(), columnHandle.getName()));
        }
    }
    throw new TrinoException(
            DECODER_CONVERSION_NOT_SUPPORTED,
            format("could not parse non-value node as '%s' for column '%s'", columnHandle.getType(), columnHandle.getName()));
}

Failure happens at parseLong("1.618458216014E12")

Since some JSON frameworks (e.g. Jackson) may encode a long value in scientific notation, do we need to support this case?
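A minimal repro sketch of what the decoder sees, assuming Jackson's default tree parsing (plain Java for illustration, not Trino code):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ScientificNotationRepro
{
    public static void main(String[] args) throws Exception
    {
        // Jackson parses a number containing '.' or 'E' as a floating-point node,
        // so isIntegralNumber() is false and the decoder falls through to parseLong.
        JsonNode node = new ObjectMapper()
                .readTree("{\"timestamp\":1.618540600943E12}")
                .get("timestamp");

        System.out.println(node.isIntegralNumber()); // false
        System.out.println(node.asText());           // 1.618540600943E12
        Long.parseLong(node.asText());               // throws NumberFormatException
    }
}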

hashhar commented 3 years ago

I think the issue is that we are treating a JSON number as an integer/long value when in fact it is a floating-point value.

I think this should be as easy as reading the value into a double and casting back into long instead (while being aware of when the value is non-integral).
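A rough sketch of that idea (the helper name parseEpochMillis is made up for illustration):

// Read the JSON number as a double first, then narrow to long.
// Double.parseDouble accepts both "1618540600943" and "1.618540600943E12".
static long parseEpochMillis(String text)
{
    double parsed = Double.parseDouble(text);
    return (long) parsed; // still need to decide how to handle a fractional part
}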

JoJoBizarreAdventure commented 3 years ago

I found that it is parsed as a FloatNode, which TestMillisecondsSinceEpochJsonFieldDecoder expects to be treated as invalid. There is precision loss when it is stored as a float: the result would be 1618540625920000. Shall we simply cast the float to long, or validate when parsing the node?
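A quick illustration of that precision difference in plain Java, shown in milliseconds (the 1618540625920000 above is presumably the same value scaled to the microsecond precision used internally for timestamp(3)):

double asDouble = Double.parseDouble("1.618540600943E12");
float asFloat = Float.parseFloat("1.618540600943E12");

System.out.println((long) asDouble); // 1618540600943 - exact, well within double's 53-bit mantissa
System.out.println((long) asFloat);  // 1618540625920 - off by ~25 seconds, float only has a 24-bit mantissa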

PChou commented 3 years ago

@hashhar @JoJoBizarreAdventure If floating-point numbers are supported, there is indeed a loss of precision. We need to decide how to discard units smaller than milliseconds.

hashhar commented 3 years ago

@PChou IMO:

  1. Use Double.parseDouble to read the values from JSON since JSON doesn't have an integer type.
  2. If the parsed value has a fractional part - throw an exception because none of the formatters support fractional parts of seconds/millis/micros. Silent truncation can lead to misleading expected and actual values for the end-user.

Note that not truncating also has the benefit that it matches the behaviour we already have today. (A sketch of this approach follows below.)

cc: @losipiuk WDYT?
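A sketch of what this proposal could look like on top of the existing getMillis() shown above; this is only an illustration of the idea, not the actual patch:

protected long getMillis()
{
    if (value.isIntegralNumber() && !value.isBigInteger()) {
        return value.longValue();
    }
    if (value.isValueNode()) {
        double parsed;
        try {
            // Read through double so scientific notation such as 1.618540600943E12 is accepted.
            parsed = Double.parseDouble(value.asText());
        }
        catch (NumberFormatException e) {
            throw new TrinoException(
                    DECODER_CONVERSION_NOT_SUPPORTED,
                    format("could not parse value '%s' as '%s' for column '%s'", value.asText(), columnHandle.getType(), columnHandle.getName()));
        }
        // Reject values with a fractional part instead of silently truncating them.
        if (Double.isInfinite(parsed) || parsed != Math.floor(parsed)) {
            throw new TrinoException(
                    DECODER_CONVERSION_NOT_SUPPORTED,
                    format("could not parse value '%s' as '%s' for column '%s'", value.asText(), columnHandle.getType(), columnHandle.getName()));
        }
        return (long) parsed;
    }
    throw new TrinoException(
            DECODER_CONVERSION_NOT_SUPPORTED,
            format("could not parse non-value node as '%s' for column '%s'", columnHandle.getType(), columnHandle.getName()));
}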

losipiuk commented 3 years ago

> Note that not truncating also has the benefit that it matches the behaviour we already have today.

I would go with truncating. There is a chance that someone depends on support for a non-zero fractional part right now.

hashhar commented 3 years ago

> > Note that not truncating also has the benefit that it matches the behaviour we already have today.
>
> I would go with truncating. There is a chance that someone depends on support for a non-zero fractional part right now.

Unlikely, because the current code would throw a NumberFormatException due to the parseLong call.

losipiuk commented 3 years ago

> > > Note that not truncating also has the benefit that it matches the behaviour we already have today.
> >
> > I would go with truncating. There is a chance that someone depends on support for a non-zero fractional part right now.
>
> Unlikely, because the current code would throw a NumberFormatException due to the parseLong call.

Yeah, you are right. Then validation seems fine, given that any JSON value representing an integer should be represented as an integral double value (without a fractional part) under IEEE 754.
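For reference, a quick check of that IEEE 754 property in plain Java (integers up to 2^53 are exactly representable as doubles, which comfortably covers epoch milliseconds):

double millis = Double.parseDouble("1.618540600943E12");
System.out.println(millis == Math.floor(millis));                      // true - no fractional part
System.out.println((long) millis == 1_618_540_600_943L);               // true - exact round trip
System.out.println((double) (1L << 53) == (double) ((1L << 53) + 1));  // true - exactness ends at 2^53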