trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Avro timestamp encoder is null when decoding timestamp from bigint #17761

Open pulparindo opened 1 year ago

pulparindo commented 1 year ago

When decoding a timestamp stored as a bigint (millis) column from an Avro-based table, the query throws the following error:

Cannot invoke "io.trino.plugin.base.type.TrinoTimestampEncoder.getTimestamp(io.trino.plugin.base.type.DecodedTimestamp)" because "encoder" is null

Steps for repro:

  1. Create Avro table (its embedded Avro schema is shown as formatted JSON after these steps)
CREATE EXTERNAL TABLE `sample_avro_table`(
  `event_type` string, 
  `event_timestamp` bigint)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
WITH SERDEPROPERTIES ( 
'avro.schema.literal'='{\"type\":\"record\",\"name\":\"AE\",\"namespace\":\"xevent.v1\",\"fields\":[{\"name\":\"event_type\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},{\"name\":\"event_timestamp\",\"type\":{\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}}],\"connect.version\":1,\"connect.name \":\"xevent.v1\"}') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  's3://sample-avro-bucket/';
  2. Insert record
INSERT INTO "sample_avro_table" VALUES ('1', 1886723648234234);
  3. Read table
    SELECT * FROM "default"."sample_avro_table";
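
For readability, the avro.schema.literal embedded in step 1 expands to the following schema (same content as above, just reformatted as indented JSON): the event_timestamp field is an Avro long carrying the timestamp-millis logical type, while the table column is declared as plain bigint.

    {
      "type": "record",
      "name": "AE",
      "namespace": "xevent.v1",
      "fields": [
        {"name": "event_type", "type": {"type": "string", "avro.java.string": "String"}},
        {"name": "event_timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}}
      ],
      "connect.version": 1,
      "connect.name": "xevent.v1"
    }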

When executing step 3, the query throws the error above.

Is this a valid scenario in Trino? Is this behavior by design?

I see that when the encoders are created in the GenericHiveRecordCursor class, there is no valid option for bigint, because those encoders are created based on the table column definitions and not on the Avro schema definition (a long with the timestamp-millis logical type annotation): https://github.com/trinodb/trino/blob/f431c3a9aed79917c99665b452773cffd5ac48c6/plugin/trino-hive/src/main/java/io/trino/plugin/hive/GenericHiveRecordCursor.java#L168

However, when the long value is read, the conversion treats it as a timestamp (shortTimestamp) https://github.com/trinodb/trino/blob/f431c3a9aed79917c99665b452773cffd5ac48c6/plugin/trino-hive/src/main/java/io/trino/plugin/hive/GenericHiveRecordCursor.java#L304 but the encoder for the column does not exist, so shortTimestamp dereferences a null encoder.
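
Here is a minimal, self-contained sketch of that mismatch (hypothetical stand-in code, not the actual Trino source): an encoder is built only when the declared table column type is a timestamp, so a column declared as bigint never gets one, even though the Avro field carries the timestamp-millis logical type, and reading it later as a timestamp then hits the null encoder.

    import java.util.Optional;

    // Hypothetical sketch, not the actual Trino code: encoders are derived from
    // the declared table column type; the Avro logical type on the underlying
    // long is never consulted.
    public class EncoderCreationSketch
    {
        enum TableColumnType { BIGINT, TIMESTAMP }

        // Stand-in for io.trino.plugin.base.type.TrinoTimestampEncoder
        interface TimestampEncoder
        {
            long getTimestamp(long epochMillis);
        }

        static Optional<TimestampEncoder> createEncoder(TableColumnType declaredType)
        {
            if (declaredType == TableColumnType.TIMESTAMP) {
                TimestampEncoder encoder = epochMillis -> epochMillis;
                return Optional.of(encoder);
            }
            // A bigint column gets no encoder, regardless of the Avro
            // timestamp-millis annotation in avro.schema.literal.
            return Optional.empty();
        }

        public static void main(String[] args)
        {
            System.out.println(createEncoder(TableColumnType.BIGINT).isPresent());    // false -> later NPE
            System.out.println(createEncoder(TableColumnType.TIMESTAMP).isPresent()); // true
        }
    }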

The stack trace of the error is the following:

FailureException
    at io.trino.plugin.hive.GenericHiveRecordCursor.shortTimestamp(GenericHiveRecordCursor.java:660)
    at io.trino.plugin.hive.GenericHiveRecordCursor.getLongExpressedValue(GenericHiveRecordCursor.java:337)
    at io.trino.plugin.hive.GenericHiveRecordCursor.parseLongColumn(GenericHiveRecordCursor.java:322)
    at io.trino.plugin.hive.GenericHiveRecordCursor.parseColumn(GenericHiveRecordCursor.java:579)
    at io.trino.plugin.hive.GenericHiveRecordCursor.isNull(GenericHiveRecordCursor.java:567)
    at io.trino.plugin.hive.HiveRecordCursor.isNull(HiveRecordCursor.java:210)
    at io.trino.spi.connector.RecordPageSource.getNextPage(RecordPageSource.java:96)
...
bpahuja commented 1 year ago

Hello, @elonazoulay @findepi

If you have any update on the expected behavior in Trino with respect to avro.schema.literal in this case, it would be much appreciated. Thanks

electrum commented 11 months ago

Can you try with the latest version? We have a new Avro reader implementation in Trino now.

electrum commented 11 months ago

cc @jklamer