trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Difference between native hive implementation and hive serde #19397

Open guptashailesh92 opened 11 months ago

guptashailesh92 commented 11 months ago

Hi,

I created a table that is read with the OpenX JSON format:

 CREATE TABLE openxjson_test131 (
    id integer,
    name varchar,
    address array(ROW(city varchar, zipcode varchar, street varchar))
 )
 WITH (
    external_location = 's3://<path>'
 )

Contents of data.json:

{"id":1,"name":"test1","address":"null"}
{"id":2,"name":"test2","address":null}

The query below fails in the native implementation with the following error:

select * from openxjson_test131 where id = 2

io.trino.spi.TrinoException: Failed to read file at s3://<path>/data.json
    at io.trino.plugin.hive.line.LinePageSource.getNextPage(LinePageSource.java:75)
    at io.trino.plugin.hive.HivePageSource.getNextPage(HivePageSource.java:208)
    at io.trino.operator.ScanFilterAndProjectOperator$ConnectorPageSourceToPages.process(ScanFilterAndProjectOperator.java:389)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
    at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:262)
    at io.trino.operator.WorkProcessorUtils$YieldingProcess.process(WorkProcessorUtils.java:182)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:347)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:347)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:347)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
    at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:262)
    at io.trino.operator.WorkProcessorUtils$BlockingProcess.process(WorkProcessorUtils.java:208)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
    at io.trino.operator.WorkProcessorUtils.lambda$flatten$6(WorkProcessorUtils.java:318)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:360)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:347)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
    at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:262)
    at io.trino.operator.WorkProcessorUtils.lambda$processStateMonitor$2(WorkProcessorUtils.java:241)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
    at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:262)
    at io.trino.operator.WorkProcessorUtils.lambda$finishWhen$3(WorkProcessorUtils.java:256)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
    at io.trino.operator.WorkProcessorSourceOperatorAdapter.getOutput(WorkProcessorSourceOperatorAdapter.java:146)
    at io.trino.operator.Driver.processInternal(Driver.java:402)
    at io.trino.operator.Driver.lambda$process$8(Driver.java:305)
    at io.trino.operator.Driver.tryWithLock(Driver.java:701)
    at io.trino.operator.Driver.process(Driver.java:297)
    at io.trino.operator.Driver.processForDuration(Driver.java:268)
    at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1017)
    at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:187)
    at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:606)
    at io.trino.$gen.Trino_dev____20231011_052918_2.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.RuntimeException: Invalid JSON: Primitive can not be coerced to a ROW
    at io.trino.hive.formats.line.openxjson.OpenXJsonDeserializer.invalidJson(OpenXJsonDeserializer.java:900)
    at io.trino.hive.formats.line.openxjson.OpenXJsonDeserializer$RowDecoder.decodeValueFromString(OpenXJsonDeserializer.java:796)
    at io.trino.hive.formats.line.openxjson.OpenXJsonDeserializer$RowDecoder.decodeValue(OpenXJsonDeserializer.java:779)
    at io.trino.hive.formats.line.openxjson.OpenXJsonDeserializer$RowDecoder.decodeValue(OpenXJsonDeserializer.java:772)
    at io.trino.hive.formats.line.openxjson.OpenXJsonDeserializer$Decoder.decode(OpenXJsonDeserializer.java:233)
    at io.trino.hive.formats.line.openxjson.OpenXJsonDeserializer$ArrayDecoder.decodeValue(OpenXJsonDeserializer.java:670)
    at io.trino.hive.formats.line.openxjson.OpenXJsonDeserializer$Decoder.decode(OpenXJsonDeserializer.java:233)
    at io.trino.hive.formats.line.openxjson.OpenXJsonDeserializer$RowDecoder.decodeValueFromMap(OpenXJsonDeserializer.java:828)
    at io.trino.hive.formats.line.openxjson.OpenXJsonDeserializer$RowDecoder.decodeValue(OpenXJsonDeserializer.java:782)
    at io.trino.hive.formats.line.openxjson.OpenXJsonDeserializer$RowDecoder.decode(OpenXJsonDeserializer.java:765)
    at io.trino.hive.formats.line.openxjson.OpenXJsonDeserializer.deserialize(OpenXJsonDeserializer.java:159)
    at io.trino.plugin.hive.line.LinePageSource.getNextPage(LinePageSource.java:62)

whereas the same query works with the Hive SerDe and returns:

 id | name  | address
----+-------+---------
  2 | test2 | NULL
(1 row)

I understand that in this case the data does not match the defined schema (the first line stores address as the string "null" rather than an array). However, the Hive SerDe used to parse only the rows that passed the filter, whereas the native implementation parses every row and writes it to a BlockBuilder before filtering, which leads to this error. This will happen for all native implementations.

Can you suggest a workaround?
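The difference described above can be modelled with a small, self-contained Python sketch. It is not Trino or Hive code: `decode_row`, `decode_address`, `eager_scan` and `lazy_scan` are hypothetical names, and wrapping a non-array value into a single-element array is an assumption inferred from the stack trace (ArrayDecoder calling RowDecoder.decodeValueFromString). It only illustrates why decoding every row before filtering surfaces the bad first line, while decoding only the rows that pass the filter does not.

```python
import json

# The two lines of data.json from the report.
LINES = [
    '{"id":1,"name":"test1","address":"null"}',
    '{"id":2,"name":"test2","address":null}',
]

FIELDS = ("city", "zipcode", "street")


def decode_row(value):
    """Hypothetical strict decoder for ROW(city, zipcode, street)."""
    if value is None:
        return None
    if not isinstance(value, dict):
        raise ValueError("Invalid JSON: Primitive can not be coerced to a ROW")
    return {field: value.get(field) for field in FIELDS}


def decode_address(value):
    """Hypothetical decoder for array(ROW(...)).

    Assumption: a non-array value is treated as a single-element array,
    which is what the stack trace in the report suggests.
    """
    if value is None:
        return None
    elements = value if isinstance(value, list) else [value]
    return [decode_row(element) for element in elements]


def decode_line(raw):
    """Decode all three columns of one parsed JSON object."""
    return {
        "id": raw["id"],
        "name": raw["name"],
        "address": decode_address(raw["address"]),
    }


def eager_scan(lines, wanted_id):
    """Decode every column of every line, then apply the filter."""
    decoded = [decode_line(json.loads(line)) for line in lines]
    return [row for row in decoded if row["id"] == wanted_id]


def lazy_scan(lines, wanted_id):
    """Check the filter column first; decode the rest only for matches."""
    result = []
    for line in lines:
        raw = json.loads(line)
        if raw["id"] != wanted_id:
            continue  # the malformed address on this line is never decoded
        result.append(decode_line(raw))
    return result


if __name__ == "__main__":
    print(lazy_scan(LINES, 2))
    try:
        eager_scan(LINES, 2)
    except ValueError as error:
        print("eager scan failed:", error)
```

In this model, `lazy_scan(LINES, 2)` returns the single matching row, while `eager_scan(LINES, 2)` raises on the first line even though that line would have been filtered out. Note that `lazy_scan(LINES, 1)` still fails, so skipping non-matching rows hides the schema mismatch rather than fixing it.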

electrum commented 11 months ago

This should be easy to fix. There’s a README in the OpenX directory that explains the supported format (since it’s complicated). Assuming this isn’t already documented, update the document, add a test for the behavior, then change the code to implement it.