trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
9.85k stars 2.85k forks source link

Hive JSON and OpenX JSON native readers data corruption #22278

Open mxmarkovics opened 3 weeks ago

mxmarkovics commented 3 weeks ago

Trino's native readers are intended to be a 1:1 (bug for bug) replication of Hive's reader and they are, however, Trino had an additional wrapper layer via the GenericHiveRecordCursor which wrapped the results returned from Hive. The wrapper added some additional logic to handle the case where the Hive reader returned null unexpectedly (i.e. when it failed to parse a non-null value: example) and would throw an error instead of actually appending null. This additional logic was missed when Trino moved to the native readers and the GenericHiveRecordCursor wrapper layer was removed. So now if any of the Hive JSON, OpenX JSON, or Regex native readers fails to parse a non-null value (e.g. 'abc' as an int) it will append a null instead of noticing and throwing an error, which to silent data corruption. This issue is specific to these readers as the other native readers do not support parsing values to other types so there wasn't any parsing error to catch and return null instead.

A reproduction of the issue: Table:

describe "simple_json";
  Column  |  Type   | Extra | Comment
----------+---------+-------+---------
 alphabet | integer |       |

Current Trino reading the above table using OpenX JSON which contains a single column and row, e.g. {"alphabet":"abc"}

select * FROM "simple_json";
 alphabet
----------
     NULL
(1 row)

Old (using Hive readers) Trino with the same setup:

select * FROM "simple_json";
...
HIVE_BAD_DATA: Error reading field value: For input string: "abc"
mxmarkovics commented 3 weeks ago

cc @pettyjamesm

wendigo commented 3 weeks ago

cc @dain

mxmarkovics commented 3 weeks ago

This issue also causes a closely related correctness issue if a user has an is null/is not null filter on the problem column. For those queries, the results returned could have more (with is null filter) or less (with is not null filter) rows than expected instead of just failing.

mxmarkovics commented 3 weeks ago

Actually, looks like this is not a change in behavior for the Regex SerDe as it the Hive SerDe had it's own type validation which would prevent "unexpected" nulls from being returned which would trigger the GenericHiveRecordCursor to catch the unexpected null and throw an error. This is still undesirable behavior however and as part of the fix we should add a config option (disabled by default for consistency) to allow the native Regex SerDe to throw an error rather than return null in these cases. Hive and OpenX JSON SerDes are still impacted by this.