prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.08k stars 5.39k forks source link

Issue with druid selecting empty columns #23516

Open Yomanz opened 3 months ago

Yomanz commented 3 months ago

Your Environment

Expected Behavior

These queries should return the correct data from Apache Druid

Current Behavior

Running the query:

SELECT TABLE_CAT, TABLE_SCHEM, TABLE_NAME, COLUMN_NAME, DATA_TYPE,
  TYPE_NAME, COLUMN_SIZE, BUFFER_LENGTH, DECIMAL_DIGITS, NUM_PREC_RADIX,
  NULLABLE, REMARKS, COLUMN_DEF, SQL_DATA_TYPE, SQL_DATETIME_SUB,
  CHAR_OCTET_LENGTH, ORDINAL_POSITION, IS_NULLABLE,
  SCOPE_CATALOG, SCOPE_SCHEMA, SCOPE_TABLE,
  SOURCE_DATA_TYPE, IS_AUTOINCREMENT, IS_GENERATEDCOLUMN
FROM system.jdbc.columns
ORDER BY TABLE_CAT, TABLE_SCHEM, TABLE_NAME, ORDINAL_POSITION

via the JDBC driver.

Or running: SELECT * FROM table on the SQL CLIENT also gives the same error.

Error:

java.lang.NullPointerException: undefined
    at java.base/java.util.Objects.requireNonNull(Objects.java:221)
    at java.base/java.util.Optional.<init>(Optional.java:107)
    at java.base/java.util.Optional.of(Optional.java:120)
    at com.facebook.presto.druid.DruidMetadata.listTables(DruidMetadata.java:196)
    at com.facebook.presto.druid.DruidMetadata.listTableColumns(DruidMetadata.java:147)
    at com.facebook.presto.metadata.MetadataManager.listTableColumns(MetadataManager.java:562)
    at com.facebook.presto.metadata.MetadataListing.listTableColumns(MetadataListing.java:95)
    at com.facebook.presto.connector.system.jdbc.ColumnJdbcTable.cursor(ColumnJdbcTable.java:126)
    at com.facebook.presto.connector.system.SystemPageSourceProvider$1.cursor(SystemPageSourceProvider.java:130)
    at com.facebook.presto.split.MappedRecordSet.cursor(MappedRecordSet.java:53)
    at com.facebook.presto.spi.RecordPageSource.<init>(RecordPageSource.java:40)
    at com.facebook.presto.connector.system.SystemPageSourceProvider.createPageSource(SystemPageSourceProvider.java:109)
    at com.facebook.presto.spi.connector.ConnectorPageSourceProvider.createPageSource(ConnectorPageSourceProvider.java:55)
    at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:81)
    at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:263)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:441)
    at com.facebook.presto.operator.Driver.lambda$processFor$10(Driver.java:324)
    at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:750)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:317)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1079)
    at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:165)
    at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:621)
    at com.facebook.presto.$gen.Presto_0_288_15f14bb____20240818_134447_1.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

Possible Solution

I believe it might have something to do with the Apache Druid data having some rows with empty/nulled columns

Steps to Reproduce

  1. Setup Presto
  2. Link Apache Druid
  3. Attempt to run the above queries
  4. See the queries fail with internal errors.

Screenshots (if appropriate)

N/A

Context

Unable to use PrestoDB with tools such as Hex.tech as it cant get the schema.

tdcmeehan commented 3 months ago

Does the SELECT * FROM table query work if you fully specify the catalog and schema?

Yomanz commented 3 months ago

No, if I select the Catalog and Schema and run SELECT * FROM table LIMIT 100;, I get:


java.lang.NullPointerException: string is null
    at java.base/java.util.Objects.requireNonNull(Objects.java:246)
    at io.airlift.slice.Slices.copiedBuffer(Slices.java:291)
    at io.airlift.slice.Slices.utf8Slice(Slices.java:299)
    at com.facebook.presto.druid.DruidBrokerPageSource.getNextPage(DruidBrokerPageSource.java:154)
    at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:266)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:441)
    at com.facebook.presto.operator.Driver.lambda$processFor$10(Driver.java:324)
    at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:750)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:317)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1079)
    at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:165)
    at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:621)
    at com.facebook.presto.$gen.Presto_0_288_15f14bb____20240818_134447_1.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)```
Yomanz commented 3 months ago

However, I can query the information_schema TABLES table just fine.

tdcmeehan commented 3 months ago

@Yomanz I think these are probably simple fixes. However, unfortunately the Druid connector doesn't have end to end test queries in our CI. This is because, at the time, the Druid community did not provide an embedded instance of Druid that we could load up to run the end to end tests. (See https://github.com/prestodb/presto/pull/14042#issuecomment-582218710, CC @zhenxiao). I think before we fix these issues, we should make sure that we get a sensible baseline in CI, so that it doesn't regress at a later time. So I think we should look into wether or not the Druid community now has the capability to embed a Druid instance inside a running JVM. If so, I think we should add end to end tests.

This might sound like a lot of work, but actually it's not that hard to write the tests themselves, as we provide a framework for connectors to use that tests queries along a whole range of data from TPCH.

@Yomanz would you like to take this issue on? Otherwise, we can see if the community can pick this up.

Yomanz commented 3 months ago

@Yomanz I think these are probably simple fixes. However, unfortunately the Druid connector doesn't have end to end test queries in our CI. This is because, at the time, the Druid community did not provide an embedded instance of Druid that we could load up to run the end to end tests. (See #14042 (comment), CC @zhenxiao). I think before we fix these issues, we should make sure that we get a sensible baseline in CI, so that it doesn't regress at a later time. So I think we should look into wether or not the Druid community now has the capability to embed a Druid instance inside a running JVM. If so, I think we should add end to end tests.

This might sound like a lot of work, but actually it's not that hard to write the tests themselves, as we provide a framework for connectors to use that tests queries along a whole range of data from TPCH.

@Yomanz would you like to take this issue on? Otherwise, we can see if the community can pick this up.

If I can figure out a way to get it running easily I'll take a crack, if not will have to revert to the community 🤞

tdcmeehan commented 3 months ago

Sounds good, please reach out on Slack or here if you need help.