prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Presto Native DeltaLake Connector + s3 external_location unsupported error #18154

Open mchutani-CS opened 2 years ago

mchutani-CS commented 2 years ago

Hi devs, I'm pretty new to the ecosystem and was trying to set up Presto with Hive to query a Delta table located in Amazon S3. I'm running this setup on my Mac (Intel). Any guidance would be highly appreciated.

Presto version: 0.274
Hive version: 3.1.2

Query:

presto:default> CREATE TABLE delta.default.delta_data (dummyColumn INT) WITH (external_location = 's3://test-1');

RESULT:

Query 20220808_200017_00002_e4w6d failed: Catalog 'delta' does not support table property 'external_location'
com.facebook.presto.spi.PrestoException: Catalog 'delta' does not support table property 'external_location'
    at com.facebook.presto.metadata.AbstractPropertyManager.getProperties(AbstractPropertyManager.java:94)
    at com.facebook.presto.execution.CreateTableTask.internalExecute(CreateTableTask.java:189)
    at com.facebook.presto.execution.CreateTableTask.execute(CreateTableTask.java:87)
    at com.facebook.presto.execution.CreateTableTask.execute(CreateTableTask.java:69)
    at com.facebook.presto.execution.DDLDefinitionExecution.executeTask(DDLDefinitionExecution.java:58)
    at com.facebook.presto.execution.DataDefinitionExecution.start(DataDefinitionExecution.java:203)
    at com.facebook.presto.$gen.Presto_0_274_f87057c____20220808_195942_1.run(Unknown Source)
    at com.facebook.presto.execution.SqlQueryManager.createQuery(SqlQueryManager.java:266)
    at com.facebook.presto.dispatcher.LocalDispatchQuery.lambda$startExecution$8(LocalDispatchQuery.java:197)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

Here's my config:

HIVE: conf/hive-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- These are default values meant to allow easy smoke testing of the metastore.  You will
likely need to add a number of new values. -->
<configuration>
  <property>
    <name>metastore.thrift.uris</name>
    <value>thrift://localhost:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
  <property>
    <name>metastore.task.threads.always</name>
    <value>org.apache.hadoop.hive.metastore.events.EventCleanerTask</value>
  </property>
  <property>
    <name>metastore.expression.proxy</name>
    <value>org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
  <property>
    <name>hive.metastore.event.db.notification.api.auth</name>
    <value>false</value>
  </property>
  <property>
    <name>fs.s3.aws-secret-key</name>
    <value>secret-key</value>
  </property>
  <property>
    <name>fs.s3.aws-access-key</name>
    <value>access-key</value>
  </property>
  <property>
    <name>fs.s3.endpoint</name>
    <value>https://s3.amazonaws.com</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>

Presto: etc/node.properties

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/usr/local/var/presto/data

etc/jvm.config

-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true

etc/config.properties

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080

etc/catalog/delta.properties

connector.name=delta
hive.metastore.uri=thrift://localhost:9083
hive.s3.aws-access-key=access-key
hive.s3.aws-secret-key=secret-key
hive.s3.endpoint=https://s3.amazonaws.com

agrawalreetika commented 2 years ago

Hi @mchutani-CS, as per your configuration you are trying to run CREATE TABLE via the Delta connector, which is not supported. Please run the CREATE TABLE command via the Hive connector instead: https://prestodb.io/docs/current/connector/hive.html#configuration
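
For reference, a minimal sketch of a Hive catalog file that would enable this, reusing the metastore URI and S3 credentials from the delta.properties above (the file name etc/catalog/hive.properties and all values are placeholders to adapt to your setup):

etc/catalog/hive.properties

# Hive connector backed by the same metastore as the Delta catalog
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
# S3 credentials for reading/writing the external location
hive.s3.aws-access-key=access-key
hive.s3.aws-secret-key=secret-key
hive.s3.endpoint=https://s3.amazonaws.com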

mchutani-CS commented 2 years ago

Hi @agrawalreetika, thanks for your response. My use case is to test the read performance of PrestoDB using the Delta connector, since my Delta table resides in cloud storage (S3, GCS). As per my understanding, if I use the Delta connector, the Hive metastore is used to store metadata about the Delta table. So with this CREATE TABLE command, I was trying to update the Hive metastore with the metadata about my Delta table. Do you have a suggestion on how to go about it? Should I use a combination of the Delta and Hive connectors?

javrod87 commented 2 years ago

The Trino fork of Presto has a Delta Lake connector that supports writes. You might want to give that a try.

agrawalreetika commented 2 years ago

Hi @mchutani-CS, you can use the CREATE TABLE command via the Hive connector to register your Delta tables in HMS, and then use the Presto Delta connector to access the data. Ref - https://prestodb.io/docs/current/connector/deltalake.html. Let me know if you have any questions.
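
A sketch of that two-step flow, assuming a Hive catalog named hive alongside the delta catalog; the schema, table name, and S3 path below are placeholders echoing the earlier example, not exact commands from the docs:

-- 1) Register the Delta table location in HMS through the Hive connector.
--    The dummy column only satisfies CREATE TABLE's column-list requirement;
--    the Delta connector reads the real schema from the Delta transaction log.
CREATE TABLE hive.default.delta_data (dummyColumn INT)
WITH (external_location = 's3://test-1/delta_data/');

-- 2) Read the registered table through the Delta connector.
SELECT * FROM delta.default.delta_data LIMIT 10;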

mchutani-CS commented 2 years ago

Hi @agrawalreetika,

Thank you for your suggestion. I'm going to try it out. I have a question: are there any read performance benefits if I register my Delta tables in HMS vs directly accessing them through $path, like this -> SELECT * FROM sales."$path$"."s3://db-sa-datasets/presto/sales_data" LIMIT 200;?

agrawalreetika commented 2 years ago

Hi @mchutani-CS, basically when you register the table in HMS, the presto-delta connector will get the location of the Delta table from HMS, and after that the flow for metadata and data access is the same as when you access the table using $path. So as long as the HMS call for getting the location is not a bottleneck, there shouldn't be any performance difference.
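
Concretely, the two access paths being compared look like this (the schema, table name, and S3 path are placeholders echoing the $path example above):

-- Table registered in HMS: the connector looks up the Delta table location there
SELECT * FROM delta.sales.sales_data LIMIT 200;

-- Direct $path access: the location is given in the query, no HMS lookup needed
SELECT * FROM sales."$path$"."s3://db-sa-datasets/presto/sales_data" LIMIT 200;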

lohithalla commented 2 years ago

Hi @agrawalreetika, I was able to query the Delta table present in S3, but it is taking a very long time. Looking at the Presto logs, some errors were raised before the query execution completed.

Caused by: com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: N2A11KNDX3BQEEMM; S3 Extended Request ID: l8bMzArIpHdUDBxerMYF1xANqEc7ikQruC57so9TbCD5VrHlfOacjjtzIMD4FhslN8dkjzxtvNE=), S3 Extended Request ID: l8bMzArIpHdUDBxerMYF1xANqEc7ikQruC57so9TbCD5VrHlfOacjjtzIMD4FhslN8dkjzxtvNE= (Path: s3a:///sample/_delta_log/_last_checkpoint)

agrawalreetika commented 2 years ago

Hi @lohithalla, this issue is fixed by PR #18307, which is part of the latest release, 0.277. Could you please try it out there?

lohithalla commented 2 years ago

@agrawalreetika I am using the latest ahanio/presto image and still facing the issue. Is that fix not available in 0.277?

agrawalreetika commented 2 years ago

Yes, it should be in 0.277

lohithalla commented 2 years ago

No, I don't think so... I just checked the 0.277 release notes as well, and this change isn't mentioned there.

agrawalreetika commented 2 years ago

I missed this; I just rechecked the commits, and it will come in presto-0.278. Sorry for the confusion.

medams commented 2 years ago

@agrawalreetika I was trying to set up Presto with the Hive connector to query data from an S3 bucket, following https://prestodb.io/docs/current/connector/hive.html to develop the POC. I was able to create the hive.web.page_views table and list the partitions of the page_views table, but creating an external Hive table failed with ERROR: External location must be a directory. Would you please offer some suggestions? Thanks in advance.

Query:

CREATE TABLE hive.web.request_logs (
  request_time timestamp,
  url varchar,
  ip varchar,
  user_agent varchar
)
WITH (
  format = 'TEXTFILE',
  external_location = 's3://my-bucket/data/logs/'
)

ERROR:

com.facebook.presto.spi.PrestoException: External location must be a directory
    at com.facebook.presto.hive.HiveMetadata.getExternalPath(HiveMetadata.java:1308)
    at com.facebook.presto.hive.HiveMetadata.prepareTable(HiveMetadata.java:1011)
    at com.facebook.presto.hive.HiveMetadata.createTable(HiveMetadata.java:958)
    at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorMetadata.createTable(ClassLoaderSafeConnectorMetadata.java:356)
    at com.facebook.presto.metadata.MetadataManager.createTable(MetadataManager.java:656)
    at com.facebook.presto.execution.CreateTableTask.internalExecute(CreateTableTask.java:201)
    at com.facebook.presto.execution.CreateTableTask.execute(CreateTableTask.java:87)
    at com.facebook.presto.execution.CreateTableTask.execute(CreateTableTask.java:69)
    at com.facebook.presto.execution.DDLDefinitionExecution.executeTask(DDLDefinitionExecution.java:58)
    at com.facebook.presto.execution.DataDefinitionExecution.start(DataDefinitionExecution.java:209)
    at com.facebook.presto.$gen.Presto_0_277_119fd9a____20221020_192344_1.run(Unknown Source)
    at com.facebook.presto.execution.SqlQueryManager.createQuery(SqlQueryManager.java:286)
    at com.facebook.presto.dispatcher.LocalDispatchQuery.lambda$startExecution$8(LocalDispatchQuery.java:197)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

agrawalreetika commented 2 years ago

Hi @medams, could you please let me know which version of Presto you are using?

medams commented 2 years ago

Versions I'm using:
Presto: 0.277
Hadoop: 3.3.1
Hive: 2.3.9

I'm able to create the table when there is one file under the directory. If there are multiple files under one directory, we get com.facebook.presto.spi.PrestoException: External location must be a directory.

agrawalreetika commented 2 years ago

@medams, I tried reproducing the issue on my end but was not able to. Using Presto 0.277, I can create an external table on S3 data without any problem. I assume the S3 location already exists before you create the external table via Presto?