prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.02k stars 5.36k forks source link

Planning time and Execution time is too long #21441

Open CRUDKanade opened 11 months ago

CRUDKanade commented 11 months ago

Presto Version : 0.284 Table DDLs :

CREATE EXTERNAL TABLE ods_eport.ods_eport_aed_contrast_city_df( area_code string COMMENT '地区编码', area_name string COMMENT '地区名称', city_name string COMMENT '地市名称') PARTITIONED BY ( ingestion_time string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION 'hdfs://zwy/warehouse/tablespace/external/hive/ods_eport.db/ods_eport_aed_contrast_city_df' TBLPROPERTIES ( 'TRANSLATED_TO_EXTERNAL'='TRUE', 'bucketing_version'='2', 'external.table.purge'='TRUE', 'spark.sql.create.version'='2.2 or prior', 'spark.sql.sources.schema.numPartCols'='1', 'spark.sql.sources.schema.numParts'='1', 'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"area_code","type":"string","nullable":true,"metadata":{"comment":"地区编码"}},{"name":"area_name","type":"string","nullable":true,"metadata":{"comment":"地区名称"}},{"name":"city_name","type":"string","nullable":true,"metadata":{"comment":"地市名称"}},{"name":"ingestion_time","type":"string","nullable":true,"metadata":{}}]}', 'spark.sql.sources.schema.partCol.0'='ingestion_time', 'transient_lastDdlTime'='1678761651')

Quarries: select * from hive.ods_eport.ods_eport_aed_contrast_city_df; Current configs, JVM Flags : jdk1.8 Query Explain plans: image image image image

coordinator -server -Xmx8G -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError -XX:+ExitOnOutOfMemoryError

coordinator=true node-scheduler.include-coordinator=false http-server.http.port=8070 query.max-memory=22GB query.max-memory-per-node=5GB query.max-total-memory-per-node=5GB discovery.uri=http://10.241.9.42:8070 discovery-server.enabled=true experimental.reserved-pool-enabled=false

*worker 2** -server -Xmx9G -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError -XX:+ExitOnOutOfMemoryError

coordinator=false node-scheduler.include-coordinator=true http-server.http.port=8070 query.max-memory=22GB query.max-memory-per-node=6GB query.max-total-memory-per-node=6GB discovery.uri=http://10.241.9.42:8070 experimental.reserved-pool-enabled=false

imjalpreet commented 10 months ago

@CRUDKanade Can you please give some more details like number of partitions in the table, average number of files in each partition, average size of each file?

fuyun2024 commented 10 months ago

@CRUDKanade您能否提供更多详细信息,例如表中的分区数、每个分区中的平均文件数、每个文件的平均大小?

Hi @imjalpreet . I am currently testing how Presto reads from a Delta Lake table and have observed an issue similar to yours. The planning time of Presto's queries typically takes around 5 to 10 seconds. I have not yet delved into the code, so I'm unsure whether this latency is due to a bug or a design issue. Moreover, I've also found instances where Spark reads from the Delta Lake table faster than Presto does. Adding to this, if the same dataset exists in two versions, one as a Hive table and the other as a Delta table, Presto's querying speed on the Hive table significantly surpasses that of the Delta table. To date, I have not seen Presto complete a query on the Delta Lake table in under ten seconds.

partition number: 2000 partition file number: 5 ~ 20 partition file average size : 128MB

Akanksha-kedia commented 10 months ago

@CRUDKanade, can you tell how did you get the runtime stats ?