CRUDKanade opened 11 months ago
@CRUDKanade Can you please give some more details like number of partitions in the table, average number of files in each partition, average size of each file?
Hi @imjalpreet . I am currently testing how Presto reads from a Delta Lake table and have observed an issue similar to yours. The planning time of Presto's queries typically takes around 5 to 10 seconds. I have not yet delved into the code, so I'm unsure whether this latency is due to a bug or a design issue. Moreover, I've also found instances where Spark reads from the Delta Lake table faster than Presto does. Adding to this, if the same dataset exists in two versions, one as a Hive table and the other as a Delta table, Presto's querying speed on the Hive table significantly surpasses that of the Delta table. To date, I have not seen Presto complete a query on the Delta Lake table in under ten seconds.
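For what it's worth, one low-effort way to separate planning cost from execution cost is to time a bare `EXPLAIN`, which plans the query (including metadata and Delta transaction-log reads) without executing it. The catalog and table names below are placeholders, not tables from this issue:

```sql
-- Planning-only round trip: EXPLAIN produces a plan without running the query,
-- so its latency is dominated by planning and metadata access.
-- `delta.my_schema.my_table` is a placeholder for an actual Delta table.
EXPLAIN SELECT * FROM delta.my_schema.my_table;
```

If this round trip alone takes 5–10 seconds, the latency is in planning rather than in split execution.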
- partition count: 2000
- files per partition: 5–20
- average file size: 128 MB
@CRUDKanade, can you tell us how you got the runtime stats?
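For reference, one way to pull per-query timing stats is the `system.runtime.queries` table. This is a sketch; the exact columns exposed vary between Presto versions, so verify them first:

```sql
-- List recent queries with their timing stats.
-- Column names are assumptions; confirm with:
--   SHOW COLUMNS FROM system.runtime.queries;
SELECT query_id, state, queued_time_ms, analysis_time_ms, query
FROM system.runtime.queries
ORDER BY created DESC
LIMIT 10;
```

The per-query JSON from the coordinator web UI also breaks down planning versus execution time.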
Presto version: 0.284

Table DDL:

```sql
CREATE EXTERNAL TABLE ods_eport.ods_eport_aed_contrast_city_df (
  area_code string COMMENT '地区编码',
  area_name string COMMENT '地区名称',
  city_name string COMMENT '地市名称'
)
PARTITIONED BY (ingestion_time string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs://zwy/warehouse/tablespace/external/hive/ods_eport.db/ods_eport_aed_contrast_city_df'
TBLPROPERTIES (
  'TRANSLATED_TO_EXTERNAL'='TRUE',
  'bucketing_version'='2',
  'external.table.purge'='TRUE',
  'spark.sql.create.version'='2.2 or prior',
  'spark.sql.sources.schema.numPartCols'='1',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"area_code","type":"string","nullable":true,"metadata":{"comment":"地区编码"}},{"name":"area_name","type":"string","nullable":true,"metadata":{"comment":"地区名称"}},{"name":"city_name","type":"string","nullable":true,"metadata":{"comment":"地市名称"}},{"name":"ingestion_time","type":"string","nullable":true,"metadata":{}}]}',
  'spark.sql.sources.schema.partCol.0'='ingestion_time',
  'transient_lastDdlTime'='1678761651'
);
```

Query:

```sql
select * from hive.ods_eport.ods_eport_aed_contrast_city_df;
```

Current configs and JVM flags (JDK 1.8):
Coordinator JVM flags:

```
-server
-Xmx8G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
```

Coordinator config.properties:

```
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8070
query.max-memory=22GB
query.max-memory-per-node=5GB
query.max-total-memory-per-node=5GB
discovery.uri=http://10.241.9.42:8070
discovery-server.enabled=true
experimental.reserved-pool-enabled=false
```

Worker 2 JVM flags:

```
-server
-Xmx9G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
```

Worker config.properties:

```
coordinator=false
node-scheduler.include-coordinator=true
http-server.http.port=8070
query.max-memory=22GB
query.max-memory-per-node=6GB
query.max-total-memory-per-node=6GB
discovery.uri=http://10.241.9.42:8070
experimental.reserved-pool-enabled=false
```