
[Already migrated to goto/javaway]

Collected Presto practice notes #24

Closed vieyahn2017 closed 5 years ago

vieyahn2017 commented 5 years ago

Presto practice notes

vieyahn2017 commented 5 years ago

A research summary of Presto (by 朱忠友). Date: 2017-08-15 11:13 · 203 views · 4 replies

With Lao Jiang's support, we finished setting up Presto on 9 servers. We hit many pitfalls at first and Presto's stability was poor; after several rounds of debugging and parameter tuning, its stability has improved greatly.

Test setup: table dwr_ad_exposure_dt0, row count 27,739,256,535 (about 27.7 billion rows), table size 5.8 TB, stored as Parquet files.

Test query 1:

select device_type,count(1) cnt from dmp_mkt.dwr_ad_exposure_dt0 where l_date>='2017-07-06' and l_date<='2017-08-06' group by device_type order by cnt desc limit 10;

Test query 2:

select count(1) from dmp_mkt.dwr_ad_exposure_dt0 where l_date>'2017-07-06' and l_date<='2017-08-06';

Test query 3:

select count(1) from dmp_mkt.dwr_ad_exposure_dt0;

| Test case | Presto run 1 (s) | Presto run 2 (s) | Presto avg (s) | spark-beeline run 1 (s) | spark-beeline run 2 (s) | spark-beeline avg (s) | Notes |
|---|---|---|---|---|---|---|---|
| Query 1 | 7 | 6 | 6.5 | 45.5 | 29 | 37.25 | Presto on 9 servers; spark-beeline on 68 servers |
| Query 2 | 6 | 6 | 6 | 47.7 | 12.5 | 30.1 | |
| Query 3 | 96 | 97 | 96.5 | 240.6 | 220.8 | 230.7 | |

Conclusion: on 27.7 billion rows, even though Presto ran on far fewer servers than spark-beeline, its performance was excellent — roughly 2.4x to 5.7x faster on these queries — leaving spark-beeline far behind.
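The speedup range can be recomputed directly from the averages in the table (a quick sketch; the numbers are copied from the table above):

```python
# Average query times (seconds) copied from the comparison table above.
presto_avg = [6.5, 6.0, 96.5]     # queries 1-3 on Presto (9 servers)
spark_avg = [37.25, 30.1, 230.7]  # queries 1-3 on spark-beeline (68 servers)

# Per-query speedup of Presto over spark-beeline.
speedups = [s / p for p, s in zip(presto_avg, spark_avg)]
print([round(x, 1) for x in speedups])  # [5.7, 5.0, 2.4]
```

So the single-table gains here range from about 2.4x (full count) to 5.7x (filtered aggregation).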

Since time was limited, only these single-table tests were run today. Over the next couple of days I will add Presto nodes and run more complex multi-table joins to further test Presto's stability, tune its performance, and validate its usage scenarios.

vieyahn2017 commented 5 years ago

One comment: [All the Presto committers at a certain company have already switched to SparkSQL]

vieyahn2017 commented 5 years ago

Query performance comparison: Presto vs. Spark SQL

2018-09-20 15:53:41 · 蚁方阵 · 1531 reads. Copyright notice: this is the author's original post; do not repost without permission. https://blog.csdn.net/yiifaa/article/details/82788727

1. Data warehouse environment

| Metric | Value |
|---|---|
| Total data volume | 24 T |
| Number of partitions | 24 |
| Storage format | Text |
| Spark SQL version | 2.2.0-218 |
| Presto version | Presto CLI 04fb3c3-dirty |
2. Paged queries. All times below are in seconds; the SQL executed is:
    select * from mydb 
    where year='2018' and month='09' and day='09' 
    and remote_addr='117.136.68.11' 
    and host like '%poi%'  limit 10;

| Engine | limit 10 | limit 100 | limit 200 |
|---|---|---|---|
| Presto | 2 | 23 | 42 |
| Spark SQL | 34.08 | 237.395 | 120 |
| Hive | 4413 | skipped | skipped |

I also noticed that Presto pages its output, which makes interactive queries feel even faster: results start appearing almost as soon as the command is entered, so its interactivity is clearly ahead.

3. Full result sets. Compared with the paged queries, the limit clause was removed and a counting query was added:

Test 1

select * from mydb where year='2018' and month='09' and day='09' 
and remote_addr='117.136.68.11' and host like '%poi%';

Test 2

select count(*) from mydb where year='2018' and month='09' and day='09' 
and remote_addr='117.136.68.11' and host like '%poi%';

The execution results are shown in the table below:

| Engine | Full result set (s) |
|---|---|
| Presto | 2080 |
| Spark SQL | 661.493 |

4. Other observations. Compared with Spark SQL, Presto has a faster startup time.

Compared with Presto, Spark SQL makes it easier to set the number of execution nodes, so scaling it up is easier. The table below shows how Spark SQL's time varies with the number of executors.

 spark-sql --num-executors 10

| Executors | Time (s) |
|---|---|
| 100 | 661.493 |
| 24 | 662.734 |
| 10 | 625.403 |

The table suggests that Spark SQL's executor count should match the partition count as closely as possible (more thorough tests are underway); anything beyond that is just wasted resources.

Compared with Presto, even when the number of execution nodes is equal (10), Spark SQL's full-scan speed is still far ahead.

5. Conclusion. Presto has a clear advantage for paged queries, but when full result sets or counts over the data are needed, Spark SQL leaves Presto far behind.

vieyahn2017 commented 5 years ago

Presto performance tuning: a case study

Date: 2018-02-06 10:07 · 131 views · 1 reply

Comparing the two SQL statements in the Presto logs shows that they scan different amounts of data: one scans 232 GB, the other 89 GB.

That explains the large performance gap, so the change is effective.

The dimension-table query has the same problem: it scans 264 GB

and takes 14 s.

Please verify and apply the change as soon as possible.

2018-02-05 14:14

Execution elapsed time: 1.27 min

SELECT "F_FCST_W"."bucket_id" AS "bucket_id",
  SUM((CASE WHEN (("F_FCST_W"."wtf_type" = 'REGION_FCST') AND ("F_FCST_W"."plan_id" = "t0"."x_measure__0")) THEN "F_FCST_W"."qty" ELSE 0 END)) AS "sum_current_hq_atp",
  SUM(CAST(NULL AS BIGINT)) AS "sum_calculation_39",
  SUM((CASE WHEN ("F_FCST_W"."wtf_type" = 'REGION_FCST') THEN (CASE WHEN 1000 = 0 THEN CAST(NULL AS DOUBLE) ELSE CAST("F_FCST_W"."qty" AS DOUBLE) / 1000 END) ELSE CAST(NULL AS DOUBLE) END)) AS "sum_calculation_96"
FROM (
  select 
  a.*, 
  b.USERS_ID

  from 
  dmp_dmiplan.dm_iplan_rpt_sell_out_fcst_week_f a,
   ( select  distinct  USERS_ID, prod_model , prod_area,prod_family ,prod_line, REGION_ORG_CODE ,BRANCH, COUNTRY
      from dmp_dmiplan.dm_iplan_user_auth_d  ) b
   where  b.prod_line is not null and                                                               
  ( A.lv5_prod_list_code = b.prod_model or b.prod_model = 'all' ) and                                      
  ( A.lv4_prod_list_code = b.prod_family or b.prod_family = 'all' ) and                                   
  ( A.lv3_prod_list_code = b.prod_area or b.prod_area = 'all' ) and                                         
  ( A.lv1_prod_list_code = b.prod_line or b.prod_line = 'all' ) and                                         
  ( A.REGION_ORG_CODE = b.REGION_ORG_CODE or b.REGION_ORG_CODE = 'ALL' )
) "F_FCST_W"
  LEFT JOIN "dmp_dmiplan"."dim_week_plan_id" "D_PLANID_W" ON ("F_FCST_W"."plan_id" = "D_PLANID_W"."plan_id")
  LEFT JOIN "dmp_dmiplan"."dim_week_middle_lv3_name" "D_MIDDLE_LV3_W" ON ("F_FCST_W"."middle_lv3_name" = "D_MIDDLE_LV3_W"."middle_lv3_name")
  LEFT JOIN "dmp_dmiplan"."dim_week_prod_en_name" "D_PROD_MODEL_W" ON ("F_FCST_W"."prod_en_name" = "D_PROD_MODEL_W"."prod_en_name")
  LEFT JOIN "dmp_dmiplan"."dim_week_method_type" "D_METHODTYPE_W" ON ("F_FCST_W"."method_type" = "D_METHODTYPE_W"."method_type")
  LEFT JOIN "dmp_dmiplan"."dim_week_lv5_prod_list_en_name" "D_LV5_W" ON ("F_FCST_W"."lv5_prod_list_en_name" = "D_LV5_W"."lv5_prod_list_en_name")
  LEFT JOIN "dmp_dmiplan"."dim_week_lv3_prod_list_en_name" "D_LV3_W" ON ("F_FCST_W"."lv3_prod_list_en_name" = "D_LV3_W"."lv3_prod_list_en_name")
  LEFT JOIN "dmp_dmiplan"."dim_week_company_brand" "D_BRAND_W" ON ("F_FCST_W"."company_brand" = "D_BRAND_W"."company_brand")
  LEFT JOIN "dmp_dmiplan"."dim_week_bucket_id" "D_BUCKET_W" ON ("F_FCST_W"."bucket_id" = "D_BUCKET_W"."bucket_id")
  LEFT JOIN "dmp_dmiplan"."dim_week_region_org_en_name" "D_REGION_W" ON ("F_FCST_W"."region_org_en_name" = "D_REGION_W"."region_org_en_name")
  CROSS JOIN (
  SELECT MAX("F_FCST_W"."plan_id") AS "x_measure__0",
    COUNT(1) AS "x__alias__0"
  FROM (
    select 
    a.*, 
    b.USERS_ID
    from 
    dmp_dmiplan.dm_iplan_rpt_sell_out_fcst_week_f a,
     ( select  distinct  USERS_ID, prod_model , prod_area,prod_family ,prod_line, REGION_ORG_CODE ,BRANCH, COUNTRY
        from dmp_dmiplan.dm_iplan_user_auth_d  ) b
     where  b.prod_line is not null and                                                               
    ( A.lv5_prod_list_code = b.prod_model or b.prod_model = 'all' ) and                                      
    ( A.lv4_prod_list_code = b.prod_family or b.prod_family = 'all' ) and                                   
    ( A.lv3_prod_list_code = b.prod_area or b.prod_area = 'all' ) and                                         
    ( A.lv1_prod_list_code = b.prod_line or b.prod_line = 'all' ) and                                         
    ( A.REGION_ORG_CODE = b.REGION_ORG_CODE or b.REGION_ORG_CODE = 'ALL' )
  ) "F_FCST_W"
  WHERE ("F_FCST_W"."users_id" = 'l00359001')
  HAVING (COUNT(1) > 0)
) "t0"
WHERE (("F_FCST_W"."users_id" = 'l00359001') AND (("F_FCST_W"."bom_type" IN ('', 'ATO FG', 'ATO MF', 'Invalid', 'MTO FG', 'MTO MF', 'MTS FG', 'N/A', 'PBOM', 'VIRTUAL')) OR ("F_FCST_W"."bom_type" IS NULL)) AND (("F_FCST_W"."lv1_prod_list_en_name" IN ('ALL', 'Handset', 'N/A')) OR ("F_FCST_W"."lv1_prod_list_en_name" IS NULL)) AND (("F_FCST_W"."lv2_prod_list_en_name" IN ('ALL', 'Smart Phone')) OR ("F_FCST_W"."lv2_prod_list_en_name" IS NULL)) AND ("F_FCST_W"."lv3_prod_list_en_name" = 'Mate series') AND ("F_FCST_W"."plan_id" = '20180205') AND ((CASE WHEN ("F_FCST_W"."region_org_en_name" IN ('China Region', 'Consumer BG Greater China Region', 'Greater China Device Business Dept')) THEN ' Greater China Device Business Dept' WHEN ("F_FCST_W"."region_org_en_name" IN ('CEE & Nordic European Device Business Dept', 'CEE & Nordic European Region')) THEN 'CEE & Nordic European Device Business Dept ' WHEN ("F_FCST_W"."region_org_en_name" IN ('Canada Device Business Dept', 'Canada Rep Office')) THEN 'Canada Device Business Dept ' WHEN ("F_FCST_W"."region_org_en_name" IN ('India Device Business Dept', 'India Rep Office')) THEN 'India Device Business Dept' WHEN ("F_FCST_W"."region_org_en_name" IN ('Latin America Device Business Dept', 'Latin America Region')) THEN 'Latin America Device Business Dept' WHEN ("F_FCST_W"."region_org_en_name" IN ('Russia Device Business Dept', 'Russia Rep Office')) THEN 'Russia Device Business Dept' WHEN ("F_FCST_W"."region_org_en_name" IN ('Southern Pacific Device Business Dept', 'Southern Pacific Region')) THEN 'Southern Pacific Device Business Dept ' WHEN ("F_FCST_W"."region_org_en_name" IN ('Southern-East Asia Device Business Dept', 'Southern-East Asia Region')) THEN 'Southern-East Asia Device Business Dept ' WHEN ("F_FCST_W"."region_org_en_name" IN ('West European Device Business Dept', 'West European Region')) THEN 'West European Device Business Dept ' ELSE "F_FCST_W"."region_org_en_name" END) = ' Greater China Device Business Dept'))
GROUP BY 1

2018-02-06 09:42

Running it again today took 37 s.

After rewriting it as follows, it takes only 3.9 s.

SELECT "F_FCST_W"."bucket_id" AS "bucket_id",
  SUM((CASE WHEN (("F_FCST_W"."wtf_type" = 'REGION_FCST') AND ("F_FCST_W"."plan_id" = "t0"."x_measure__0")) THEN "F_FCST_W"."qty" ELSE 0 END)) AS "sum_current_hq_atp",
  SUM(CAST(NULL AS BIGINT)) AS "sum_calculation_39",
  SUM((CASE WHEN ("F_FCST_W"."wtf_type" = 'REGION_FCST') THEN (CASE WHEN 1000 = 0 THEN CAST(NULL AS DOUBLE) ELSE CAST("F_FCST_W"."qty" AS DOUBLE) / 1000 END) ELSE CAST(NULL AS DOUBLE) END)) AS "sum_calculation_96"
FROM (
  select
  a.*,
  b.USERS_ID
  from
  dmp_dmiplan.dm_iplan_rpt_sell_out_fcst_week_f a,
   ( select  distinct  USERS_ID, prod_model , prod_area,prod_family ,prod_line, REGION_ORG_CODE ,BRANCH, COUNTRY
      from dmp_dmiplan.dm_iplan_user_auth_d where users_id = 'l00359001' ) b
   where  b.prod_line is not null and
          (CASE WHEN A.lv5_prod_list_code = b.prod_model THEN 1
WHEN b.prod_model = 'all' THEN 1 ELSE 2
END) = 1   and
                    (CASE WHEN  A.lv4_prod_list_code = b.prod_family THEN 1
WHEN b.prod_family = 'all' THEN 1 ELSE 2
END) = 1   and
 (CASE WHEN  A.lv3_prod_list_code = b.prod_area THEN 1
WHEN b.prod_area = 'all' THEN 1 ELSE 2
END) = 1   and
        (CASE WHEN  A.lv1_prod_list_code = b.prod_line THEN 1
WHEN b.prod_line = 'all' THEN 1 ELSE 2
END) = 1   and                           
       (CASE WHEN  A.REGION_ORG_CODE = b.REGION_ORG_CODE THEN 1
WHEN b.REGION_ORG_CODE = 'ALL' THEN 1 ELSE 2
END) = 1
) "F_FCST_W"
  LEFT JOIN "dmp_dmiplan"."dim_week_plan_id" "D_PLANID_W" ON ("F_FCST_W"."plan_id" = "D_PLANID_W"."plan_id")
  LEFT JOIN "dmp_dmiplan"."dim_week_middle_lv3_name" "D_MIDDLE_LV3_W" ON ("F_FCST_W"."middle_lv3_name" = "D_MIDDLE_LV3_W"."middle_lv3_name")
  LEFT JOIN "dmp_dmiplan"."dim_week_prod_en_name" "D_PROD_MODEL_W" ON ("F_FCST_W"."prod_en_name" = "D_PROD_MODEL_W"."prod_en_name")
  LEFT JOIN "dmp_dmiplan"."dim_week_method_type" "D_METHODTYPE_W" ON ("F_FCST_W"."method_type" = "D_METHODTYPE_W"."method_type")
  LEFT JOIN "dmp_dmiplan"."dim_week_lv5_prod_list_en_name" "D_LV5_W" ON ("F_FCST_W"."lv5_prod_list_en_name" = "D_LV5_W"."lv5_prod_list_en_name")
  LEFT JOIN "dmp_dmiplan"."dim_week_lv3_prod_list_en_name" "D_LV3_W" ON ("F_FCST_W"."lv3_prod_list_en_name" = "D_LV3_W"."lv3_prod_list_en_name")
  LEFT JOIN "dmp_dmiplan"."dim_week_company_brand" "D_BRAND_W" ON ("F_FCST_W"."company_brand" = "D_BRAND_W"."company_brand")
  LEFT JOIN "dmp_dmiplan"."dim_week_bucket_id" "D_BUCKET_W" ON ("F_FCST_W"."bucket_id" = "D_BUCKET_W"."bucket_id")
  LEFT JOIN "dmp_dmiplan"."dim_week_region_org_en_name" "D_REGION_W" ON ("F_FCST_W"."region_org_en_name" = "D_REGION_W"."region_org_en_name")
  CROSS JOIN (
  SELECT MAX("F_FCST_W"."plan_id") AS "x_measure__0",
    COUNT(1) AS "x__alias__0"
  FROM (
    select
    a.*,
    b.USERS_ID
    from
    dmp_dmiplan.dm_iplan_rpt_sell_out_fcst_week_f a,
     ( select  distinct  USERS_ID, prod_model , prod_area,prod_family ,prod_line, REGION_ORG_CODE ,BRANCH, COUNTRY
        from dmp_dmiplan.dm_iplan_user_auth_d where users_id = 'l00359001' ) b
     where  b.prod_line is not null and
          (CASE WHEN A.lv5_prod_list_code = b.prod_model THEN 1
WHEN b.prod_model = 'all' THEN 1 ELSE 2
END) = 1   and
                    (CASE WHEN  A.lv4_prod_list_code = b.prod_family THEN 1
WHEN b.prod_family = 'all' THEN 1 ELSE 2
END) = 1   and
 (CASE WHEN  A.lv3_prod_list_code = b.prod_area THEN 1
WHEN b.prod_area = 'all' THEN 1 ELSE 2
END) = 1   and
        (CASE WHEN  A.lv1_prod_list_code = b.prod_line THEN 1
WHEN b.prod_line = 'all' THEN 1 ELSE 2
END) = 1   and                           
       (CASE WHEN  A.REGION_ORG_CODE = b.REGION_ORG_CODE THEN 1
WHEN b.REGION_ORG_CODE = 'ALL' THEN 1 ELSE 2
END) = 1 
  ) "F_FCST_W"
  WHERE ("F_FCST_W"."users_id" = 'l00359001')
  HAVING (COUNT(1) > 0)
) "t0"
WHERE (("F_FCST_W"."users_id" = 'l00359001') AND (("F_FCST_W"."bom_type" IN ('', 'ATO FG', 'ATO MF', 'Invalid', 'MTO FG', 'MTO MF', 'MTS FG', 'N/A', 'PBOM', 'VIRTUAL')) OR ("F_FCST_W"."bom_type" IS NULL)) AND (("F_FCST_W"."lv1_prod_list_en_name" IN ('ALL', 'Handset', 'N/A')) OR ("F_FCST_W"."lv1_prod_list_en_name" IS NULL)) AND (("F_FCST_W"."lv2_prod_list_en_name" IN ('ALL', 'Smart Phone')) OR ("F_FCST_W"."lv2_prod_list_en_name" IS NULL)) AND ("F_FCST_W"."lv3_prod_list_en_name" = 'Mate series') AND ("F_FCST_W"."plan_id" = '20180205') AND ((CASE WHEN ("F_FCST_W"."region_org_en_name" IN ('China Region', 'Consumer BG Greater China Region', 'Greater China Device Business Dept')) THEN ' Greater China Device Business Dept' WHEN ("F_FCST_W"."region_org_en_name" IN ('CEE & Nordic European Device Business Dept', 'CEE & Nordic European Region')) THEN 'CEE & Nordic European Device Business Dept ' WHEN ("F_FCST_W"."region_org_en_name" IN ('Canada Device Business Dept', 'Canada Rep Office')) THEN 'Canada Device Business Dept ' WHEN ("F_FCST_W"."region_org_en_name" IN ('India Device Business Dept', 'India Rep Office')) THEN 'India Device Business Dept' WHEN ("F_FCST_W"."region_org_en_name" IN ('Latin America Device Business Dept', 'Latin America Region')) THEN 'Latin America Device Business Dept' WHEN ("F_FCST_W"."region_org_en_name" IN ('Russia Device Business Dept', 'Russia Rep Office')) THEN 'Russia Device Business Dept' WHEN ("F_FCST_W"."region_org_en_name" IN ('Southern Pacific Device Business Dept', 'Southern Pacific Region')) THEN 'Southern Pacific Device Business Dept ' WHEN ("F_FCST_W"."region_org_en_name" IN ('Southern-East Asia Device Business Dept', 'Southern-East Asia Region')) THEN 'Southern-East Asia Device Business Dept ' WHEN ("F_FCST_W"."region_org_en_name" IN ('West European Device Business Dept', 'West European Region')) THEN 'West European Device Business Dept ' ELSE "F_FCST_W"."region_org_en_name" END) = ' Greater China Device Business Dept'))
GROUP BY 1

Two main changes were made:

1. The users_id filter was pushed down into the dm_iplan_user_auth_d subquery (where users_id = 'l00359001').

2. The predicate

  ( A.lv5_prod_list_code = b.prod_model or b.prod_model = 'all' ) and                                     
    ( A.lv4_prod_list_code = b.prod_family or b.prod_family = 'all' ) and                                  
    ( A.lv3_prod_list_code = b.prod_area or b.prod_area = 'all' ) and                                        
    ( A.lv1_prod_list_code = b.prod_line or b.prod_line = 'all' ) and                                        
( A.REGION_ORG_CODE = b.REGION_ORG_CODE or b.REGION_ORG_CODE = 'ALL' )

was replaced with:

    (CASE WHEN A.lv5_prod_list_code = b.prod_model THEN 1
WHEN b.prod_model = 'all' THEN 1 ELSE 2
END) = 1   and
                    (CASE WHEN  A.lv4_prod_list_code = b.prod_family THEN 1
WHEN b.prod_family = 'all' THEN 1 ELSE 2
END) = 1   and
 (CASE WHEN  A.lv3_prod_list_code = b.prod_area THEN 1
WHEN b.prod_area = 'all' THEN 1 ELSE 2
END) = 1   and
        (CASE WHEN  A.lv1_prod_list_code = b.prod_line THEN 1
WHEN b.prod_line = 'all' THEN 1 ELSE 2
END) = 1   and                           
       (CASE WHEN  A.REGION_ORG_CODE = b.REGION_ORG_CODE THEN 1
WHEN b.REGION_ORG_CODE = 'ALL' THEN 1 ELSE 2
END) = 1
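As a sanity check, the CASE form is logically equivalent to the original OR form (SQL NULL semantics aside). A minimal Python sketch with hypothetical code values verifies the equivalence exhaustively:

```python
# Original predicate:  (a_code = b_code OR b_code = 'all')
# Rewritten predicate: (CASE WHEN a_code = b_code THEN 1
#                            WHEN b_code = 'all' THEN 1 ELSE 2 END) = 1
def original(a_code, b_code):
    return a_code == b_code or b_code == 'all'

def rewritten(a_code, b_code):
    case_value = 1 if a_code == b_code else (1 if b_code == 'all' else 2)
    return case_value == 1

values = ['x', 'y', 'all']  # hypothetical code values
assert all(original(a, b) == rewritten(a, b) for a in values for b in values)
```

The results match, so the rewrite changes only how the optimizer plans the join (OR-ed join predicates tend to defeat pushdown and force larger scans, which matches the 232 GB vs. 89 GB difference seen in the logs), not the answer.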
vieyahn2017 commented 5 years ago

Monetizing knowledge: hundred-million-row data, second-level responses (integrating Presto with FI)
Date: 2017-09-20 16:20 · 1065 views · 14 replies

The approach I worked out

Today, FI Hive queries mainly go through beeline or spark-beeline, backed by MapReduce and Spark respectively. They can process TB- and PB-scale data, but the query latency is hard to bear: even simple queries take minutes. Then a great tool arrived: Presto. It gives the feel of a traditional database, with most queries returning in seconds. We'll leave the mechanics and internals for later; although Presto has been around for a year or two, integrating it with FI is still rarely documented.

On the Nebula big-data project team, after much exploration and constant googling, we finally integrated Presto successfully; it passed review and is now live in our production environment. Queries that took 20-40 s with Spark now finish in 1-2 s. Good things are worth sharing, so here is everything.

If this is useful to you, please give it a like.

My story:

Introducing Presto and integrating it with FI effectively improved performance.

See the attachment for the related resource downloads.

1. Install jdk1.8.0_144; Presto requires this JDK version (assumed install path: /data02/jdk1.8.0_144)

a. Set the environment variables:

export JAVA_HOME=/data02/jdk1.8.0_144

export PATH=$JAVA_HOME/bin:$PATH

2. Install the Presto coordinator

a. Unpack the server (assumed path: /data01/research/presto-server-0.184)

b. Edit etc/config.properties:

coordinator=true (this node is the coordinator; set to false on worker nodes)

node-scheduler.include-coordinator=true (the coordinator also acts as a worker; set to true for simple tests)

http-server.http.port=8700 (the HTTP port Presto will listen on)

query.max-memory=1GB

query.max-memory-per-node=500MB

discovery-server.enabled=true

discovery.uri=http://10.22.61.227:8700 (the coordinator's ip:port)

In etc/jvm.config, change:

-Djava.security.krb5.conf=/data02/research/nifi/keytab/krb5.conf (FI's krb5 file)

-Djava.security.auth.login.config=/data02/research/nifi/keytab/jaas.conf (FI's jaas.conf; see the downloadable attachment)

In etc/node.properties, change:

node.environment=sit (the environment name)

node.id=ffffffff-ffff-ffff-ffff-ffffffffffff (a unique id for this node)

node.data-dir=/data01/research/prestodata (the node's data directory)

In etc/catalog/hive.properties, configure the Hive connector:

connector.name=hive-hadoop2

hive.metastore.uri=thrift://10.22.60.245:21088,thrift://10.22.60.188:21088

hive.config.resources=/data01/FusionInsight/client/HDFS/hadoop/etc/hadoop/hdfs-site.xml,/data02/config/core-site.xml,/data01/FusionInsight/client/Hive/config/hive-site.xml

hive.metastore.client.principal=dmp_operator2@HADOOP.COM

hive.metastore.client.keytab=/data01/FusionInsight/user.keytab

hive.hdfs.authentication.type=KERBEROS

hive.hdfs.presto.keytab=/data01/FusionInsight/user.keytab

hive.hdfs.presto.principal=dmp_operator2@HADOOP.COM

hive.metastore.service.principal=hive/hadoop.hadoop.com@HADOOP.COM

hive.metastore.authentication.type=KERBEROS

Adjust the values above (Kerberos principals, keytab paths, and metastore URIs) to match your actual FI deployment.
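For orientation, the settings above map onto the standard layout of a Presto server installation (a sketch; the exact paths follow the stock presto-server distribution):

```text
etc/
├── config.properties   # coordinator flag, HTTP port, query memory, discovery.uri
├── jvm.config          # JVM flags, incl. -Djava.security.krb5.conf / jaas.conf
├── node.properties     # node.environment, node.id, node.data-dir
└── catalog/
    └── hive.properties # connector.name=hive-hadoop2 plus the Kerberos settings
```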

3. Start the coordinator

From the installation directory, run:

./launcher run

(Use this foreground mode for the first start so configuration problems are visible; once everything is clean, start it as a service:)

./launcher start

(runs as a daemon)

The coordinator's web UI is then reachable at a URL like:

http://10.22.61.227:8700/

4. Install yanagishima

a. Unpack it

b. Edit its configuration (yanagishima.properties):


presto.datasources=dev-presto,uat-presto

presto.coordinator.server.dev-presto=http://10.22.61.227:8700

presto.redirect.server.dev-presto=http://10.22.61.227:8700

presto.coordinator.server.uat-presto=http://10.63.53.126:8700

presto.redirect.server.uat-presto=http://10.63.53.126:8700

catalog.dev-presto=hive

schema.dev-presto=default

catalog.uat-presto=hive

schema.uat-presto=default

Adjust as needed; here two Presto environments (sit/uat) are configured.

Start the service with:

nohup bin/yanagishima-start.sh >y.log 2>&1 &

The yanagishima UI is then reachable at a URL like:

http://10.22.61.227:8701/

For usage details, see the references below.

Reference resources:

https://github.com/wyukawa/yanagishima

https://prestodb.io/docs/current/

vieyahn2017 commented 5 years ago

yanagishima/yanagishima https://github.com/yanagishima/yanagishima

Web UI for Presto, Hive, Elasticsearch, SparkSQL

vieyahn2017 commented 5 years ago

Presto query optimization

2018-01-17 09:53:25 · 叫我小名 · 5377 reads

Copyright notice: this is the author's original post; do not repost without permission. https://blog.csdn.net/FreeFishLy/article/details/79081764
Also published on Jianshu: https://www.jianshu.com/p/f435ce79c966

Presto is an open-source distributed SQL query engine suited to interactive analytic queries, supporting data volumes from GB to PB. Its query language follows ANSI SQL closely. I have used Presto for ad-hoc queries on several projects and collected the optimizations below.

1. Data storage

Partition sensibly. Like Hive, Presto uses table metadata to read only the relevant partitions; good partitioning reduces the amount of data Presto reads and improves query performance.

Use columnar storage. Presto's ORC reader is specifically optimized, so when creating tables in Hive for Presto, prefer the ORC format; Presto's ORC support is better than its Parquet support.

Use compression. Compression reduces the IO-bandwidth pressure of inter-node data transfer; for ad-hoc queries that need fast decompression, Snappy is recommended.

Pre-sort the data. For sorted data, ORC can skip reading unneeded data during the filtering stage, so pre-sort on frequently filtered columns:

INSERT INTO table nation_orc partition(p) SELECT * FROM nation SORT BY n_name;

If a query then filters on n_name, performance improves:

SELECT count(*) FROM nation_orc WHERE n_name='AUSTRALIA';
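The reason pre-sorting helps is ORC's per-stripe min/max statistics: with sorted data, most stripes fall entirely outside an equality filter and are skipped. A toy Python sketch of the idea (illustrative only, not Presto's actual ORC reader; the stripe values are hypothetical):

```python
# Hypothetical stripes of nation_orc, each carrying min/max stats for n_name.
stripes = [
    {"min": "ALGERIA", "max": "CANADA", "rows": 1000},
    {"min": "CHINA",   "max": "INDIA",  "rows": 1000},
    {"min": "IRAN",    "max": "PERU",   "rows": 1000},
]

def stripes_to_read(stripes, value):
    # An equality filter only needs stripes whose [min, max] range covers it.
    return [s for s in stripes if s["min"] <= value <= s["max"]]

print(len(stripes_to_read(stripes, "AUSTRALIA")))  # 1 (two of three stripes skipped)
```

On unsorted data the same value could appear in every stripe, so no stripe could be skipped.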

2. Query SQL optimization

Select only the columns you need. With columnar storage, reading only the required columns speeds up reads and reduces data volume; avoid reading all columns with *.

[GOOD]: SELECT time,user,host FROM tbl
[BAD]:  SELECT * FROM tbl

Always filter on the partition column. For partitioned tables, prefer partition columns in the WHERE clause. Here acct_day is the partition column and visit_time is the actual visit time:

[GOOD]: SELECT time,user,host FROM tbl where acct_day=20171101
[BAD]:  SELECT * FROM tbl where visit_time=20171101
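Partition pruning is the coarse-grained version of the same idea: the engine consults partition metadata and never opens non-matching partitions, while a filter on a non-partition column cannot prune anything. A toy sketch (paths are hypothetical):

```python
# Hypothetical partition metadata: acct_day -> storage path.
partitions = {"20171101": "/warehouse/tbl/acct_day=20171101",
              "20171102": "/warehouse/tbl/acct_day=20171102",
              "20171103": "/warehouse/tbl/acct_day=20171103"}

def partitions_to_scan(partitions, acct_day):
    # A filter on the partition column narrows the scan to matching partitions;
    # a filter on visit_time alone would force scanning all of them.
    return [path for day, path in partitions.items() if day == acct_day]

print(len(partitions_to_scan(partitions, "20171101")))  # 1 of 3 partitions scanned
```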

Optimize GROUP BY column order. Ordering the GROUP BY columns by their distinct-value counts, in descending order, gives a modest performance improvement.

[GOOD]: SELECT count(*) FROM tbl GROUP BY uid, gender
[BAD]:  SELECT count(*) FROM tbl GROUP BY gender, uid

Use LIMIT with ORDER BY. ORDER BY ships the data to a single worker for sorting, which demands a lot of memory on that worker. For Top-N or Bottom-N queries, adding LIMIT reduces both the sorting cost and the memory pressure.

[GOOD]: SELECT * FROM tbl ORDER BY time LIMIT 100
[BAD]:  SELECT * FROM tbl ORDER BY time
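The reason LIMIT helps is that a Top-N can be maintained in a bounded heap instead of fully sorting all rows on one worker. A small sketch of the difference (synthetic data; illustrative, not Presto's operator):

```python
import heapq

rows = [(i * 37) % 1000 for i in range(10000)]  # synthetic unsorted values

# ORDER BY ... LIMIT 100: a bounded structure of at most 100 elements suffices.
top100 = heapq.nsmallest(100, rows)

# ORDER BY ... without LIMIT: everything must be materialized and sorted.
full_sort = sorted(rows)

assert top100 == full_sort[:100]  # same answer, far less memory for the Top-N
```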

Use approximate aggregate functions. For queries that can tolerate small errors, Presto's approximate aggregates give a large performance boost; for example, approx_distinct() has roughly a 2.3% standard error compared with COUNT(DISTINCT x).

SELECT approx_distinct(user_id) FROM access
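Presto's approx_distinct() is built on a HyperLogLog sketch. The flavor of the trade-off — a small fixed-size sketch instead of a huge hash set — can be seen with an even simpler estimator, a k-minimum-values sketch (illustrative only; this is not Presto's algorithm):

```python
import hashlib

def kmv_approx_distinct(items, k=1024):
    """Estimate a distinct count from the k smallest hash values (KMV sketch)."""
    hashes = sorted({
        int(hashlib.md5(str(x).encode()).hexdigest(), 16) / 2**128
        for x in items
    })
    if len(hashes) < k:
        return len(hashes)               # small sets are counted exactly
    return int((k - 1) / hashes[k - 1])  # k uniform hashes span roughly k/n of [0, 1)

exact = 50000
estimate = kmv_approx_distinct(range(exact))
# The estimate lands within a few percent of 50000 while storing only 1024 hashes.
```

The speedup in Presto comes from the same property: the sketch has constant size, so it avoids materializing every distinct value.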

Replace chains of LIKE clauses with regexp_like. Presto's query optimizer does not merge multiple LIKE clauses; using regexp_like instead gives a large performance improvement.

[GOOD]
SELECT
  ...
FROM
  access
WHERE
  regexp_like(method, 'GET|POST|PUT|DELETE')

[BAD]
SELECT
  ...
FROM
  access
WHERE
  method LIKE '%GET%' OR
  method LIKE '%POST%' OR
  method LIKE '%PUT%' OR
  method LIKE '%DELETE%'
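The two forms are equivalent: a single alternation regex matches exactly when any of the substring patterns matches. A quick Python check with re (whose search semantics mirror regexp_like's partial matching) on sample values:

```python
import re

methods = ["GET /a", "POST /b", "HEAD /c", "DELETE /d"]  # sample values

# One regex with alternation (the regexp_like rewrite)...
combined = [bool(re.search(r"GET|POST|PUT|DELETE", m)) for m in methods]
# ...versus four separate substring tests (the chain of LIKE '%...%' clauses).
separate = [any(s in m for s in ("GET", "POST", "PUT", "DELETE")) for m in methods]

assert combined == separate == [True, True, False, True]
```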

Put the large table on the left of a JOIN. Presto's default join algorithm is a broadcast join: the left table is split across workers, and the entire right table is copied to every worker for the computation. If the right table is too large, the query can fail with an out-of-memory error.

[GOOD] SELECT ... FROM large_table l join small_table s on l.id = s.id
[BAD] SELECT ... FROM small_table s join large_table l on l.id = s.id
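The memory asymmetry is easy to see in a sketch of a broadcast hash join: every worker materializes the entire right-hand (build) table as a hash map, while the left-hand (probe) table merely streams through. Illustrative Python, not Presto's actual operator:

```python
def broadcast_hash_join(left_rows, right_rows, key):
    # Build phase: the whole right table becomes an in-memory hash table --
    # this is the copy every worker holds after the broadcast.
    build = {}
    for row in right_rows:
        build.setdefault(row[key], []).append(row)
    # Probe phase: left rows stream through without being materialized.
    for row in left_rows:
        for match in build.get(row[key], []):
            yield {**row, **match}

large_table = [{"id": i, "v": i * 2} for i in range(1000)]  # streamed (left)
small_table = [{"id": i, "n": f"n{i}"} for i in range(10)]  # broadcast (right)
joined = list(broadcast_hash_join(large_table, small_table, "id"))
print(len(joined))  # 10 matching rows; memory scales with the right table only
```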

Use rank() instead of row_number() for Top-N. In grouped-sort scenarios, rank() performs better:

[GOOD]
SELECT checksum(rnk)
FROM (
  SELECT rank() OVER (PARTITION BY l_orderkey, l_partkey ORDER BY l_shipdate DESC) AS rnk
  FROM lineitem
) t
WHERE rnk = 1

[BAD]
SELECT checksum(rnk)
FROM (
  SELECT row_number() OVER (PARTITION BY l_orderkey, l_partkey ORDER BY l_shipdate DESC) AS rnk
  FROM lineitem
) t
WHERE rnk = 1

3. Seamlessly replacing a Hive table

If an existing Hive table does not yet use ORC and Snappy, how do you swap it in without affecting the applications on it? Take a Hive table like this:

CREATE TABLE bdc_dm.res_category(
channel_id1 int comment 'level-1 channel id',
province string COMMENT 'province',
city string comment 'city', 
uv int comment 'uv'
)
comment 'example'
partitioned by (landing_date int COMMENT 'date: yyyymmdd')
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY ',' MAP KEYS TERMINATED BY ':' LINES TERMINATED BY '\n';

Create the corresponding ORC table:

CREATE TABLE bdc_dm.res_category_orc(
channel_id1 int comment 'level-1 channel id',
province string COMMENT 'province',
city string comment 'city', 
uv int comment 'uv'
)
comment 'example'
partitioned by (landing_date int COMMENT 'date: yyyymmdd')
row format delimited fields terminated by '\t'
stored as orc 
TBLPROPERTIES ("orc.compress"="SNAPPY");

First load the data into the ORC table, then swap the table names:

insert overwrite table bdc_dm.res_category_orc partition(landing_date)
select * from bdc_dm.res_category where landing_date >= 20171001;

ALTER TABLE bdc_dm.res_category RENAME TO bdc_dm.res_category_tmp;
ALTER TABLE bdc_dm.res_category_orc RENAME TO bdc_dm.res_category;

Here res_category_tmp serves as a backup table; once production has run without problems for a while, it can be dropped.

Notes: ORC and Parquet are both columnar formats, but Presto supports ORC better (while Impala supports Parquet better). Columnar files are binary, so tables whose columns are frequently added or dropped should avoid columnar storage (rewriting file metadata is expensive). In a layered warehouse, avoid ORC in the dwd layer but prefer it in the dm layer. If you run into any problems using Presto and Hive, feel free to leave me a message!