Function() OVER ([PARTITION BY <column1, column2..>] [ORDER BY <column3..>] [window_clause])
Functions
Aggregate Functions
COUNT
SUM
MIN
MAX
AVG
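Ranking Functions
ROW_NUMBER(): Numbers the rows sequentially, with no repeated values. Typical uses: (1) pagination; (2) partitioning the data and taking the top N rows of each partition.
RANK(): Rank of the row within its partition, with gaps. Two rows tied for second place are followed by fourth place; ties leave holes in the sequence.
DENSE_RANK(): Rank of the row within its partition, without gaps. Two rows tied for second place are still followed by third place; ties leave no holes.
PERCENT_RANK(): (RANK of the current row within its partition - 1) / (total number of rows in the partition - 1).
NTILE(n): Splits the ordered rows of the partition into n buckets and returns the bucket number of the current row.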
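The difference between these numbering schemes is easiest to see on a small tied data set. The following is a minimal plain-Java sketch of the semantics only; it is not how Spark evaluates these functions, and the class name `RankingSketch` and the sample salaries are made up for illustration. The input is assumed to be a single partition that is already sorted by the ORDER BY column.

```java
import java.util.Arrays;

/** Plain-Java illustration of ranking semantics; not Spark's implementation. */
final class RankingSketch {

    /** ROW_NUMBER(): 1, 2, 3, ... by position, ties get distinct numbers. */
    static int[] rowNumber(int[] sorted) {
        int[] out = new int[sorted.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = i + 1;
        }
        return out;
    }

    /** RANK(): tied values share a rank, the next distinct value skips ahead. */
    static int[] rank(int[] sorted) {
        int[] out = new int[sorted.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = (i > 0 && sorted[i] == sorted[i - 1]) ? out[i - 1] : i + 1;
        }
        return out;
    }

    /** DENSE_RANK(): tied values share a rank, and no rank is skipped. */
    static int[] denseRank(int[] sorted) {
        int[] out = new int[sorted.length];
        for (int i = 0; i < out.length; i++) {
            if (i == 0) {
                out[i] = 1;
            } else {
                out[i] = (sorted[i] == sorted[i - 1]) ? out[i - 1] : out[i - 1] + 1;
            }
        }
        return out;
    }

    /** PERCENT_RANK(): (rank - 1) / (rows - 1); 0.0 for a one-row partition. */
    static double[] percentRank(int[] sorted) {
        int[] r = rank(sorted);
        double[] out = new double[sorted.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = sorted.length == 1 ? 0.0 : (r[i] - 1) / (double) (sorted.length - 1);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] salaries = {5200, 4800, 4800, 4200}; // hypothetical, sorted DESC
        System.out.println(Arrays.toString(rowNumber(salaries))); // [1, 2, 3, 4]
        System.out.println(Arrays.toString(rank(salaries)));      // [1, 2, 2, 4]
        System.out.println(Arrays.toString(denseRank(salaries))); // [1, 2, 2, 3]
        System.out.println(Arrays.toString(percentRank(salaries)));
    }
}
```

The two tied salaries share rank 2 in both RANK and DENSE_RANK; only the row that follows them differs (4 versus 3).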
# [PARTITION BY col1] is optional; without it the rows are not partitioned and the whole result set is treated as one group
ROW_NUMBER() OVER([PARTITION BY col1] ORDER BY col2)
RANK() OVER([PARTITION BY col1] ORDER BY col2)
DENSE_RANK() OVER([PARTITION BY col1] ORDER BY col2)
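The "top N rows per group" use of ROW_NUMBER() mentioned earlier amounts to numbering each partition in sort order and keeping the rows whose number is at most N. The sketch below mimics that in plain Java, assuming the rows have already been grouped by key (the PARTITION BY step); `TopNSketch`, the department names, and the salaries are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of "ROW_NUMBER() ... <= n" per partition. */
final class TopNSketch {

    /**
     * Keeps the n largest values of every group, mirroring a filter on
     * ROW_NUMBER() OVER (PARTITION BY key ORDER BY value DESC) <= n.
     */
    static Map<String, List<Integer>> topNPerGroup(Map<String, List<Integer>> groups, int n) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            List<Integer> sorted = new ArrayList<>(e.getValue());
            sorted.sort(Comparator.<Integer>reverseOrder()); // ORDER BY value DESC
            // Row numbers run 1..size, so "rowNo <= n" is simply the first n elements.
            int keep = Math.min(n, sorted.size());
            result.put(e.getKey(), new ArrayList<>(sorted.subList(0, keep)));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> salariesByDept = new LinkedHashMap<>();
        salariesByDept.put("develop", Arrays.asList(4200, 5200, 4500, 6000));
        salariesByDept.put("sales", Arrays.asList(5000, 4800));
        System.out.println(topNPerGroup(salariesByDept, 2));
        // {develop=[6000, 5200], sales=[5000, 4800]}
    }
}
```

In SQL the same filter has to sit in an outer query, because a window function cannot be referenced in the WHERE clause of the query that computes it.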
When a ranking function sorts in descending order, NULL values are placed last by default
# NULLS LAST
ORDER BY salary DESC NULLS LAST
When a ranking function sorts in ascending order, NULL values are placed first by default
# NULLS FIRST
ORDER BY salary ASC NULLS FIRST
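The same two orderings can be reproduced outside SQL with the JDK's null-aware comparators, which makes the placement of NULLs easy to check by hand. This is only an analogy in plain Java, not Spark code; `NullOrderingSketch` and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Plain-Java analogy for NULLS FIRST / NULLS LAST ordering. */
final class NullOrderingSketch {

    /** Analogous to: ORDER BY salary ASC NULLS FIRST */
    static List<Integer> ascNullsFirst(List<Integer> values) {
        List<Integer> out = new ArrayList<>(values);
        out.sort(Comparator.nullsFirst(Comparator.<Integer>naturalOrder()));
        return out;
    }

    /** Analogous to: ORDER BY salary DESC NULLS LAST */
    static List<Integer> descNullsLast(List<Integer> values) {
        List<Integer> out = new ArrayList<>(values);
        out.sort(Comparator.nullsLast(Comparator.<Integer>reverseOrder()));
        return out;
    }

    public static void main(String[] args) {
        List<Integer> salaries = Arrays.asList(4800, null, 5200, 4200);
        System.out.println(ascNullsFirst(salaries)); // [null, 4200, 4800, 5200]
        System.out.println(descNullsLast(salaries)); // [5200, 4800, 4200, null]
    }
}
```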
Ranking functions example
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Assumes `spark` is an existing SparkSession and `empsalary` is a registered table or temp view.
Dataset<Row> sqlDF = spark
    .sql(
        "SELECT " +
        "depname, " +
        "salary, " +
        "rank() OVER (PARTITION BY depname ORDER BY salary DESC) as rank, " +
        "dense_rank() OVER (PARTITION BY depname ORDER BY salary DESC) as dense_rank, " +
        "percent_rank() OVER (PARTITION BY depname ORDER BY salary DESC) as percent_rank, " +
        "row_number() OVER (PARTITION BY depname ORDER BY salary DESC) as rowNo " +
        "FROM empsalary"
    );
sqlDF.show();
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
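PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no limit (the start or the end of the partition)
UNBOUNDED PRECEDING: starting from the first row of the partition
UNBOUNDED FOLLOWING: ending at the last row of the partition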
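To make the frame boundaries concrete, the sketch below evaluates a sum over a ROWS frame for every row of one ordered partition. It is a plain-Java illustration of the frame arithmetic, not Spark's implementation; `RowsFrameSketch`, the `UNBOUNDED` sentinel, and the sample values are hypothetical names chosen for this example, and an offset of 0 stands for CURRENT ROW.

```java
import java.util.Arrays;

/** Plain-Java illustration of SUM over "ROWS BETWEEN p PRECEDING AND f FOLLOWING". */
final class RowsFrameSketch {

    /** Hypothetical sentinel meaning UNBOUNDED PRECEDING / UNBOUNDED FOLLOWING. */
    static final int UNBOUNDED = Integer.MAX_VALUE;

    /**
     * For each row i, sums the rows from i - preceding to i + following,
     * clipped to the partition. An offset of 0 means CURRENT ROW.
     */
    static long[] sumOverRows(int[] ordered, int preceding, int following) {
        int n = ordered.length;
        long[] out = new long[n];
        for (int i = 0; i < n; i++) {
            int start = (preceding == UNBOUNDED) ? 0 : Math.max(0, i - preceding);
            int end = (following == UNBOUNDED)
                    ? n - 1
                    : (int) Math.min(n - 1L, (long) i + following);
            long sum = 0;
            for (int j = start; j <= end; j++) {
                sum += ordered[j];
            }
            out[i] = sum;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] values = {10, 20, 30, 40}; // hypothetical partition, already ordered
        // ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
        System.out.println(Arrays.toString(sumOverRows(values, 1, 1)));         // [30, 60, 90, 70]
        // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        System.out.println(Arrays.toString(sumOverRows(values, UNBOUNDED, 0))); // [10, 30, 60, 100]
        // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
        System.out.println(Arrays.toString(sumOverRows(values, 0, UNBOUNDED))); // [100, 90, 70, 40]
    }
}
```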
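Difference between ROWS and RANGE
ROWS: The frame is a fixed number of rows before and after the current row, counted in the order given by the window's ORDER BY clause (the frame is defined physically, by row count).
RANGE: The frame depends on the ORDER BY value, so rows that tie with the current row on the ORDER BY expression fall into the frame together (the frame is defined logically, by value).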
Using sum as an example
Dataset<Row> sqlDF = spark
    .sql(
        "SELECT " +
        "depname, " +
        "salary, " +
        // RANGE frame (the default when only ORDER BY is given): from the partition start through the current row and every row that ties with it on salary
        "sum(salary) OVER (PARTITION BY depname ORDER BY salary ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) salary_1, " +
        // ROWS frame: from the partition start through the current row only
        "sum(salary) OVER (PARTITION BY depname ORDER BY salary ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) salary_2, " +
        // No ORDER BY: every value in the partition is summed, equivalent to sum(salary) OVER (PARTITION BY depname ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
        "sum(salary) OVER (PARTITION BY depname) as salary_3 " +
        "FROM empsalary"
    );
sqlDF.show();
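The two running sums in the query only differ when the ORDER BY column contains ties. The plain-Java sketch below reproduces both frames for a single partition that is already sorted ascending; it illustrates the semantics and is not Spark code, and `RowsVsRangeSketch` and the sample salaries are hypothetical.

```java
import java.util.Arrays;

/** Plain-Java illustration of ROWS vs RANGE for a running sum. */
final class RowsVsRangeSketch {

    /** ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: stops at the current row. */
    static long[] runningSumRows(int[] orderedAsc) {
        long[] out = new long[orderedAsc.length];
        long sum = 0;
        for (int i = 0; i < orderedAsc.length; i++) {
            sum += orderedAsc[i];
            out[i] = sum;
        }
        return out;
    }

    /** RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: also includes all tied rows. */
    static long[] runningSumRange(int[] orderedAsc) {
        int n = orderedAsc.length;
        long[] out = new long[n];
        long sum = 0;
        int i = 0;
        while (i < n) {
            // Find the run of rows [i, j) that share the same ORDER BY value.
            int j = i;
            while (j < n && orderedAsc[j] == orderedAsc[i]) {
                sum += orderedAsc[j];
                j++;
            }
            // Every row in the run sees the sum through the end of the run.
            for (int k = i; k < j; k++) {
                out[k] = sum;
            }
            i = j;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] salaries = {3900, 4200, 4200, 5200}; // hypothetical, sorted ASC
        System.out.println(Arrays.toString(runningSumRows(salaries)));  // [3900, 8100, 12300, 17500]
        System.out.println(Arrays.toString(runningSumRange(salaries))); // [3900, 12300, 12300, 17500]
    }
}
```

With ROWS, the two employees earning 4200 get different totals, and which one gets 8100 depends on an arbitrary tie order. With RANGE, both get 12300.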