4. Aggregation - Githubissues

mind1949 commented 5 years ago

Intro

Aggregation is one of those capabilities that really make you appreciate the power of relational database systems. It allows you to move beyond merely persisting your data, into the realm of asking truly interesting questions that can be used to inform decision making. This category covers aggregation at length, making use of standard grouping as well as more recent window functions. If you struggle with these questions, I strongly recommend Learning SQL, by Alan Beaulieu and SQL Cookbook by Anthony Molinaro. In fact, get the latter anyway - it'll take you beyond anything you find on this site, and on multiple different database systems to boot.

Q1: count the number of facilities

For our first foray into aggregates, we're going to stick to something simple. We want to know how many facilities exist - simply produce a total count.

select count(*) as count from cd.facilities;

Q2: count the number of enpensive faciliteis

Produce a count of the number of facilities that have a cost to guests of 10 or more.

select count(*)
    from cd.facilities
    where guestcost >= 10;

Q3: Count the number of recommendations each member makes.

Produce a count of the number of recommendations each member has made. Order by member ID.

select recommendedby, count(*)
    from cd.members
    where recommendedby is not null
    group by recommendedby
order by recommendedby;

--or--
select recommendedby, count(*)
    from cd.members
    where not recommendedby is null
    group by recommendedby
order by recommendedby;

Q4: list the total slots booked per facility

Produce a list of the total number of slots booked per facility. For now, just produce an output table consisting of facility id and slots, sorted by facility id.

select facid, sum(slots) as "TotalSlot"
    from cd.bookings
    group by facid
order by facid;

Q5: list the total slots booked per facility in a given month

Produce a list of the total number of slots booked per facility in the month of September 2012. Produce an output table consisting of facility id and slots, sorted by the number of slots.

select facid, sum(slots) as "Total Slots"
    from cd.bookings
    where starttime between '2012-09-01'  and '2012-10-1'
    group by facid -- group by必须放在被过滤后的表格后面
order by "Total Slots";

Q6: list total slots booked per facility per month

Produce a list of the total number of slots booked per facility per month in the year of 2012. Produce an output table consisting of facility id and slots, sorted by the id and month.

-- extract(month from starttime) -- 

select facid, extract(month from starttime) as month, sum(slots) as "Total Slots"
    from cd.bookings
    where starttime between '2012-01-01' and '2013-01-01'
    group by facid, month
order by facid, month;

解析

这里因为要select facid, 所以用group by分组的时候也要用facid分组,否则会报错,原因是: group by就是将一个数据集划分为若干个小区域, 然后对若干个小区域进行数据处理; select中出现的字段必须要么是在group by中作为分组的依据,要么被包含在聚集函数内部

Q7: find the count of members who have made at least one booking

Find the total number of members who have made at least one booking.

select count(distinct memid)
    from cd.bookings
    where memid is not null;

Q8: lists facilities with more than 1000 slots booked

Produce a list of facilities with more than 1000 slots booked. Produce an output table consisting of facility id and hours, sorted by facility id.

select facid, sum(slots) as "Total Slots"
    from cd.bookings
    group by facid
    having sum(slots) > 1000
order by facid;

解析这里使用having而不能使用where是因为:

Where 是一个约束声明，使用Where约束来自数据库的数据，Where是在结果返回之前起作用的，Where中不能使用聚合函数。 Having是一个过滤声明，是在查询返回结果集以后对查询结果进行的过滤操作，在Having中可以使用聚合函数。在查询过程中聚合语句(sum,min,max,avg,count)要比having子句优先执行。而where子句在查询过程中执行优先级高于聚合语句。

Q9: find the total revenue of each facility

Produce a list of facilities along with their total revenue. The output table should consist of facility name and revenue, sorted by revenue. Remember that there's a different cost for guests and members!

select f.name,
    sum(b.slots * case
            when b.memid = 0 then f.guestcost
            else f.membercost
       end) as revenue
    from cd.facilities as f
        inner join cd.bookings as b
            on f.facid = b.facid
    group by f.name
order by revenue;

sql语句的执行顺序

不了解这个那么在写一个sql语句时,就不能有一个清晰的逻辑. 就不能理解为什么having能够执行聚集函数而where不行,不能理解为什么若有group by的话,为什么select中的列必须是group by指定的,或者是在聚集函数中的 SQL SELECT语句的执行顺序： from子句组装来自不同数据源的数据； where子句基于指定的条件对记录行进行筛选； group by子句将数据划分为多个分组；使用聚集函数进行计算；使用having子句筛选分组；计算所有的表达式；使用order by对结果集进行排序； select 集合输出。

Q10: find a facility with total revenue less than 1000

Produce a list of facilities with a total revenue less than 1000. Produce an output table consisting of facility name and revenue, sorted by revenue. Remember that there's a different cost for guests and members!

select name, revenue
    from(select f.name as name,
            sum(b.slots*case
                when b.memid = 0 then f.guestcost
                else f.membercost
               end) as revenue
            from cd.facilities as f
            inner join cd.bookings as b
                on f.facid = b.facid
            group by f.name
        ) as revenues
    where revenue < 1000
order by revenue;

Q11: outpu the facility with the hieghtest number of slots booked

Output the facility id that has the highest number of slots booked. For bonus points, try a version without a LIMIT clause. This version will probably look messy!


select facid, sum(slots) as "Total Slots"
            from cd.bookings
            group by facid
order by "Total Slots" 
limit 1;

--使用通用表达式书写--
with sum as (select facid, sum(slots) as totalslots
    from cd.bookings
    group by facid
)
select facid, totalslots 
    from sum
    where totalslots = (select max(totalslots) from sum);

Q12: List the total slots booked per facility per month, part 2

Produce a list of the total number of slots booked per facility per month in the year of 2012. In this version, include output rows containing totals for all months per facility, and a total for all months for all facilities. The output table should consist of facility id, month and slots, sorted by the id and month. When calculating the aggregated values for all months and all facids, return null values in the month and facid columns.

--原始的写法
with bookings as (select facid, extract(month from starttime) as month, slots
                    from cd.bookings
                    where starttime between '2012-01-01' and '2013-01-01'
                )
select facid, month, sum(slots) as slots
    from bookings
    group by facid, month
union
select facid, null as month, sum(slots)
    from bookings
    group by facid
union
select null as facid, null as month, sum(slots)
    from bookings
order by facid, month

-- 使用rollup函数简化
select facid, extract(month from starttime) as month, sum(slots) as slots
    from cd.bookings
    where
        starttime >= '2012-01-01'
        and starttime < '2013-01-01'
    group by rollup(facid, month)
order by facid, month;

Q13: list the total hours booked per named facility

Produce a list of the total number of hours booked per facility, remembering that a slot lasts half an hour. The output table should consist of the facility id, name, and hours booked, sorted by facility id. Try formatting the hours to two decimal places.

--使用格式化函数格式化浮点型数据
--pg里没有直接保留几位有效数字的功能,只能通过转换为字符串进行格式化)
select b.facid, f.name, trim(to_char(sum(b.slots*0.5), '999D99')) as "Total Hours"
    from cd.bookings as b
    inner join cd.facilities as f
        on b.facid = f.facid
    group by b.facid, f.name
order by b.facid

Q14: list each member's first booking after septmember 1st 2012

Produce a list of each member name, id, and their first booking after September 1st 2012. Order by member ID.

select m.surname, m.firstname, m.memid, min(b.starttime)
    from cd.members as m
    inner join cd.bookings as b
        on m.memid = b.memid
    where b.starttime > '2012-09-01'
    group by m.surname, m.firstname, m.memid
order by m.memid;

Q15: produce a list of member name, with each row containing the total member count

Produce a list of member names, with each row containing the total member count. Order by join date.

--普通模式
select (select count(*) from cd.members) as count,
    firstname, surname
    from cd.members
order by joindate;

--使用窗口函数贱货操作
select count(*) over(), firstname, surname
    from cd.members
order by joindate

Q16: produce a numberd list of members

Produce a monotonically increasing numbered list of members, ordered by their date of joining. Remember that member IDs are not guaranteed to be sequential.

select row_number() over(order by joindate), firstname, surname
    from cd.members
order by joindate

解析

row_number用于为窗口函数排序后的集合添加序号,序号是递增不重复的 rank负责为窗口函数排序后的集合添加排名,不同于row_num,rank需要考虑被排序的那一栏是否相等

Q17: output the facility id that has highest number of slots booked

Output the facility id that has the highest number of slots booked. Ensure that in the event of a tie, all tieing results get output.

--不建议
select facid, sum(slots) as total
    from cd.bookings
order by total
limit 1;

--使用ranke函数
select facid, total
    from (
        select facid, sum(slots) as total, rank() over(order by sum(slots) desc) as rank
            from cd.bookings
            group by facid
    ) as bookings
    where rank = 1;

Q18: rank members by hours(rounded) used

Produce a list of members, along with the number of hours they've booked in facilities, rounded to the nearest ten hours. Rank them by this rounded figure, producing output of first name, surname, rounded hours, rank. Sort by rank, surname, and first name.


select m.firstname,
    m.surname,
    round(sum(b.slots)/2.0/10)*10 as hours,
    rank() over(order by round(sum(b.slots)/2.0/10)*10 desc) as rank
    from cd.members as m
        inner join cd.bookings as b
            on m.memid = b.memid
    group by m.firstname, m.surname
order by rank, surname, firstname

Q19: find the top three revenue generating facilities

Produce a list of the top three revenue generating facilities (including ties). Output facility name and rank, sorted by rank and facility name.

with revenues as (select f.name,
        sum(case
        when b.memid = 0 then
            b.slots*f.guestcost
        else
            b.slots*f.membercost
        end) as revenue
        from cd.facilities as f
        inner join cd.bookings as b
            on f.facid = b.facid
        group by f.name)
select name, rank
    from(select name, rank() over(order by revenue desc) as rank
        from revenues) as ranks
    where rank in (1, 2, 3)
order by rank, name

Q20: classify facilities by value

Classify facilities into equally sized groups of high, average, and low based on their revenue. Order by classification and facility name.

select name, case when class=1 then 'high'
        when class=2 then 'average'
        else 'low'
        end revenue
    from (
        select facs.name as name, ntile(3) over (order by sum(case
                when memid = 0 then slots * facs.guestcost
                else slots * membercost
            end) desc) as class
        from cd.bookings bks
        inner join cd.facilities facs
            on bks.facid = facs.facid
        group by facs.name
    ) as subq
order by class, name;

解析

使用ntile进行分类

Q21: calculate the payback for time each facility

Based on the 3 complete months of data so far, calculate the amount of time each facility will take to repay its cost of ownership. Remember to take into account ongoing monthly maintenance. Output facility name and payback time in months, order by facility name. Don't worry about differences in month lengths, we're only looking for a rough value here!

select  facs.name as name,
    facs.initialoutlay/((sum(case
            when memid = 0 then slots * facs.guestcost
            else slots * membercost
        end)/3) - facs.monthlymaintenance) as months
    from cd.bookings bks
    inner join cd.facilities facs
        on bks.facid = facs.facid
    group by facs.facid
order by name;

Q22: calculate a rolling average of total revenue

For each day in August 2012, calculate a rolling average of total revenue over the previous 15 days. Output should contain date and revenue columns, sorted by the date. Remember to account for the possibility of a day having zero revenue. This one's a bit tough, so don't be afraid to check out the hint!

--这个答案不是我自己想的, 我只是理解了参考答案的思路
select  dategen.date,
    (
        -- correlated subquery that, for each day fed into it,
        -- finds the average revenue for the last 15 days
        select sum(case
            when memid = 0 then slots * facs.guestcost
            else slots * membercost
        end) as rev

        from cd.bookings bks
        inner join cd.facilities facs
            on bks.facid = facs.facid
        where bks.starttime > dategen.date - interval '14 days'
            and bks.starttime < dategen.date + interval '1 day'
    )/15 as revenue
    from
    (
        -- generates a list of days in august
        select  cast(generate_series(timestamp '2012-08-01',
            '2012-08-31','1 day') as date) as date
    )  as dategen
order by dategen.date;

mind1949 commented 5 years ago

having 与 where 的区别是什么? 什么时候用having? 什么时候用where?

mind1949 commented 5 years ago

group by 是分组,但是跟具体来说它做了什么? 如果不能像理解select/where/inner join/_ outer join那样理解一个sql语句执行过程中group by在什么顺序执行了什么操作,就不能清晰地理解.清晰的使用

mind1949 commented 5 years ago

having 与 where 的区别是什么? 什么时候用having? 什么时候用where?

Where 是一个约束声明，使用Where约束来自数据库的数据，Where是在结果返回之前起作用的，Where中不能使用聚合函数。 Having是一个过滤声明，是在查询返回结果集以后对查询结果进行的过滤操作，在Having中可以使用聚合函数。在查询过程中聚合语句(sum,min,max,avg,count)要比having子句优先执行。而where子句在查询过程中执行优先级高于聚合语句。

mind1949 commented 5 years ago

group by 是分组,但是跟具体来说它做了什么? 如果不能像理解select/where/inner join/_ outer join那样理解一个sql语句执行过程中group by在什么顺序执行了什么操作,就不能清晰地理解.清晰的使用

group by就是将一个数据集划分为若干个小区域, 然后对若干个小区域进行数据处理; select中出现的字段必须要么是在group by中作为分组的依据,要么被包含在聚集函数内部

mind1949 commented 5 years ago

sql执行顺序是什么? 不了解这个那么在写一个sql语句时,就不能有一个清晰的逻辑. 就不能理解为什么having能够执行聚集函数而where不行,不能理解为什么若有group by的话,为什么select中的列必须是group by指定的,或者是在聚集函数中的

mind1949 commented 5 years ago

sql执行顺序是什么? 不了解这个那么在写一个sql语句时,就不能有一个清晰的逻辑. 就不能理解为什么having能够执行聚集函数而where不行,不能理解为什么若有group by的话,为什么select中的列必须是group by指定的,或者是在聚集函数中的

SQL SELECT语句的执行顺序：

from子句组装来自不同数据源的数据；
where子句基于指定的条件对记录行进行筛选；
group by子句将数据划分为多个分组；
使用聚集函数进行计算；
使用having子句筛选分组；
计算所有的表达式；
使用order by对结果集进行排序；
select 集合输出。

mind1949 commented 5 years ago

group by的 rollup 与cube和grouping set是什么?有什么作用?

mind1949 commented 5 years ago

什么是窗口函数?

概念的理解一定要清晰,不能大而化之

mind1949 commented 5 years ago

什么是窗口函数?

概念的理解一定要清晰,不能大而化之

聚合函数的作用于由 GROUP BY 子句聚合的组，而窗口函数则作用于一个窗口，这里，窗口是由一个 OVER 子句定义的多行记录. 聚合函数也可以配合over子句作为窗口函数使用

mind1949 commented 5 years ago

第四章练习题完成!

mind1949 commented 5 years ago

加油!

mind1949 / pgexercises

4. Aggregation #4

Intro

Q1: count the number of facilities

Q2: count the number of enpensive faciliteis

Q3: Count the number of recommendations each member makes.

Q4: list the total slots booked per facility

Q5: list the total slots booked per facility in a given month

Q6: list total slots booked per facility per month

Q7: find the count of members who have made at least one booking

Q8: lists facilities with more than 1000 slots booked

Q9: find the total revenue of each facility

Q10: find a facility with total revenue less than 1000

Q11: outpu the facility with the hieghtest number of slots booked

Q12: List the total slots booked per facility per month, part 2

Q13: list the total hours booked per named facility

Q14: list each member's first booking after septmember 1st 2012

Q15: produce a list of member name, with each row containing the total member count

Q16: produce a numberd list of members

Q17: output the facility id that has highest number of slots booked

Q18: rank members by hours(rounded) used

Q19: find the top three revenue generating facilities

Q20: classify facilities by value

Q21: calculate the payback for time each facility

Q22: calculate a rolling average of total revenue