gvwilson / sql-tutorial

The Querynomicon: An Introduction to SQL for Wary Data Scientists
https://gvwilson.github.io/sql-tutorial/
Other
418 stars 35 forks source link

progress might be miscalculated? #38

Closed 2x2xplz closed 5 months ago

2x2xplz commented 5 months ago

for query 67, the progress column advances linearly, every row adds exactly 0.0769 to the previous row. If you are counting progress as the number of months which have passed, then this is correct, however, given that it's next to the num_done column it appears incorrect. I think most readers would expect "progress" to reflect the % of experiments completed out of the total (50).

I'd suggest refactoring the query to add an extra column and to re-label the columns to clarify the intent of each:

with ym_num as (
    select
        strftime('%Y-%m', started) as ym,
        count(*) as num
    from experiment
    group by ym
)

select
    ym,
    num,
    sum(num) over (order by ym) as num_done,
    (sum(num) over (order by ym) * 1.00) / (select sum(num) from ym_num) as completed_progress, 
    cume_dist() over (order by ym) as linear_progress
from ym_num

Now, for example, the first row's completed_progress is 4%, representing 2 out of 50 experiments completed.