Closed cdolfi closed 8 months ago
@cdolfi wait what did the error mean, and why does this fix it? I'm kinda lost without doing a deep dive on the problem
@JamesKunstle So pretty much the error you noted happened bc there is no packages that are out of date so the set intervals used before break. I could (and started with) solving for the direct case where all of the packages are up to date, but I realized there was multiple different cases where this would break. Pretty much any scenario where there are no packages more than a year out of date. Lmk if its still confusing, bc its a little abstract/complicated
@cdolfi Ah okay, I'll look again with this understanding
@cdolfi I think the cleanest way to do this is actually in SQL to avoid Pandas-related bugs. lmk what you think
with last_collections_cte as (
-- cte: For each repo, calculate the most recent collection date.
select
rdl.repo_id,
max(rdl.data_collection_date) as latest_collection_date
from
augur_data.repo_deps_libyear rdl
group by rdl.repo_id
)
select
*,
case -- categorize the dependency out-of-date-edness;
WHEN rdl.libyear >= 1.0 THEN '12 months or more' -- 1.0 is one year
WHEN rdl.libyear < 1.0 and rdl.libyear >= 0.5 THEN '<12 months, >=6 months' -- 0.5 = 1/2 year, 6mos
when rdl.libyear < 0.5 and rdl.libyear >= 0.0833 then '<6 months, >=1 month' -- 0.0833rep = 1/12 year, 1mos
when rdl.libyear < 0.0833 and rdl.libyear > 0 then '<1 month'
when rdl.libyear = 0 then 'Up to date'
else 'Unclear version history'
END as dep_age
from augur_data.repo_deps_libyear rdl,
last_collections_cte lc
where
rdl.repo_id in (36113) -- augur repo_id
and rdl.libyear >= 0
and lc.repo_id = rdl.repo_id
and lc.latest_collection_date = rdl.data_collection_date -- only use most recent collected data per repo
To fix for the error listed I decided to go ahead and work on all potential edge cases. If you think of a better way to do this lmk