Closed larsyencken closed 2 years ago
@edomt Thoughts on this?
This was generally implemented in Tenerife already, but (following @maxroser's advice) I only deleted unused datasets up to the YC period (2019).
@HannahRitchie @maxroser Any thoughts on whether we should extend this to more recent datasets?
@edomt Yes, we should do a search of all datasets not being used and filter them all.
Happy to help with filtering because I probably have a lot of legacy knowledge of any we shouldn't delete.
I think anything under the `owid` namespace (uploaded through the Admin) could be deleted like this, but potentially we should be more conservative with bulk imported datasets? Especially if they are the "current" version.
Yes sorry, to clarify: I thought we were only talking about `owid` datasets here. Bulk imports should not be deleted.
A draft BigQuery SQL query for unused variables turns up quite a lot:
```sql
with var_used_in_charts as (
    select chart_id, cast(variable_id as int64) as variable_id
    from `owid-analytics.prod_owid.charts`
    cross join unnest(variable_ids) as variable_id
),
dataset_used_in_charts as (
    select
        d.id as dataset_id,
        any_value(d.name) as dataset_name,
        any_value(d.createdAt) as dataset_created_at,
        count(distinct uc.chart_id) as n_charts
    from `prod_mysql.variables` v
    inner join prod_mysql.datasets d on (v.datasetId = d.id)
    left join var_used_in_charts uc on (v.id = uc.variable_id)
    group by dataset_id
),
unused_datasets as (
    select *
    from dataset_used_in_charts
    where n_charts = 0
)
select
    dataset_id,
    dataset_name,
    dataset_created_at,
    v.id as variable_id,
    v.name as variable_name
from unused_datasets ud
inner join prod_mysql.variables v on (v.datasetId = ud.dataset_id)
where dataset_created_at < timestamp_sub(current_timestamp(), interval 180 day)
order by dataset_id, variable_id
```
Roughly 76k variables, which is half of the 150k variables that we have.
A spreadsheet of all the datasets we might remove looks like this: https://docs.google.com/spreadsheets/d/102G5QuTgzqwGAbCAK0MnKBeu6G_r5iVmFzzQpmd3FqA/edit?usp=sharing
@larsyencken you are not filtering to the `owid` namespace though, right? Hannah said bulk imports should be excluded.
Thanks for sharing @larsyencken! Dataset 65 is listed on the sheet, but it has 1 live chart. Maybe something needs to be adjusted in the query?
I wrote some notes in the next tab about a possible issue with the query -- I'll redo it using `chart_dimensions` and see if it fixes the issue. I'll also add the namespace for filtering, though I am curious about which bulk imports have never been used.
I updated the sheet again now with the fixed query, it correctly picks up and excludes dataset 65 and ~200 others that were actually being used: https://docs.google.com/spreadsheets/d/102G5QuTgzqwGAbCAK0MnKBeu6G_r5iVmFzzQpmd3FqA/edit#gid=610936695
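For reference, a minimal sketch of what the corrected usage check might look like, joining through `chart_dimensions` instead of the charts' `variable_ids` config. Table and column names here are assumptions based on the grapher MySQL schema mirrored into BigQuery, not the exact query used:

```sql
-- Sketch only: datasets with no variable referenced by any chart,
-- using chart_dimensions as the source of truth for chart/variable links.
-- Table names are assumptions based on the grapher schema mirror.
select
    d.id as dataset_id,
    any_value(d.name) as dataset_name
from prod_mysql.datasets d
inner join prod_mysql.variables v on (v.datasetId = d.id)
left join prod_mysql.chart_dimensions cd on (cd.variableId = v.id)
group by d.id
having count(distinct cd.chartId) = 0
order by dataset_id
```

The difference from the draft query is the join target: `chart_dimensions` reflects what charts actually render, so it avoids false positives from stale config fields.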
As a reminder, the rationale for looking at this is because we are trying to move to a new data model where we automatically republish anything that's non private, making it discoverable and available to any member of the public. That means that the vast majority of these datasets would become public if we migrated them. The only reason not to do this is if we think that these imported-but-unused datasets are actually not in good enough shape to republish.
We also have the option to "archive" them in place. If we did that, we would not be deleting them, but we would be hiding them from the admin, we would not migrate them, and someone would need to manually reenable a specific dataset if we wanted to use it on our site.
@HannahRitchie and @edomt How should I think about big five-year-old bulk imports like the FAO trade matrix from 2017 or the World Bank Atlas of Social Protection from the same year? Is there some cutoff after which we can say, "anything before this probably doesn't meet our data standards" and "if we need it we'll just import again", or do we want to keep and republish all bulk imports ever?
I don't think we want to keep and republish all bulk imports forever. A lot of them won't be up to our current standards, and I don't expect many people to make use of these old versions anyway. I think that bulk-imported datasets without any chart and older than 3 years could be removed. But maybe @HannahRitchie has very good reasons against doing this.
@larsyencken I agree with @edomt and have nothing to add :)
Thank you both! Prioritised it now as "important", since it affects our future API and backporting efforts, and estimated as 3d, although those days might be spread across a week or so.
Taking this over from @larsyencken to finish the last mile. I've updated his list so that bulk imported datasets younger than 3 years are not deleted (non-bulk datasets have a limit of 180 days). Changing it to 2 years would only delete `un_sdg_2019` and `faostat_2020`.
@edomt could you please quickly skim over the list here? If it looks ok I'll prepare the migration.
Thank you @Marigold!
All good on my end.
I'm assigning it back to @HannahRitchie and @maxroser because the list is quite long (1400 datasets) and I want to make sure everyone is fully on board with deleting half of our database :)
Thank you very much @Marigold!
These all look fine to delete, but I would suggest we keep all datasets in the `clioinfra` namespace for now. Do not delete those ones.
🚨 Let's not delete any of those at the moment.
Instead, it would be really useful to have a list that I could work through myself to check them all. Would that be possible?
The reason is that some of these may actually be in use. For example, we nearly deleted one which is powering a data explorer (as flagged in the internal notes).
Just an FYI for @edomt for later on. We're deleting some datasets here, such as `ilostat`, `qog`, and `oecd_stat` (which I think is fine because they're not being used). But some researchers might really benefit from updated bulk imports of these later (I don't really use them, but possibly Esteban, Bastian etc.).
Good to know, thanks – indeed I think that if researchers end up needing them, we'll want a clean import of the most up-to-date version.
> The reason is that some of these may actually be in use. For example, we nearly deleted one (https://owid.cloud/admin/datasets/5352)
Oh shoot, glad you spotted it.
> Instead, it would be really useful to have a list that I could work through myself to check them all. Would that be possible?
@HannahRitchie here's the spreadsheet, is it enough or should I prepare something different?
@larsyencken an alternative to deleting datasets would be excluding them from backporting. Might be safer for now and we can revisit deletion / archival in the future.
Ah, ok. So the explorer-powering route is something like:
We should ideally find a way to follow these links automatically and use them to prevent accidental deletion. Maybe as a first cut we can add these links to BigQuery in the analytics repo and resolve the links that way?
@Marigold Agree on archiving instead of deleting; suggest you add a flag to grapher datasets that archives them. But an archived dataset should not show up in search, and it should not be possible to publish a grapher chart from it. I.e. it should be totally hidden, and only restorable with some manual SQL.
> @HannahRitchie here's the spreadsheet, is it enough or should I prepare something different?
@Marigold Not sure if it would be possible to also include a link to the dataset as a column so that we can click through them quickly? Going to the grapher to search adds quite a bit of time. If it takes you a bit of time to do that, then it's fine to leave as it is :)
Would it be possible to give others edit access (currently it's view-only)?
@HannahRitchie my bad, added that link and an Archive (Y/N) column to the spreadsheet.
> Agree on archiving instead of deleting, suggest you add a flag to grapher datasets that archives them. But an archived dataset should not show up in search
@larsyencken good call, that's the safest way. I'm gonna add the new column `isArchived` to `datasets` in the migration.
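For illustration, the migration could look something like this. This is only a sketch, assuming MySQL syntax and the existing `datasets` table; the actual migration may differ:

```sql
-- Sketch: add an isArchived flag to datasets, defaulting to "not archived"
-- so that existing rows are unaffected.
ALTER TABLE datasets
    ADD COLUMN isArchived TINYINT(1) NOT NULL DEFAULT 0;

-- Archiving is then a plain update, reversible only via manual SQL:
-- UPDATE datasets SET isArchived = 1 WHERE id = <dataset id>;
```

A boolean column (rather than deleting rows) keeps the operation fully reversible, which matches the "restorable with some manual SQL" requirement above.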
> and should not be possible to publish a grapher chart from. I.e. it should be totally hidden, and only restorable with some manual SQL.
I'm not gonna do it as part of this task, but I'm setting up a reminder 3 months from now to check how many archived datasets were actually used in a chart.
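When that reminder fires, a check along these lines could answer the question. This is a sketch, assuming the new `isArchived` column and the standard `chart_dimensions` link table:

```sql
-- Sketch: count archived datasets that are nevertheless referenced by a chart.
select count(distinct d.id) as archived_datasets_used_in_charts
from datasets d
inner join variables v on (v.datasetId = d.id)
inner join chart_dimensions cd on (cd.variableId = v.id)
where d.isArchived = 1;
```

A result greater than zero would mean an archived dataset slipped back into use and should be unarchived or investigated.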
Thanks a lot @Marigold! I have filled out most of the spreadsheet with Y/N. Just asked a few authors to verify if their datasets can be deleted. Should have it completed in the next day :)
> I'm not gonna do it as part of this task, but I'm setting up a reminder 3 months from now to check how many archived datasets were actually used in a chart.
Perhaps we can ask someone else to help with filtering them out, once they're marked in the database as archived. The goal by the end of this cycle is to remove all coverage gaps between grapher as a data source and the ETL, I think doing this on the grapher side will help.
@Marigold – I've added a Y/N to all of the `owid` datasets now. I haven't for the bulk imports – not sure if you need this?
If there's anything else you need, let me know. Thank you.
Reassigning to @Marigold now, but I have a feeling this perhaps got done already?
That's right, it dragged because of the migration on the live server, but that's done already.
@Marigold The last checkbox here was to hide archived data from the chart builder -- did that get done too?
Shoot, sorry, reopening. There's an issue for it, but it hasn't been picked up yet (probably needs triage).
Proposal
We could enumerate, manually verify, and then delete all datasets more than six months old which are not used in any chart.
Scope
- Delete in a migration
- Archive in a migration

Rationale
We would like to move the "source of truth" of our data from MySQL to the ETL over time, which involves backporting old datasets for which we have no recipe from MySQL to the ETL, to be republished to the general public. We do not want to ship data to the public that we're not happy with, and we do not want to QA data that by definition wasn't important enough to use. So we should delete it instead.
Related issues
#1337