Delete or archive old datasets that are not used in any chart

larsyencken commented 2 years ago

Proposal

We could enumerate, manually verify, then delete all datasets >6m old which are not used in any chart.

Scope

[x] Enumerate datasets to delete (>6 month old never used non-bulk OR >3y old never used bulk)
[x] Manually verify
[x] ~~Delete in a migration~~ Archive in a migration
[x] Hide archived data from the chart builder admin

Rationale

We would like to move the "source of truth" of our data from MySQL to the ETL over time, which involves backporting old datasets for which we have no recipe from MySQL to the ETL, to be republished to the general public. We do not want to ship data to the public that we're not happy with, and we do not want to QA data that by definition wasn't important enough to use. So we should delete it instead.

Related issues

larsyencken commented 2 years ago

@edomt Thoughts on this?

edomt commented 2 years ago

This was generally implemented in Tenerife already, but (following @maxroser's advice) I only deleted unused datasets up to the YC period (2019).

@HannahRitchie @maxroser Any thoughts on whether we should extend this to more recent datasets?

HannahRitchie commented 2 years ago

@edomt Yes, we should do a search of all datasets not being used and filter them all.

Happy to help with filtering because I probably have a lot of legacy knowledge of any we shouldn't delete.

danielgavrilov commented 2 years ago

I think anything under the owid namespace (uploaded through the Admin) could be deleted like this, but potentially we should be more conservative with bulk imported datasets? Especially if they are the "current" version.

HannahRitchie commented 2 years ago

Yes sorry, to clarify: I thought we were only talking about owid datasets here. Bulk imports should not be deleted.

larsyencken commented 2 years ago

A draft SQL query that finds variables in BigQuery finds quite a lot:

with var_used_in_charts as (
    select chart_id, CAST(variable_id AS INT64) variable_id
    from `owid-analytics.prod_owid.charts`
    cross join unnest(variable_ids) as variable_id
),

dataset_used_in_charts AS (
    SELECT
        d.id dataset_id,
        any_value(d.name) dataset_name,
        any_value(d.createdAt) dataset_created_at,
        count(distinct uc.chart_id) n_charts
    FROM `prod_mysql.variables` v
    inner join prod_mysql.datasets d on (v.datasetId = d.id)
    left join var_used_in_charts uc on (v.id = variable_id)
    group by dataset_id
),

unused_datasets as (
    select * 
    from dataset_used_in_charts
    where n_charts = 0
)

select
    dataset_id,
    dataset_name,
    dataset_created_at,
    v.id variable_id,
    v.name variable_name
from unused_datasets ud
inner join prod_mysql.variables v on (v.datasetId = ud.dataset_id)
where dataset_created_at < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 180 DAY)
order by dataset_id, variable_id

Roughly 76k variables, which is half of the 150k variables that we have.

larsyencken commented 2 years ago

A spreadsheet of all the datasets we would might remove looks like this: https://docs.google.com/spreadsheets/d/102G5QuTgzqwGAbCAK0MnKBeu6G_r5iVmFzzQpmd3FqA/edit?usp=sharing

danyx23 commented 2 years ago

@larsyencken you are not filtering to the owid namespace though, right? Hannah said bulk imports should be excluded

edomt commented 2 years ago

Thanks for sharing @larsyencken! Dataset 65 is listed on the sheet, but it has 1 live chart. Maybe something needs to be adjusted in the query?

larsyencken commented 2 years ago

I wrote some notes in the next tab about a possible issue with the query -- I'll redo it using chart_dimensions and see if it fixes the issue. I'll also add the namespace for filtering, though I am curious about which bulk imports have never been used.

larsyencken commented 2 years ago

I updated the sheet again now with the fixed query, it correctly picks up and excludes dataset 65 and ~200 others that were actually being used: https://docs.google.com/spreadsheets/d/102G5QuTgzqwGAbCAK0MnKBeu6G_r5iVmFzzQpmd3FqA/edit#gid=610936695

As a reminder, the rationale for looking at this is because we are trying to move to a new data model where we automatically republish anything that's non private, making it discoverable and available to any member of the public. That means that the vast majority of these datasets would become public if we migrated them. The only reason not to do this is if we think that these imported-but-unused datasets are actually not in good enough shape to republish.

We also have the option to "archive" them in place. If we did that, we would not be deleting them, but we would be hiding them from the admin, we would not migrate them, and someone would need to manually reenable a specific dataset if we wanted to use it on our site.

@HannahRitchie and @edomt How should I think about big five-year-old bulk imports like the FAO trade matrix from 2017 or the World Bank Atlas of Social Protection from the same year? Is there some cutoff after which we can say, "anything before this probably doesn't meet our data standards" and "if we need it we'll just import again", or do we want to keep and republish all bulk imports ever?

edomt commented 2 years ago

I don't think we want to keep and republish all bulk imports forever. A lot of them won't be up to our current standards, and I don't expect many people to make use of these old versions anyway. I think that bulk-imported datasets without any chart and older than 3 years could be removed. But maybe @HannahRitchie has very good reasons against doing this.

HannahRitchie commented 2 years ago

@larsyencken I agree with @edomt and have nothing to add :)

larsyencken commented 2 years ago

Thank you both! Prioritised it now as "important", since it affects our future API and backporting efforts, and estimated as 3d, although those days might be spread across a week or so.

Marigold commented 2 years ago

Taking this over from @larsyencken to finish the last mile. I've updated his list to not delete bulk imported datasets younger than 3 years (non-bulk have limit of 180 days). Changing it to 2 years would only delete un_sdg_2019 and faostat_2020.

@edomt could you please quickly skim over the list here? If it looks ok I'll prepare the migration.

edomt commented 2 years ago

Thank you @Marigold!

All good on my end.

I'm assigning it back to @HannahRitchie and @maxroser because the list is quite long (1400 datasets) and I want to make sure everyone is fully on board with deleting half of our database :)

HannahRitchie commented 2 years ago

Thank you very much @Marigold!

Bulk imports

These all look fine to delete, but I would suggest we keep all datasets in the namespace clioinfra for now. Do not delete those ones.

owid non-bulk imports

🚨 Let's not delete any of those at the moment.

Instead, it would be really useful to have a list that I could work through myself to check them all. Would that be possible?

The reason is that some of these may actually be in use. For example, we nearly deleted one which is powering a data explorer (as flagged in the internal notes).

Bulk imports we might want later

Just an FYI for @edomt for later on. We're deleting some datasets here, such as ilostat, qog, and oecd_stat (which I think is fine because they're not being used). But some researchers might really benefit from updated bulk imports of these later (I don't really use them, but possibly Esteban, Bastian etc.).

edomt commented 2 years ago

Just an FYI for @edomt for later on. We're deleting some datasets here, such as ilostat, qog, and oecd_stat (which I think is fine because they're not being used). But some researchers might really benefit from updated bulk imports of these later (I don't really use them, but possibly Esteban, Bastian etc.).

Good to know, thanks – indeed I think that if researchers end up needing them, we'll want a clean import of the most up-to-date version.

Marigold commented 2 years ago

The reason is that some of these may actually be in use. For example, we nearly deleted one

Oh shoot, glad you spotted it.

Instead, it would be really useful to have a list that I could work through myself to check them all. Would that be possible?

@HannahRitchie here's the spreadsheet, is it enough or should I prepare something different?

@larsyencken an alternative to deleting datasets would be excluding them from backporting. Might be safer for now and we can revisit deletion / archival in the future.

larsyencken commented 2 years ago

Ah, ok. So the explorer-powering-route is something like:

Dataset is marked as "is publishable"
Dataset gets auto-published to owid-datasets https://github.com/owid/owid-datasets repo as a CSV
Explorer points at the CSV in the owid-datasets repo

We should ideally find a way to follow these links automatically and use them to prevent accidental deletion. Maybe as a first cut we can add these links to BigQuery in the analytics repo and resolve the links that way?

@marigold Agree on archiving instead of deleting, suggest you add an a flag to grapher datasets that archives them. But an archived dataset should not show up in search, and should not be possible to publish a grapher chart from. I.e. it should be totally hidden, and only restorable with some manual SQL.

On Wed, 25 May 2022 at 15:58, Mojmir Vinkler @.***> wrote:

The reason is that some of these may actually be in use. For example, we nearly deleted one https://owid.cloud/admin/datasets/5352

Oh shoot, glad you spotted it.

Instead, it would be really useful to have a list that I could work through myself to check them all. Would that be possible?

@HannahRitchie https://github.com/HannahRitchie here's the spreadsheet, is it enough or should I prepare something different?

@larsyencken https://github.com/larsyencken an alternative to deleting datasets would be excluding them from backporting. Might be safer for now and we can revisit deletion / archival in the future.

— Reply to this email directly, view it on GitHub https://github.com/owid/owid-grapher/issues/1338#issuecomment-1137280220, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAIZLBGI236DCU5RLGT3PTVLYWW3ANCNFSM5SBPT72A . You are receiving this because you were mentioned.Message ID: @.***>

HannahRitchie commented 2 years ago

@HannahRitchie here's the spreadsheet, is it enough or should I prepare something different?

@Marigold Not sure if it would be possible to also include a link to the dataset as a column so that we can click through them quickly? Going to the grapher to search adds quite a bit of time. If it takes you a bit of time to do that, then it's fine to leave as it is :)

Would it be possible to give others edit access (currently it's view only).

Marigold commented 2 years ago

@HannahRitchie my bad, added that link and column Archive (Y/N) to the spreadsheet.

Agree on archiving instead of deleting, suggest you add an a flag to grapher datasets that archives them. But an archived dataset should not show up in search

@larsyencken good call, that's the safest way. I'm gonna add the new column isArchived to datasets in the migration.

and should not be possible to publish a grapher chart from. I.e. it should be totally hidden, and only restorable with some manual SQL.

I'm not gonna do it as part of this task, but I'm setting up a reminder 3 months from now to check how many archived datasets were actually used in a chart.

HannahRitchie commented 2 years ago

Thanks a lot @Marigold! I have filled out most of the spreadsheet with Y/N. Just asked a few authors to verify if their datasets can be deleted. Should have it completed in the next day :)

larsyencken commented 2 years ago

I'm not gonna do it as part of this task, but I'm setting up a reminder 3 months from now to check how many archived datasets were actually used in a chart.

Perhaps we can ask someone else to help with filtering them out, once they're marked in the database as archived. The goal by the end of this cycle is to remove all coverage gaps between grapher as a data source and the ETL, I think doing this on the grapher side will help.

HannahRitchie commented 2 years ago

@Marigold – I've added a Y/N to all of the owid datasets now. I haven't for the bulk imports – not sure if you need this?

If there's anything else you need, let me know. Thank you.

larsyencken commented 2 years ago

Reassigning to @Marigold now, but I have a feeling this perhaps got done already?

Marigold commented 2 years ago

That's right, it dragged because of the migration on live server, but that's done already.

larsyencken commented 2 years ago

@Marigold The last checkbox here was to hide archived data from the chart builder -- did that get done too?

Marigold commented 2 years ago

Shoot, sorry, reopening. There's an issue for it, but it hasn't been picked yet (probably needs triage).

owid / owid-grapher