owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License

Data download restrictions are not properly respected in prod #2631

Closed (larsyencken closed this 4 months ago)

larsyencken commented 5 months ago

UPDATE: this was a false alarm, caused by a misunderstanding of what isPrivate means on the Grapher datasets model.

Context

For some data sources, we are allowed to share charts of the data with the public, but not to let the public download the data directly from us. This is because the upstream data provider does not allow redistribution.

Whilst this might seem restrictive, the data provider is still doing a great public good in allowing us to visualise and write about the data for a broad audience, and we in turn are helping to bring their work to a broad audience.

Problem

It appears that data download restrictions are not honoured by the Grapher download overlay. For example:

After debugging, the issue appears to come from the ETL not generating the right metadata in our data API.

Expected behaviour

Instead of the CSV download option, we should show a note saying that the data provider disallows redistribution or, better, a link back to the original data provider so that you can get the data from them.
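
A minimal sketch of that behaviour, assuming the data API exposes a `nonRedistributable` flag together with an optional link to the provider. The `IndicatorMetadata` class, the `sourceUrl` field and the function name are hypothetical illustrations, not Grapher's actual code:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndicatorMetadata:
    # Hypothetical shape: only `nonRedistributable` is named in this issue.
    nonRedistributable: bool
    sourceUrl: Optional[str] = None  # assumed link back to the data provider


def download_overlay_message(meta: IndicatorMetadata) -> str:
    """Return what the download overlay should offer for this indicator."""
    if not meta.nonRedistributable:
        return "Download CSV"
    note = "The data provider does not allow redistribution."
    if meta.sourceUrl:
        # Prefer pointing the reader at the original provider.
        return f"{note} Get the data from: {meta.sourceUrl}"
    return note
```
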

Technical notes

Note on priority

If confirmed, this is high priority to fix, since it's essential that we respect the work of data providers and the restrictions they place on it.

larsyencken commented 5 months ago

This issue came up whilst trying to test a new chart-based API for data downloads, where we would like to throw a clean exception for the non-redistributable case.
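
As a rough illustration of what "a clean exception" could look like, here is a sketch in which the chart record is a plain dict. The exception class, the function name and the record keys other than `nonRedistributable` are assumptions for the example, not the real API:

```python
class NonRedistributableError(Exception):
    """Raised when a chart's data may be visualised but not downloaded."""


def get_chart_csv(chart: dict) -> str:
    # `chart` is a hypothetical record with the keys
    # "slug", "nonRedistributable" and "csv".
    if chart["nonRedistributable"]:
        raise NonRedistributableError(
            f"Data for chart '{chart['slug']}' cannot be redistributed; "
            "please get it from the original data provider."
        )
    return chart["csv"]
```

A caller can then catch `NonRedistributableError` specifically and show a helpful message, rather than handling a generic failure.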

larsyencken commented 5 months ago

@ikesau confirmed that Grapher respects it if the API indicates nonRedistributable = true, making this an API/ETL issue.

larsyencken commented 4 months ago

Turns out this was a false alarm due to a misunderstanding of the meaning of something being "private".

In the ETL, private means that the general public cannot access those files, except when they are published as indicators in the grapher:// step. At that stage, anything private should be marked as nonRedistributable in the metadata.
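
The rule for the grapher:// stage could be sketched as follows. This is an assumption-laden illustration: the function name and the dict-based metadata are hypothetical and do not reflect the ETL's real step code.

```python
def grapher_metadata(is_private: bool, metadata: dict) -> dict:
    """Return the metadata to publish for an indicator at the grapher step."""
    out = dict(metadata)  # copy, so the caller's dict is left untouched
    if is_private:
        # Anything private in the ETL must not be downloadable from Grapher.
        out["nonRedistributable"] = True
    return out
```
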

In Grapher, datasets marked as !isPrivate && !nonRedistributable are automatically re-published to GitHub. If something is !nonRedistributable, it means CSV download is available in Grapher.

This means !isPrivate should probably be renamed publishToGithub, and it should be false any time nonRedistributable is true.
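
The relationship between the flags, as described in this comment, can be written out as a small sketch. The function names are hypothetical; only the flag names come from the discussion above.

```python
def publish_to_github(is_private: bool, non_redistributable: bool) -> bool:
    # Re-published to GitHub only if neither restriction applies.
    return not is_private and not non_redistributable


def csv_download_available(non_redistributable: bool) -> bool:
    # CSV download in Grapher depends only on nonRedistributable.
    return not non_redistributable


# Check the stated invariant over every combination of flags:
# publishing to GitHub is never allowed when nonRedistributable is true.
for is_private in (False, True):
    for non_redistributable in (False, True):
        if non_redistributable:
            assert not publish_to_github(is_private, non_redistributable)
```

Note that a dataset can be isPrivate and still offer CSV download in Grapher, which is the distinction that caused the false alarm.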