owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License

Add at least one dataset level tag to all datasets that have new metadata #1868

Closed danyx23 closed 9 months ago

danyx23 commented 11 months ago

For the new datapages it would be extremely useful if we could guarantee that every indicator has at least one topic_tag assigned. This will ensure that we can link our topic pages with our data pages and help in showing the most relevant content in various places.

The easiest way to ensure this is to set one or more tags at the dataset level. This is not always appropriate but works in many cases. Tags have to match a topic tag in the database (case sensitive). The schema has been updated to include the current topic tags, so you should get autocomplete in the yaml file for this field. The only non-topic tag that can be used is "Uncategorized", which is meant for cases where you explicitly don't want to assign a topic (e.g. useful in cases like the WDI where every indicator needs a human- or AI-assigned tag and you don't have the time to do it yet).

The easiest way to assign a tag to all indicators in a dataset is to set it via the common section in the yaml file:

definitions:
  common:
    presentation:
      topic_tags:
      - Poverty
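
Since tags must match the database's topic tags exactly (case sensitive), a quick check before committing can catch typos. The following is only a minimal sketch, not part of the ETL codebase: `KNOWN_TOPIC_TAGS` is a hypothetical, hand-picked subset of the real tag list, `invalid_topic_tags` is an illustrative helper name, and the metadata is assumed to be already parsed from yaml into a plain dict.

```python
# Hypothetical subset of valid topic tags; the real list lives in the database.
KNOWN_TOPIC_TAGS = {"Poverty", "State Capacity", "Uncategorized"}


def invalid_topic_tags(metadata: dict, known_tags: set) -> list:
    """Return dataset-level topic tags that do not exactly match a known tag."""
    tags = (
        metadata.get("definitions", {})
        .get("common", {})
        .get("presentation", {})
        .get("topic_tags", [])
    )
    # Plain membership test, so the comparison is case sensitive.
    return [tag for tag in tags if tag not in known_tags]


# Parsed form of the yaml snippet above (assumed structure).
good = {"definitions": {"common": {"presentation": {"topic_tags": ["Poverty"]}}}}
# Same tag with the wrong case, which should be flagged.
bad = {"definitions": {"common": {"presentation": {"topic_tags": ["poverty"]}}}}

print(invalid_topic_tags(good, KNOWN_TOPIC_TAGS))
print(invalid_topic_tags(bad, KNOWN_TOPIC_TAGS))
```

The first call should report no problems, while the second should flag `poverty` because its case differs from the known tag.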

As of today we have about 100 datasets that have at least one indicator with new metadata. The list below has a checkmark to indicate whether a tag has been added, followed by the number of indicators the dataset has and the dataset name. The name is linked to the probable location of the metadata yaml file that you need to edit with the above snippet.

@lucasrodes

From Lucas

(working on #1869)

From Veronika

(https://github.com/owid/etl/pull/1872)

From Fiona

(https://github.com/owid/etl/pull/1872)

@pabloarosado

From Pablo R

From Mojmir

From Pablo A

Others

Datasette query for the above list

The following datasette query was used to arrive at the above list:

```sql
with datasets_with_counts as (
  select
    d.name as name,
    count(*) as variableCount,
    json_group_array(distinct v.catalogPath) as paths
  from datasets d
  join variables v on v.datasetId = d.id
  where v.schemaVersion = 2
  group by d.id
  order by count(*) desc
)
SELECT name, variableCount, json_each.value AS path
FROM datasets_with_counts t, json_each(t.paths)
```

lucasrodes commented 11 months ago

I've restructured the list, making it easier to work on my datasets. I'd suggest you do the same, @pabloarosado. With the ones remaining, we can either distribute them among us both or let others know. Can decide once we're finished with ours.

pabloarosado commented 11 months ago

I've added tags to all datasets I've worked on: https://github.com/owid/etl/pull/1870 I've realised some of the datasets listed above are still using old metadata (e.g. Energy mix). I haven't added topic tags to those.

We can just split the list half and half between us, and then add the others as reviewers.

pabloarosado commented 11 months ago

I see that grapher steps are failing (also in my local grapher). It's related to grapher_model, something connected to the displayOrder of the topic tags. Maybe some migration needs to be done in the database. @danyx23 were you expecting this or is it a bug?

lucasrodes commented 11 months ago

@pabloarosado when are you getting this error?

I was getting this error when using Jinja on topic_tags (see my comment).

Could be related to this update https://github.com/owid/etl/pull/1863? @Marigold

Marigold commented 11 months ago

@pabloarosado sorry, my bad! Could you please rebase on top of the master?

lucasrodes commented 11 months ago

@pabloarosado I've re-structured the list again to clarify which datasets each of us is tackling.

I've finished my part, you can review my changes here:

I've created an additional PR with the changes affecting datasets from others (no need to review this):

pabloarosado commented 11 months ago

It's not clear to me what to do with auxiliary indicators, like population or GDP, that we sometimes have within a dataset. For example, natural disasters has those indicators. These indicators only exist because it's convenient (for us and for superusers downloading the dataset) to have population and GDP next to the per capita and per GDP variables. They're useful for sanity checks.

So, I'll tag them as "Uncategorized". Do you agree @lucasrodes ?

lucasrodes commented 11 months ago

Yeah, that sounds reasonable, @pabloarosado!

pabloarosado commented 11 months ago

I've ticked off most of the datasets in the list, since they were using the old metadata (including WDI). I understand that they don't need to be tagged.

@paarriagadap among the tags in this list, which ones would you assign (with the first tag being the most important one) to the indicators of:

@spoonerf among the tags in this list, which ones would you assign (with the first tag being the most important one) to the indicators of:

Thanks both!

spoonerf commented 11 months ago

Hey @pabloarosado!

I would go with Child & Infant Mortality for the United Nations Inter-agency Group for Child Mortality Estimation dataset.

Just be aware that the Datasette link you shared is showing only the first 101 topic tags ordered by id, but there are 128 topic tags. Maybe this link is better.

lucasrodes commented 11 months ago

@pabloarosado I'd use 'State Capacity' for 'State Capacity Dataset' and 'Colonial Dates Dataset'. And possibly 'Human Development Index (AHDI)' for the 'Augmented Human Development Index'.

paarriagadap commented 11 months ago

Hi! I was away last Friday. Yes, the ones described by @lucasrodes are the ones to use. It's Human Development Index (HDI), by the way.

pabloarosado commented 11 months ago

Hey @lucasrodes I think the two PRs are ready to be merged (assuming that there are no other new CI surprises with random steps). If everything goes well, shall we merge by the end of the day?

pabloarosado commented 9 months ago

This issue was completed in late October, but I suppose we forgot to close it.