owid/covid-19-data

Data on COVID-19 (coronavirus) cases, deaths, hospitalizations, tests • All countries • Updated daily by Our World in Data
https://ourworldindata.org/coronavirus

JSON Vaccinations file #500

Closed RiccardoBorchi closed 3 years ago

RiccardoBorchi commented 3 years ago

Hello,

is it possible to create a JSON file for vaccinations?

Thanks.

Best Regards.

edomt commented 3 years ago

Hi @RiccardoBorchi

Do you mean a JSON version of this file? https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/vaccinations.csv Or are you referring to another file? (we also have country-level files and another dataset for the US states)

RiccardoBorchi commented 3 years ago

Hi @edomt,

yes, I mean a JSON version of that file (vaccinations.csv).

Thank you!

lucasrodes commented 3 years ago

@edomt I think this could be really useful for people accessing the data in an API-like way.

I have recently worked on a similar thing (see https://sociepy.org/covid19-vaccination-subnational/data/api/v1/), where I had vaccination data as CSV and transformed it to get API-like data.

If you find this useful, I can work on this. I would only need some hints on where to place the script.

A possible format could be:

[
{
  "country": "Albania",
  "iso_code": "ALB",
  "data": [
    {
      "date": "2021-01-12",
      "total_vaccinations": 128,
      "people_vaccinated": 128,
      ...
    },
    ...
  ]
},
{
  "country": "Andorra",
  "iso_code": "AND",
  ...
},
]
edomt commented 3 years ago

@lucasrodes Can you write in R as well? 😬 Ideally it would be part of generate_vaccinations_file.R, probably inside generate_vaccinations_file() after line 199.

Don't worry if you can't though! I can probably find time to do that (but in the next week or so, rather than in the next few days).

lucasrodes commented 3 years ago

@edomt I will look into this, no problem! 😄

Long time no R, but happy to get my hands dirty with R again!

Just let me know if you take this and start working on it, so we avoid duplicating efforts.

ValentinMouret commented 3 years ago

How about (for the Python part, I unfortunately don’t know R enough to be useful):

import json

import pandas as pd

def to_json(data: pd.DataFrame) -> list:
    # Unique (location, iso_code) pairs found in the CSV.
    location_iso_codes = {tuple(x) for x in data[["location", "iso_code"]].values.tolist()}
    # Every other column is treated as a metric.
    metrics = [column for column in data.columns if column not in {"location", "iso_code"}]
    return [
        {
            "country": location,
            "iso_code": iso_code,
            "data": data[(data.location == location) & (data.iso_code == iso_code)][metrics].values.tolist()
        }
        for location, iso_code in location_iso_codes
    ]

source = "vaccinations.csv"
destination = "vaccinations.json"

with open(destination, "w") as f:
    json.dump(pd.read_csv(source).pipe(to_json), f)

json.dump(..., indent=4) outputs pretty-printed JSON (almost 36k lines and about 750 KB).

lucasrodes commented 3 years ago

Nice snippet, thanks for your proposal @ValentinMouret!

Proposed modifications

In case we go for Python: after testing the snippet, I noticed three things:

1. Some locations lack an ISO code

>>> import pandas as pd
>>> data = pd.read_csv("vaccinations.csv")
>>> data[data.iso_code.isnull()].location.unique()
array(['England', 'Northern Ireland', 'Scotland', 'Wales'], dtype=object)

Solution: Leave these locations out of the JSON for now; this could later be fixed in the source CSV file.

2. NaN values (in metrics, or from missing ISO codes) lead to invalid JSON that parsers reject

Solution: Replace NaN values with an empty string, or with -1 if we want to keep the fields numeric.

3. Preserve key names in the data field

Solution: Use to_dict instead of values.tolist().

4. Final modified code

import json

import pandas as pd

def to_json(data: pd.DataFrame) -> list:
    location_iso_codes = {tuple(x) for x in data[["location", "iso_code"]].dropna().values.tolist()}
    metrics = [column for column in data.columns if column not in {"location", "iso_code"}]
    return [
        {
            "country": location,
            "iso_code": iso_code,
            "data": (
                data[(data.location == location) & (data.iso_code == iso_code)][metrics]
                .fillna(-1).to_dict(orient="records")
            )
        }
        for location, iso_code in location_iso_codes
    ]

source = "vaccinations.csv"
destination = "vaccinations.json"

with open(destination, "w") as f:
    json.dump(pd.read_csv(source).pipe(to_json), f, indent=4)

PS: I am a bit skeptical about the -1 placeholder, though (see the alternative sketched below).
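
A possible alternative to the -1 placeholder, sketched below under the assumption that we keep to_dict(orient="records"): drop NaN-valued metrics from each record entirely, so missing data simply has no key in the output (the helper name is made up for illustration).

import math

def drop_missing_metrics(records: list) -> list:
    # Omit metrics whose value is NaN instead of writing a -1 sentinel.
    return [
        {key: value for key, value in record.items()
         if not (isinstance(value, float) and math.isnan(value))}
        for record in records
    ]

The "data" entry in the code above would then wrap the to_dict(orient="records") call with drop_missing_metrics(...) instead of calling .fillna(-1).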

ValentinMouret commented 3 years ago

Good catch.

Why not add their ISO codes? Apparently, those are:

Code     Subdivision name       Subdivision category
GB-ENG   England                country
GB-NIR   Northern Ireland       province
GB-SCT   Scotland               country
GB-WLS   Wales [Cymru GB-CYM]   country

I would say ENG, NIR, SCT, and WLS should do fine.

lucasrodes commented 3 years ago

I believe the iso_code field in the data uses ISO 3166-1 alpha-3 codes, which, I just learnt, are only assigned to so-called "sovereign countries".

In the case of England, Northern Ireland, Scotland, and Wales, the codes above would follow the ISO 3166-2 standard instead.

ISO codes are extremely useful when cross-referencing this data with other datasets, so we should probably be clear about what iso_code contains (so far, ISO 3166-1 alpha-3). Maybe worth opening another issue...

edomt commented 3 years ago

Indeed, these codes are ISO 3166-1 alpha-3 codes (and they need to stay under this format, as they're used later in our pipeline for our database and charts).

We could, however, add "custom" ISO codes in vaccinations.csv for the 4 locations with missing codes: GBR-ENG, GBR-NIR, GBR-SCT, GBR-WLS (that shouldn't mess with our system since they won't be recognized).

As for R vs Python, an integration within generate_vaccinations_file.R would make the most sense, but if it's easier to write a Python script (that reads the CSV and transforms it into a JSON version) and store it in /scripts/scripts/vaccinations, I can run it automatically right after the R script.
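
If it helps, here is a rough sketch (not the actual implementation) of what such a standalone Python script in /scripts/scripts/vaccinations could look like, using a groupby instead of per-location filtering; the function name and file paths are placeholders, and the custom UK codes are the ones proposed above, not anything already in the CSV.

import json

import pandas as pd

# Custom codes proposed above for the four UK nations missing an ISO 3166-1 code.
CUSTOM_ISO_CODES = {
    "England": "GBR-ENG",
    "Northern Ireland": "GBR-NIR",
    "Scotland": "GBR-SCT",
    "Wales": "GBR-WLS",
}

def generate_vaccinations_json(source: str = "vaccinations.csv",
                               destination: str = "vaccinations.json") -> None:
    data = pd.read_csv(source)
    # Fill the missing ISO codes with the custom ones, then drop anything still missing.
    data["iso_code"] = data["iso_code"].fillna(data["location"].map(CUSTOM_ISO_CODES))
    data = data.dropna(subset=["iso_code"])
    metrics = [c for c in data.columns if c not in {"location", "iso_code"}]
    payload = [
        {
            "country": location,
            "iso_code": iso_code,
            "data": group[metrics].fillna(-1).to_dict(orient="records"),
        }
        for (location, iso_code), group in data.groupby(["location", "iso_code"])
    ]
    with open(destination, "w") as f:
        json.dump(payload, f, indent=4)

if __name__ == "__main__":
    generate_vaccinations_json()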

edomt commented 3 years ago

This is now live: https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.json
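
For anyone who wants to consume the new file, a minimal sketch of pulling it and reading one country's latest entry (assuming the structure proposed earlier in this thread, with each location's records in chronological order):

import json
from urllib.request import urlopen

URL = ("https://raw.githubusercontent.com/owid/covid-19-data/"
       "master/public/data/vaccinations/vaccinations.json")

with urlopen(URL) as response:
    vaccinations = json.load(response)

# Each entry has "country", "iso_code" and a "data" list of per-date records.
italy = next(entry for entry in vaccinations if entry["iso_code"] == "ITA")
latest = italy["data"][-1]
print(italy["country"], latest["date"], latest.get("total_vaccinations"))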

edomt commented 3 years ago

Thank you @RiccardoBorchi for the suggestion, and @lucasrodes (as well as @ValentinMouret) for the code!

RiccardoBorchi commented 3 years ago

Great @edomt, thank you!

I will use it on my dashboard.

Thanks!

RiccardoBorchi commented 3 years ago

Hi @edomt,

the World data is missing from the JSON file. Could you add it?

Thanks!

edomt commented 3 years ago

FYI, there are now custom ISO codes for UK nations (see https://github.com/owid/covid-19-data/issues/702#issuecomment-800244161).