Closed RiccardoBorchi closed 3 years ago
Hi @RiccardoBorchi
Do you mean a JSON version of this file? https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/vaccinations.csv Or are you referring to another file? (we also have country-level files and another dataset for the US states)
Hi @edomt,
yes, I mean a JSON version of that file (vaccinations.csv).
Thank you!
@edomt I think this could be really useful for people accessing the data in an API-like way.
I have recently worked on a similar thing (see https://sociepy.org/covid19-vaccination-subnational/data/api/v1/), where I had vaccination data as CSV and transformed it to get API-like data.
If you find this useful, I can work on this. I would only need some hints on where to place the script.
A possible format could be:
[
    {
        "country": "Albania",
        "iso_code": "ALB",
        "data": [
            {
                "date": "2021-01-12",
                "total_vaccinations": 128,
                "people_vaccinated": 128,
                ...
            },
            ...
        ]
    },
    {
        "country": "Andorra",
        "iso_code": "AND",
        ...
    },
]
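To illustrate how a consumer might read a file in the format proposed above, here is a minimal sketch. It assumes the file is a list of per-country objects, each with a "data" list ordered by date; the sample string only repeats the Albania record from the proposal, and the names `sample` and `by_iso` are made up for this example.

```python
import json

# Sample payload in the proposed format (only the Albania record shown above).
sample = """
[
    {
        "country": "Albania",
        "iso_code": "ALB",
        "data": [
            {"date": "2021-01-12", "total_vaccinations": 128, "people_vaccinated": 128}
        ]
    }
]
"""

records = json.loads(sample)

# Index the per-country objects by ISO code for direct lookup.
by_iso = {entry["iso_code"]: entry for entry in records}

# Assumption: the "data" list is sorted by date, so the last item is the latest.
latest = by_iso["ALB"]["data"][-1]
print(latest["date"], latest["total_vaccinations"])
```

Grouping by country keeps each time series in one place, so a dashboard can fetch one country without filtering the whole table.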
@lucasrodes Can you write in R as well? 😬
Ideally it would be part of generate_vaccinations_file.R, probably inside generate_vaccinations_file() after line 199.
Don't worry if you can't though! I can probably find time to do that (but in the next week or so, rather than in the next few days).
@edomt I will look into this, no problem! 😄
Long time no R, but happy to get my hands dirty with R again!
Just let me know if you take this and start working on it, to avoid duplicating efforts
How about (for the Python part, I unfortunately don’t know R enough to be useful):
import json
import pandas as pd

def to_json(data: pd.DataFrame) -> list:
    location_iso_codes = {tuple(x) for x in data[["location", "iso_code"]].values.tolist()}
    metrics = [column for column in data.columns if column not in {"location", "iso_code"}]
    return [
        {
            "country": location,
            "iso_code": iso_code,
            "data": data[(data.location == location) & (data.iso_code == iso_code)][metrics].values.tolist()
        }
        for location, iso_code in location_iso_codes
    ]

source = "vaccinations.csv"
destination = "vaccinations.json"
with open(destination, "w") as f:
    json.dump(pd.read_csv(source).pipe(to_json), f)
json.dump(..., indent=4) outputs a sort of pretty-printed JSON (almost 36k lines and 750KB).
Nice snippet, thanks for your proposal @ValentinMouret!
Just in case we'd go for Python, after testing it, I realized three things:
1. Some locations have no iso_code:
>>> import pandas as pd
>>> data = pd.read_csv("vaccinations.csv")
>>> data[data.iso_code.isnull()].location.unique()
array(['England', 'Northern Ireland', 'Scotland', 'Wales'], dtype=object)
Solution: Leave these locations out of the JSON. Eventually solve this in the source CSV file.
2. Some values are NaN. Solution: Replace NaN values with an empty string, or with -1 in case we want to keep them numeric.
3. The items in the data field are plain lists of values, without field names. Solution: Use to_dict instead of values.tolist().
import json
import pandas as pd

def to_json(data: pd.DataFrame) -> list:
    location_iso_codes = {tuple(x) for x in data[["location", "iso_code"]].dropna().values.tolist()}
    metrics = [column for column in data.columns if column not in {"location", "iso_code"}]
    return [
        {
            "country": location,
            "iso_code": iso_code,
            "data": (
                data[(data.location == location) & (data.iso_code == iso_code)][metrics]
                .fillna(-1).to_dict(orient="records")
            )
        }
        for location, iso_code in location_iso_codes
    ]

source = "vaccinations.csv"
destination = "vaccinations.json"
with open(destination, "w") as f:
    json.dump(pd.read_csv(source).pipe(to_json), f, indent=4)
PS: I am a bit skeptical about the -1 thing.
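One possible alternative to the -1 placeholder, offered only as a sketch: convert NaN to None so that missing values are written as JSON null. The helper name `nan_to_none` and the sample record are made up for illustration; the records are assumed to look like the output of to_dict(orient="records") without the fillna step.

```python
import json
import math

def nan_to_none(record: dict) -> dict:
    """Return a copy of the record with float NaN values replaced by None."""
    return {
        key: (None if isinstance(value, float) and math.isnan(value) else value)
        for key, value in record.items()
    }

# Hypothetical record with one missing metric (column names are illustrative).
records = [
    {
        "date": "2021-01-12",
        "total_vaccinations": 128.0,
        "people_fully_vaccinated": float("nan"),
    }
]

cleaned = [nan_to_none(record) for record in records]

# allow_nan=False makes json.dumps raise ValueError if any NaN slipped through.
print(json.dumps(cleaned, allow_nan=False))
```

With null, consumers can tell "no data" apart from a real numeric value, which a sentinel such as -1 does not guarantee.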
Good catch.
Why not add their ISO codes? Apparently, those are:
| code | subdivision name | subdivision category |
|---|---|---|
| GB-ENG | England | country |
| GB-NIR | Northern Ireland | province |
| GB-SCT | Scotland | country |
| GB-WLS | Wales [Cymru GB-CYM] | country |
I would say ENG, NIR, SCT, and WLS should do fine.
I believe the iso_code field in the data refers to ISO 3166-1 alpha-3 codes, which I just learnt are only given to so-called "sovereign countries".
In the case of England, Northern Ireland, Scotland, and Wales, the aforementioned codes would follow the ISO 3166-2 standard instead.
The ISO codes are extremely useful when crossing this data with other data. So we should probably be clear about what iso_code would contain (so far, ISO 3166-1 alpha-3). Maybe worth opening another issue...
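A quick way to see which values in iso_code do not have the alpha-3 shape is sketched below. It is a shape check only (three uppercase letters) and does not verify that a code is actually assigned in ISO 3166-1; the function name `looks_like_alpha3` is made up for this example.

```python
import re

# Three uppercase ASCII letters: the shape of an ISO 3166-1 alpha-3 code.
ALPHA3_SHAPE = re.compile(r"[A-Z]{3}")

def looks_like_alpha3(code) -> bool:
    """True if code is a string of exactly three uppercase letters."""
    return isinstance(code, str) and ALPHA3_SHAPE.fullmatch(code) is not None

# Codes mentioned in this thread, plus a missing value.
for code in ["ALB", "AND", "GB-ENG", "GBR-ENG", None]:
    print(code, looks_like_alpha3(code))
```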
Indeed, these codes are ISO 3166-1 alpha-3 codes (and they need to stay under this format, as they're used later in our pipeline for our database and charts).
We could, however, add "custom" ISO codes in vaccinations.csv for the 4 locations with missing codes: GBR-ENG, GBR-NIR, GBR-SCT, GBR-WLS (that shouldn't mess with our system since they won't be recognized).
As for R vs Python, an integration within generate_vaccinations_file.R would make the most sense, but if it's easier to write a Python script (that reads the CSV and transforms it into a JSON version) and store it in /scripts/scripts/vaccinations, I can run it automatically right after the R script.
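If the custom codes were filled in on the Python side rather than in the CSV itself, it could look roughly like this. This is a sketch, not the project's implementation: the mapping uses the four codes listed above, the function name `fill_custom_iso_codes` is hypothetical, and the sample frame stands in for vaccinations.csv.

```python
import pandas as pd

# Custom codes proposed above for the locations without an ISO 3166-1 code.
CUSTOM_ISO_CODES = {
    "England": "GBR-ENG",
    "Northern Ireland": "GBR-NIR",
    "Scotland": "GBR-SCT",
    "Wales": "GBR-WLS",
}

def fill_custom_iso_codes(data: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where missing iso_code values get the custom code, if any."""
    filled = data.copy()
    # map() yields NaN for locations outside the mapping, so those stay missing.
    filled["iso_code"] = filled["iso_code"].fillna(filled["location"].map(CUSTOM_ISO_CODES))
    return filled

# Tiny stand-in for the real CSV.
sample = pd.DataFrame(
    {
        "location": ["Albania", "England", "Wales"],
        "iso_code": ["ALB", None, None],
    }
)
filled = fill_custom_iso_codes(sample)
print(filled)
```

Existing codes are left untouched, so the step can run before the JSON conversion without affecting countries that already have an alpha-3 code.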
Thank you @RiccardoBorchi for the suggestion, and @lucasrodes (as well as @ValentinMouret) for the code!
Great @edomt, thank you!
I will use it on my dashboard.
Thanks!
Hi @edomt,
the world data are missing in the JSON file. Could you insert them?
Thanks!
FYI, there are now custom ISO codes for UK nations (see https://github.com/owid/covid-19-data/issues/702#issuecomment-800244161).
Hello,
is it possible to create a JSON file for vaccinations?
Thanks.
Best Regards.