lisphilar / covid19-sir

CovsirPhy: Python library for COVID-19 analysis with phase-dependent SIR-derived ODE models.
https://lisphilar.github.io/covid19-sir/
Apache License 2.0

[Fix] missing dates in global Vaccinations data & colored map plot #728

Closed Inglezos closed 3 years ago

Inglezos commented 3 years ago

Summary

Running jhu_data.map(country="China", variable="Confirmed") raises the following error:

  line 1, in <module>
    jhu_data.map(country="China", variable="Confirmed")

  \cleaning\jhu_data.py", line 684, in map
    return self._colored_map_country(

  \cleaning\cbase.py", line 476, in _colored_map_country
    self._colored_map(title=title, data=df, level=self.PROVINCE, **kwargs)

  \cleaning\cbase.py", line 407, in _colored_map
    cm.plot(**find_args([gpd.GeoDataFrame.plot, ColoredMap.plot], **kwargs))

  \visualization\colored_map.py", line 88, in plot
    gdf = self._country_specific_data(data, included=included, excluded=excluded, country=country)

  \visualization\colored_map.py", line 186, in _country_specific_data
    hkg_gdf = gdf.loc[gdf[self.ISO3] == "HKG"].dissolve()

  \site-packages\geopandas\geodataframe.py", line 951, in dissolve
    aggregated_data = data.groupby(by=by).agg(aggfunc)

  \site-packages\pandas\core\frame.py", line 6508, in groupby
    raise TypeError("You have to supply one of 'by' and 'level'")

TypeError: You have to supply one of 'by' and 'level'

Codes

import covsirphy as cs

data_loader = cs.DataLoader(directory="kaggle/input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()
oxcgrt_data = data_loader.oxcgrt()

jhu_data.map(country="China", variable="Confirmed")

Environment

lisphilar commented 3 years ago

The error seems to come from geopandas.GeoDataFrame.dissolve(). In my environment, no error was raised and I got the figure successfully. I used version 2.19.1-beta-fu1, but there were no changes to JHUData between 2.19.1-beta and 2.19.1-beta-fu1.

import covsirphy as cs
data_loader = cs.DataLoader(directory="input")
jhu_data = data_loader.jhu()
jhu_data.map(country="China", variable="Confirmed")

Figure_1

I used GeoPandas 0.9.0 and Pandas 1.1.5.

import geopandas as gpd
gpd.__version__
import pandas as pd
pd.__version__
Inglezos commented 3 years ago

Alright, I updated the geopandas dependencies and the error is finally resolved. However, if I run

import covsirphy as cs

data_loader = cs.DataLoader(directory="kaggle/input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()
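oxcgrt_data = data_loader.oxcgrt()
# vaccine_data must be created before calling vaccine_data.map()
vaccine_data = data_loader.vaccine()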

jhu_data.map(country="Japan", variable="Confirmed", date="15Apr2021")
vaccine_data.map(date="15Apr2021")

the results are: image image

The Japan result is from my PC; the Africa vaccinations map is from Colab. On my PC the vaccinations map was totally wrong: almost all countries were dashed grey, as if missing. Is this connected to the date argument, or does it come from the geopandas dependencies? I use the same versions of geopandas and pandas as you did. However, if that were the problem, it should not happen in Colab, right?

Inglezos commented 3 years ago

I tried to debug it locally on my PC and found something odd about the countries included in the df: test1 test2

The first image was taken at a breakpoint on the line df = self._cleaned_df.copy() (after stepping over it), so it shows the df right after it was assigned self._cleaned_df.copy(). The second image was taken right before plotting, on the line self._colored_map(title=title, data=df, level=self.COUNTRY, **kwargs).

It seems the map only includes countries whose names start with the letters A-F, for some reason. The plot produced right afterwards is this: image

which seems to confirm this suspicion.

The code I ran, ending with vaccine_data.map(date="10Apr2021"), is:

import matplotlib.pyplot as plt
import pandas as pd
import covsirphy as cs

cs.get_version()
get_ipython().run_line_magic('matplotlib', 'inline')
pd.plotting.register_matplotlib_converters()

# Matplotlib
plt.style.use("seaborn-ticks")
plt.rcParams["xtick.direction"] = "in"
plt.rcParams["ytick.direction"] = "in"
plt.rcParams["font.size"] = 11.0
plt.rcParams["figure.figsize"] = (9, 6)
plt.rcParams["figure.dpi"] = (120)

data_loader = cs.DataLoader(directory="kaggle/input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()
pcr_data = data_loader.pcr()
oxcgrt_data = data_loader.oxcgrt()
vaccine_data = data_loader.vaccine()

vaccine_data.map(date="10Apr2021")

I ran this on my PC (Windows 10) with covsirphy version "CovsirPhy v2.19.1-gamma-fu2".

lisphilar commented 3 years ago

.map() works for me in my local environment (Ubuntu with "Windows Subsystem for Linux" and Python 3.9.2).

import covsirphy as cs
data_loader = cs.DataLoader()
jhu_data = data_loader.jhu()
jhu_data.map(country="Japan", variable="Confirmed", date="15Apr2021")
vaccine_data = data_loader.vaccine()
vaccine_data.map(date="10Apr2021")

Figure_1 Figure_2

In your environment, does vaccine_data._cleaned_df.Country.unique().tolist() not include G-Z countries?

Inglezos commented 3 years ago

In your environment, does vaccine_data._cleaned_df.Country.unique().tolist() not include G-Z countries?

['Afghanistan', 'Africa', 'Albania', 'Algeria', 'Andorra', 'Angola', 'Anguilla', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Asia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Chile', 'China', 'Colombia', 'Congo', 'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Curacao', 'Cyprus', 'Czechia', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'England', 'Equatorial Guinea', 'Estonia', 'Eswatini', 'Ethiopia', 'Europe', 'European Union', 'Faeroe Islands', 'Falkland Islands', 'Fiji']
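A quick way to summarize a list like the one above is to collect the initial letters of the country names. This is only a minimal sketch: the helper name initial_letters is hypothetical, and the small DataFrame is a stand-in for vaccine_data._cleaned_df, not real data.

```python
import pandas as pd


def initial_letters(countries):
    """Return the sorted set of uppercase first letters found in the names."""
    return sorted({name[0].upper() for name in countries if name})


# Hypothetical stand-in for vaccine_data._cleaned_df
df = pd.DataFrame({"Country": ["Afghanistan", "Brazil", "Egypt", "Fiji"]})
letters = initial_letters(df["Country"].unique().tolist())
print(letters)  # ['A', 'B', 'E', 'F']
```

If the last letter printed is "F", the dataset was cut off in the same way as described here.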

I think I have found what's going on. The vaccinations source from OWID has changed drastically. If you check the file we use, ourworldindata_vaccine.csv, manually (last updated today after a forced download), it contains only A-F countries and corresponds to aggregated data. This is what we use as the data source for vaccinations, right?

In https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations, there is a completely new folder named country_data for per-country vaccinations, with a separate CSV file for each country.

lisphilar commented 3 years ago

We use "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv" and this includes G-Z countries. Repo: https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/vaccinations.csv

Inglezos commented 3 years ago

image That's the last entry right after the line rec_df = self.load(self.URL_REC, columns=list(set(rename_dict) - set(["vaccines"]))) runs during the data download.

Also if I check the download dataset manually: image

It seems to be a problem during data downloading in the data loader. I see that cbase.load() uses dask's dd.read_csv(). Could it be a problem in dask?

lisphilar commented 3 years ago

This does not occur in my environment... Could you check the outputs of cbase.load() and dd.read_csv() with the URL? https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv

Inglezos commented 3 years ago

I changed line return dd.read_csv(urlpath, blocksize=None, **kwargs).compute() to return pd.read_csv(urlpath, **kwargs) and now it reads all the countries.
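A minimal sketch of that pandas-based read follows, with a check that the last country in alphabetical order was actually loaded. The helper name load_with_pandas, the column names, and the in-memory sample with made-up values are all assumptions for illustration; the real code reads from the URL discussed above.

```python
import io

import pandas as pd


def load_with_pandas(urlpath, **kwargs):
    """Read the whole CSV in one pass with pandas.

    urlpath may be a URL, a local path or a file-like object.
    """
    return pd.read_csv(urlpath, **kwargs)


# Tiny in-memory stand-in for the downloaded CSV (values are made up).
sample = io.StringIO(
    "location,date,total_vaccinations\n"
    "Algeria,2021-02-19,10\n"
    "Japan,2021-04-15,20\n"
    "Zimbabwe,2021-04-15,30\n"
)
df = load_with_pandas(sample, usecols=["location", "date", "total_vaccinations"])
# If loading was truncated, the alphabetically last country would be an early letter.
last_country = sorted(df["location"].unique())[-1]
print(last_country)  # Zimbabwe
```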

Nevertheless, the Africa vaccination maps look odd as the date progresses from 10Apr21 to today:

vaccine_data.map(date="10Apr2021")
vaccine_data.map(date="12Apr2021")
vaccine_data.map(date="15Apr2021")
vaccine_data.map(date="18Apr2021")
vaccine_data.map(date="20Apr2021")
vaccine_data.map(date="22Apr2021")
vaccine_data.map(date="23Apr2021")
vaccine_data.map()

image image image image image image image image

I ran this also in Colab and I get the same results for Africa vaccinations map.

Could it be a source data problem, or the date argument in map()?

Note that for both vaccine_data.map(date="23Apr2021") and vaccine_data.map(), the title of the plot is the same but the results are different (last two plots).

[Update:] It seems to be a data source problem. For example, for South Africa: image

The data goes only up to 13 April (instead of 24Apr21/today), and some past values are missing as well. I also noticed that the raw dataset contains a country name "Africa". Could the actual vaccinations be aggregated under that name instead, given that Africa is the whole continent?

Do we apply any kind of complement/fillna to the vaccinations?

Inglezos commented 3 years ago

I have found the problem. For many countries on the African continent (at least; I haven't checked whether this happens on other continents as well), the vaccinations source data do not continue up to today. This means that each country has its own date range. So when we plot a snapshot of the vaccinations for one day on a map, that day might not fall within the date range of every country.

For example, South Africa has vaccine records up to 13Apr21, Algeria up to 19Feb21, Botswana up to 10Apr21, Mali up to 12Apr21, and so on. When we call map() for, say, 08Apr21, South Africa, Botswana and Mali will have valid color codes/values, but Algeria will not. If we call map() for 15Apr21, all of these countries will appear invalid, because 15Apr21 is outside their date ranges.

What we need to do, for every country whose end date is earlier than today, is extend the last date's vaccine value up to today, filling all the days in between with that last value (ffill).
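The idea can be sketched with pandas as follows. This is not the actual VaccineData._cleaning() code: the helper name extend_to_date, the column names and the values are assumptions for illustration, and only the end dates for Algeria and Mali are taken from the examples above.

```python
import pandas as pd


def extend_to_date(df, end_date, country_col="Country", date_col="Date"):
    """Return a copy with one row per country per day up to end_date.

    Days without a record inherit the most recent earlier record (ffill).
    Assumes at most one record per country per date.
    """
    end_date = pd.Timestamp(end_date)
    frames = []
    for _, group in df.groupby(country_col):
        group = group.set_index(date_col).sort_index()
        # Never truncate: stop at end_date or the last record, whichever is later.
        last_day = max(group.index.max(), end_date)
        full_index = pd.date_range(group.index.min(), last_day, freq="D")
        # reindex adds empty rows for missing days; ffill copies the last known values.
        filled = group.reindex(full_index).ffill()
        frames.append(filled.rename_axis(date_col).reset_index())
    return pd.concat(frames, ignore_index=True)


# Made-up values; each country stops reporting on a different day.
records = pd.DataFrame({
    "Country": ["Algeria", "Algeria", "Mali"],
    "Date": pd.to_datetime(["2021-02-17", "2021-02-19", "2021-04-12"]),
    "Vaccinations": [10, 30, 50],
})
extended = extend_to_date(records, "2021-04-15")
snapshot = extended.loc[extended["Date"] == pd.Timestamp("2021-04-15")]
print(snapshot)
```

With this, a snapshot for 15Apr21 contains both countries, each carrying its last reported value. Note that the filled numeric column becomes float, because reindexing introduces missing values before the forward fill.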

lisphilar commented 3 years ago

Do you mean the vaccination data is not updated regularly (on a daily basis)? Or that updating has stopped for some countries? In the former case, please consider updating the VaccineData._cleaning() method.

Inglezos commented 3 years ago

I mean that for some countries there are no daily records at all for vaccinations. Not only is the number of vaccinations missing, but the complete row/record, with the date and the other fields, is missing as well, and this happens over many days up to today.

lisphilar commented 3 years ago

If you have time, could you try updating VaccineData._cleaning() to solve this issue?

Inglezos commented 3 years ago

Reworked VaccineData._cleaning(): extended the vaccine data up to today for missing records.

image image

The missing countries in Africa for 10Apr21 are expected, since data for Congo or Libya, for example, were not yet available at that point. In today's plot however, Congo should have values, I don't know why it doesn't plot them, could it be a geopandas issue? You can check manually that the dates are all filled with the last value up to today by calling vaccine_data._cleaned_df or .subset() (the horizontal legend is personal preference).

lisphilar commented 3 years ago

Thank you for your pull request!! I added some comments just for speed and refactoring.

Congo should have values, I don't know why it doesn't plot them, could it be a geopandas issue?

It may be a mismatch between the geometry information (gpd.datasets.get_path("naturalearth_lowres") in line 216 of colored_map.py) and .subset().
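One way to test that guess is to compare the join keys on both sides. The following is a minimal sketch only: the helper name unmatched_keys is hypothetical, and the placeholder key lists stand in for the values that would come from the cleaned data and from the geometry table.

```python
def unmatched_keys(data_keys, geometry_keys):
    """Return keys present in the data but absent from the geometry table, sorted."""
    return sorted(set(data_keys) - set(geometry_keys))


# Placeholder keys (could be country names or ISO3 codes in practice).
data_keys = ["AAA", "BBB", "CCC"]
geometry_keys = ["AAA", "CCC"]
missing = unmatched_keys(data_keys, geometry_keys)
print(missing)  # ['BBB']
```

Any key listed here has data but no matching geometry, so it would be drawn as missing on the map.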

the horizontal legend is personal preference

Oh, we discussed the location of the legend somewhere, but I forgot to move it. Could you update the ColoredMap class in a new issue?