Closed Inglezos closed 3 years ago
The error seems to be an error in geopandas.GeoDataFrame.dissolve()?
In my environment, no error was raised and I got a figure successfully. I used version 2.19.1-beta-fu1, but there were no changes regarding JHUData from 2.19.1-beta to 2.19.1-beta-fu1.
import covsirphy as cs
data_loader = cs.DataLoader(directory="input")
jhu_data = data_loader.jhu()
jhu_data.map(country="China", variable="Confirmed")
I used GeoPandas 0.9.0 and Pandas 1.1.5.
import geopandas as gpd
gpd.__version__
import pandas as pd
pd.__version__
Alright, I updated the geopandas dependencies and the error is finally resolved. However, if I run
import covsirphy as cs
data_loader = cs.DataLoader(directory="kaggle/input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()
oxcgrt_data = data_loader.oxcgrt()
vaccine_data = data_loader.vaccine()
jhu_data.map(country="Japan", variable="Confirmed", date="15Apr2021")
vaccine_data.map(date="15Apr2021")
the results are as follows. The Japan result is from my PC, and the Africa vaccinations result is from Colab. On my PC the vaccinations map was totally wrong: almost all countries were shown in dashed grey as missing. Is this connected to the date argument, or does it come from the geopandas dependencies? I use the same versions of geopandas and pandas as you did. However, if that were the problem, this should not happen in Colab, right?
I tried to debug it locally on my PC and found something weird about the countries that are included in the df:
The first image is from a breakpoint at the line df = self._cleaned_df.copy() (with step over), so it shows the df right after it was assigned self._cleaned_df.copy(), and the second image is right before plotting, at the line self._colored_map(title=title, data=df, level=self.COUNTRY, **kwargs).
It seems that the map includes only the countries starting with the letters A-F for some reason. The plot produced right after seems to confirm this bug assumption.
The code I ran is vaccine_data.map(date="10Apr2021"):
import matplotlib.pyplot as plt
import pandas as pd
import covsirphy as cs
cs.get_version()
get_ipython().run_line_magic('matplotlib', 'inline')
pd.plotting.register_matplotlib_converters()
# Matplotlib
plt.style.use("seaborn-ticks")
plt.rcParams["xtick.direction"] = "in"
plt.rcParams["ytick.direction"] = "in"
plt.rcParams["font.size"] = 11.0
plt.rcParams["figure.figsize"] = (9, 6)
plt.rcParams["figure.dpi"] = (120)
data_loader = cs.DataLoader(directory="kaggle/input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()
pcr_data = data_loader.pcr()
oxcgrt_data = data_loader.oxcgrt()
vaccine_data = data_loader.vaccine()
vaccine_data.map(date="10Apr2021")
This was on my PC (Windows 10), with covsirphy version "CovsirPhy v2.19.1-gamma-fu2".
.map() works for me in my local environment (Ubuntu with "Windows Subsystem for Linux" and Python 3.9.2).
import covsirphy as cs
data_loader = cs.DataLoader()
jhu_data = data_loader.jhu()
jhu_data.map(country="Japan", variable="Confirmed", date="15Apr2021")
vaccine_data = data_loader.vaccine()
vaccine_data.map(date="10Apr2021")
In your environment, vaccine_data._cleaned_df.Country.unique().tolist() does not include G-Z countries?
In your environment, vaccine_data._cleaned_df.Country.unique().tolist() does not include G-Z countries?
['Afghanistan', 'Africa', 'Albania', 'Algeria', 'Andorra', 'Angola', 'Anguilla', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Asia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Chile', 'China', 'Colombia', 'Congo', 'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Curacao', 'Cyprus', 'Czechia', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'England', 'Equatorial Guinea', 'Estonia', 'Eswatini', 'Ethiopia', 'Europe', 'European Union', 'Faeroe Islands', 'Falkland Islands', 'Fiji']
I think I have found what's going on. The vaccinations source from OWID has changed drastically. If you check the file we use, ourworldindata_vaccine.csv, manually (last updated today; force a download), it contains only A-F and corresponds to aggregated data. This is what we use as the data source for vaccinations, right?
In https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations, there is a completely new folder named country_data for the countries' vaccinations, with a separate CSV file for each country.
We use "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv" and this includes G-Z countries. Repo: https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/vaccinations.csv
That's the last entry right after the line rec_df = self.load(self.URL_REC, columns=list(set(rename_dict) - set(["vaccines"]))) during data download.
I see the same when I check the downloaded dataset manually.
It seems to be a problem during data downloading in the dataloader.
I see that cbase.load() uses the dask dd.read_csv(). Could it be a problem in dask?
It does not occur in my environment...
Could you check the outputs of cbase.load() and dd.read_csv() with the URL?
https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv
I changed the line return dd.read_csv(urlpath, blocksize=None, **kwargs).compute() to return pd.read_csv(urlpath, **kwargs) and now it reads all the countries.
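A quick way to confirm whether a load was truncated is to compare the initial letters of the country names returned by two loaders. The following is only a sketch using the standard library on in-memory text: the helper name country_initials, the "location" column name, and the sample rows are assumptions for illustration, not CovsirPhy code or real data.

```python
import csv
import io


def country_initials(csv_text, column="location"):
    """Return the sorted first letters of the values in the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sorted({row[column][0].upper() for row in reader if row[column]})


# Hypothetical miniature datasets: a complete load and a truncated one.
full_csv = "\n".join([
    "location,date,total_vaccinations",
    "Albania,2021-04-10,100",
    "Fiji,2021-04-10,50",
    "Japan,2021-04-10,900",
    "Zambia,2021-04-10,10",
])
truncated_csv = "\n".join([
    "location,date,total_vaccinations",
    "Albania,2021-04-10,100",
    "Fiji,2021-04-10,50",
])

full = country_initials(full_csv)
partial = country_initials(truncated_csv)
# Initial letters present in the complete load but absent from the truncated one
missing = sorted(set(full) - set(partial))
print(missing)
```

Running the same helper on the text returned by each reader would show at a glance whether one of them stops at "F".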
Nevertheless, the Africa data visualization for vaccines seems weird when progressing from 10Apr21 to today:
vaccine_data.map(date="10Apr2021")
vaccine_data.map(date="12Apr2021")
vaccine_data.map(date="15Apr2021")
vaccine_data.map(date="18Apr2021")
vaccine_data.map(date="20Apr2021")
vaccine_data.map(date="22Apr2021")
vaccine_data.map(date="23Apr2021")
vaccine_data.map()
I also ran this in Colab and I get the same results for the Africa vaccinations map.
Could it be a source data problem or the date argument in map()?
Note that for both vaccine_data.map(date="23Apr2021") and vaccine_data.map(), the title of the plot is the same but the results are different (last two plots).
[Update:] It seems to be a data source problem. For example, for South Africa the data goes only up to the 13th of April (instead of 24Apr21/today) and some past values are missing as well. I also noticed that the raw dataset contains a country name "Africa". Could the actual vaccinations be aggregated under that name instead? Africa is the whole continent, after all.
Do we apply any kind of complement/fillna to the vaccinations?
I have found the problem. For many countries on the African continent (at least; I haven't checked whether this happens for other continents as well), the vaccinations source data do not continue up to today. This means that each country has its own date range. So when we project a snapshot of the vaccinations for one day on a map, this day might not be included in the date range of every country.
For example, South Africa has vaccine records up to 13Apr21, Algeria up to 19Feb21, Botswana up to 10Apr21, Mali up to 12Apr21 and so on. When we call map() for, let's say, 08Apr21, we will see valid color codes/values for South Africa, Botswana and Mali, but not for Algeria. If we call map() for 15Apr21, all these countries will appear invalid, because 15Apr21 is outside their date ranges.
What we need to do, for every country whose end date is earlier than today's date, is to extend the last date's vaccine value up to today, including all the in-between days, while keeping the last value constant (ffill).
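The forward-fill idea above could be sketched with pandas roughly as follows. This is not the actual rework of VaccineData._cleaning(): the function name extend_to_date, the "Vaccinations" column name and the sample numbers are hypothetical, and the sketch assumes at most one record per country per day.

```python
import pandas as pd


def extend_to_date(df, end_date, country_col="Country", date_col="Date"):
    """Put each country's records on a daily grid up to end_date and forward-fill."""
    end_date = pd.Timestamp(end_date)
    frames = []
    for country, group in df.groupby(country_col):
        series = group.drop(columns=country_col).set_index(date_col).sort_index()
        # Daily index from the country's first record up to the common end date
        full_index = pd.date_range(series.index.min(), end_date, freq="D")
        # Missing days become NaN rows, then take the last known value
        filled = series.reindex(full_index).ffill().rename_axis(date_col)
        filled[country_col] = country
        frames.append(filled.reset_index())
    return pd.concat(frames, ignore_index=True)


# Hypothetical records: each country stops reporting on a different day.
raw = pd.DataFrame({
    "Country": ["South Africa", "South Africa", "Algeria"],
    "Date": pd.to_datetime(["2021-04-12", "2021-04-13", "2021-02-19"]),
    "Vaccinations": [280_000, 290_000, 75_000],
})
extended = extend_to_date(raw, "2021-04-15")
print(extended[extended["Date"] == "2021-04-15"])
```

With this kind of extension, a map for 15Apr21 would show the last reported value for each country instead of marking it as missing. The trade-off is that stale values look current, so the last real reporting date may be worth keeping somewhere.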
Do you mean vaccination data is not updated regularly (on a daily basis)? Or has updating stopped for some countries? In the former case, please consider updating the VaccineData._cleaning() method.
I mean that for some countries there are no daily records at all for vaccinations. Not only is the number of vaccinations missing, but the complete row/record with the date and the rest of the info is missing too, and this happens over many days up to today.
If you have time, could you try updating VaccineData._cleaning() to solve this issue?
Reworked VaccineData._cleaning()
- Extended vaccine data up until today for missing records.
The missing countries in Africa for 10Apr21 are expected, since data for Congo or Libya, for example, were not yet available at that time. In today's plot, however, Congo should have values, I don't know why it doesn't plot them, could it be a geopandas issue? You can check manually that the dates are all filled with the last data up to today if you call vaccine_data._cleaned_df or .subset() (the horizontal legend is personal preference).
Thank you for your pull request!! I added some comments just for speed and refactoring.
Congo should have values, I don't know why it doesn't plot them, could it be a geopandas issue?
It may be a mismatch between the geometry information (gpd.datasets.get_path("naturalearth_lowres") in line 216 of colored_map.py) and .subset().
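One way to check for such a mismatch is to compare the country names in the data with the names in the geometry table and list the ones that have no counterpart. The following is a minimal sketch: the helper unmatched_names and both name lists are hypothetical and are not read from naturalearth_lowres or from .subset().

```python
def unmatched_names(data_names, geometry_names):
    """Return names present in the data but absent from the geometry table."""
    return sorted(set(data_names) - set(geometry_names))


# Hypothetical name lists, for illustration only
data_names = ["Congo", "Japan", "Czechia"]
geometry_names = ["Republic of the Congo", "Japan", "Czech Rep."]

print(unmatched_names(data_names, geometry_names))
```

Any name listed here would be dropped when the data is joined to the geometries on the country name, so it would appear as missing on the map even though values exist.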
the horizontal legend is personal preference
Oh, we discussed the location of the legend somewhere, but I forgot to move it. Could you update the ColoredMap class with a new issue?
Summary
Upon execution of jhu_data.map(country="China", variable="Confirmed"), it throws an error.
Codes
Environment