[Fix] DataEngineer().subset(country=None) calculates total values without filling NAs

shik-design commented 1 year ago

Summary of question

Good day! How can I analyze global data with DataEngineer()?

Code

import covsirphy as cs
eng = cs.DataEngineer()
eng.download(country=None, province=None)
eng.clean()
eng.transform()
actual_df, status, _ = eng.subset(geo=("Nigeria",), variables="SIRF", complement=True)
print(status)
actual_df.tail()

I need something like geo=("Global",), if such an option exists.

I want to test the model at the global scale.

lisphilar commented 1 year ago

Thank you for your question! Please try the following script and let me know whether it is suitable for your analysis. If it is, I will update the internal code of covsirphy.

import covsirphy as cs
eng = cs.DataEngineer()
eng.download(country=None, province=None)
eng.clean()
eng.transform()
# Get country-level data
top_df = eng.layer(geo=None, variables="SIRF").drop(["Province", "City"], axis=1)
# Fill in the NAs that some countries (top-level administrative units) have
variables = list(set(top_df.columns) - set(["ISO3", "Date"]))
pivot_df = top_df.pivot_table(values=variables, index="Date", columns="ISO3", aggfunc="last")
filled_df = pivot_df.ffill().fillna(0).stack().reset_index()
# Recreate DataEngineer() instance with the filled data
eng2 = cs.DataEngineer(layers=["ISO3"])
eng2 = eng2.register(filled_df, citation=eng.citations())
eng2.inverse_transform()
# Get data at global scale
actual_df, status, _ = eng2.subset(geo=None, variables="SIRF", complement=True)
print(status)
actual_df.tail()

We can use .subset(geo=None) or .subset(geo=(None,)) to get data at the global scale, but they simply sum the country-level values on each date. This causes problems because the first and last record dates differ between countries.
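To illustrate the problem with plain pandas, here is a minimal sketch on toy data. The country codes AAA and BBB and all numbers are made up for illustration, and this is not the internal code of covsirphy. It compares a naive per-date sum with a sum taken after forward-filling each country.

```python
import pandas as pd

# Toy cumulative counts (hypothetical): BBB starts reporting one day late
# and stops one day early compared with AAA.
records = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04",
             "2022-01-02", "2022-01-03"]
        ),
        "ISO3": ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB"],
        "Confirmed": [10, 20, 30, 40, 5, 15],
    }
)

# Naive global total: sum whatever rows exist on each date.
naive_total = records.groupby("Date")["Confirmed"].sum()

# Filled global total: one column per country, carry the last known value
# forward, and treat dates before the first report as zero.
wide = records.pivot_table(
    values="Confirmed", index="Date", columns="ISO3", aggfunc="last"
)
filled_total = wide.ffill().fillna(0).sum(axis=1).astype(int)

print(pd.DataFrame({"naive": naive_total, "filled": filled_total}))
```

In this toy case the naive total falls from 45 to 40 on the last date because BBB has no record there, while the filled total keeps increasing to 55, as a cumulative count should.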

shik-design commented 1 year ago

It worked very well! No issues at all. For the sake of experimentation, I think you could also reuse the already downloaded files in the "input" folder.

How can I do that without downloading the files every time? In previous versions of covsirphy we used cs.DataLoader("input", update_interval=24).

lisphilar commented 1 year ago

@shik-design Thank you for your confirmation! We will be able to run the following code after the 2.27.1 release.

import covsirphy as cs
eng = cs.DataEngineer()
eng.download(
    country=None, province=None,
    databases=["covid19dh", "japan", "owid"], directory="input", update_interval=24)
eng.clean()
eng.transform()
actual_df, status, _ = eng.subset(geo=None, variables="SIRF", complement=True)
print(status)
actual_df.tail()

How can I do that without downloading the files every time? In previous versions of covsirphy we used cs.DataLoader("input", update_interval=24).

Please use the keyword arguments: .download(directory="input", update_interval=24). Default values are directory="input" and update_interval=12.
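As a rough picture of what an update interval in hours usually means for cached files, here is a minimal sketch. The helper name needs_download and the file name cached.csv are hypothetical, and covsirphy's actual caching logic may differ. The idea is to download again only when the local file is missing or older than the interval.

```python
import os
import tempfile
import time
from pathlib import Path


def needs_download(path, update_interval_hours):
    """Hypothetical helper: True if the file is missing or older than the interval."""
    file = Path(path)
    if not file.exists():
        return True
    age_hours = (time.time() - file.stat().st_mtime) / 3600
    return age_hours > update_interval_hours


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "cached.csv"
    # No file yet, so a download would be needed.
    missing_result = needs_download(cached, 24)
    # A file saved just now is still fresh.
    cached.write_text("Date,Confirmed\n")
    fresh_result = needs_download(cached, 24)
    # Pretend the file was saved 30 hours ago.
    old = time.time() - 30 * 3600
    os.utime(cached, (old, old))
    stale_result = needs_download(cached, 24)

print(missing_result, fresh_result, stale_result)
```

Under this assumption, update_interval=24 would reuse files in the "input" folder for a day before fetching them again.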

FYI, please use .download(databases=["covid19dh", "japan", "owid"]). Refer to https://github.com/lisphilar/covid19-sir/issues/1223 and https://github.com/lisphilar/covid19-sir/issues/1224.

lisphilar / covid19-sir

[Fix] DataEngineer().subset(country=None) calculates total values without filling NAs #1222