error when aggregate daily records

jianxu305 / nCov2019_analysis

Analysis of 2019-nCov coronavirus data

GNU General Public License v3.0


error when aggregate daily records #3

Closed. YiranJing closed this issue 4 years ago

YiranJing commented 4 years ago

Hi @jianxu305,

First of all, thanks for your package!

I use the functions below to generate daily data from Ding Xiang Yuan (DXY).

DXYArea = utils.load_chinese_data() # Query latest Regional Data from DXY
daily_frm_DXYArea = utils.aggDaily(DXYArea) # generate daily data

In the plot for Henan province, the confirmed case count decreases on 2020-02-04. Is there an error in the function utils.aggDaily?

Thanks for the help :)
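A drop like the one described above can also be located without a plot. The following is a minimal sketch, not code from the package: the helper name `find_decreases` is hypothetical, and the column names `provinceName`, `updateDate` and `confirmed` are assumptions about what the aggregated frame contains.

```python
import pandas as pd

def find_decreases(daily, value_col="confirmed",
                   group_col="provinceName", date_col="updateDate"):
    """Return the rows where value_col is lower than on the previous date
    within the same group. Column names are assumptions, not the package's API."""
    ordered = daily.sort_values([group_col, date_col])
    # Day-over-day change inside each group; the first row of a group is NaN.
    change = ordered.groupby(group_col)[value_col].diff()
    return ordered[change < 0]

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "provinceName": ["Henan", "Henan", "Henan", "Hubei", "Hubei"],
    "updateDate": ["2020-02-02", "2020-02-03", "2020-02-04",
                   "2020-02-03", "2020-02-04"],
    "confirmed": [10, 20, 15, 100, 120],
})
drops = find_decreases(toy)
print(drops)
```

Since the reported counts are meant to be cumulative, any row returned here points at a date worth inspecting in the source data.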

XsLee commented 4 years ago

Qinghai, Xinjiang, Inner Mongolia, Gansu, Chongqing, Hainan, Beijing, and Shanghai show similar behavior.

jianxu305 commented 4 years ago

Thanks for spotting this. I found the problem: the CSV file provided by another GitHub project (https://github.com/BlankerL/DXY-2019-nCoV-Data) on Feb 3, 2020 contains many duplicated cityNames, such as "南阳" and "南阳（含邓州）", or "商丘" and "商丘（含永城）".

I will build a guard against this input ASAP. Stay tuned.
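As an illustration of what such a guard could look like (this is a sketch, not the fix that was eventually committed), one option is to strip a trailing parenthetical from each cityName and flag cases where several spellings collapse to the same base name within one province and date. The helper names `base_city_name` and `find_variant_duplicates` and the derived column `baseCity` are hypothetical.

```python
import re
import pandas as pd

def base_city_name(name):
    """Drop a trailing parenthetical written with full-width or ASCII parentheses."""
    return re.sub(r"[（(][^（()）]*[)）]\s*$", "", name).strip()

def find_variant_duplicates(df):
    """Return rows whose cityName shares a base name with a differently
    spelled cityName in the same province on the same date."""
    keyed = df.assign(baseCity=df["cityName"].map(base_city_name))
    keys = ["provinceName", "baseCity", "updateDate"]
    n_variants = keyed.groupby(keys)["cityName"].transform("nunique")
    return keyed[n_variants > 1]

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "provinceName": ["河南省", "河南省", "河南省"],
    "cityName": ["南阳", "南阳（含邓州）", "郑州"],
    "updateDate": ["2020-02-03"] * 3,
})
suspects = find_variant_duplicates(toy)
print(suspects)
```

Whether the flagged rows should be merged or one variant dropped depends on how the upstream data defines them, so the sketch only reports them.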

xinkaifu commented 4 years ago

Hey Jian,

Thanks for sharing this package. I also noticed a similar issue: the cured count drops on some days for some provinces. The problem is that cityName is not a unique ID, especially for values like '待明确地区' ("area to be determined"), which is shared by many provinces. The fix would be to include provinceName in the groupby below:

for key, frm in df.drop(columns=drop_cols).sort_values(['updateDate']).groupby(['cityName', 'updateDate']):

Thanks, Xinkai
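To make the key collision concrete, here is a small sketch on made-up data (it does not call the package). It groups the same rows once by cityName and updateDate only, and once with provinceName added to the key; the `cured` column and all values are assumptions for illustration.

```python
import pandas as pd

# Toy data, made up for illustration: the same placeholder cityName
# appears under two different provinces on the same date.
df = pd.DataFrame({
    "provinceName": ["河南省", "上海市", "河南省"],
    "cityName": ["待明确地区", "待明确地区", "南阳"],
    "updateDate": ["2020-02-04"] * 3,
    "cured": [3, 5, 7],
})

# Grouping by cityName alone puts rows from both provinces in one group.
by_city = df.groupby(["cityName", "updateDate"]).size()

# Adding provinceName to the key keeps the provinces apart.
by_province_city = df.groupby(["provinceName", "cityName", "updateDate"]).size()

print(by_city)
print(by_province_city)
```

With the two-column key, the '待明确地区' group holds two rows from different provinces; with the three-column key every group holds exactly one row.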

xinkaifu commented 4 years ago

Actually, summing city-level rows to compute province-level data needs to be handled better. For example, Shanghai has cityName values 外地来沪人员 ("people from elsewhere in Shanghai"), 待明确地区 ("area to be determined"), and 未知地区 ("unknown area"), and these probably overlap. Their total doesn't add up to the province-level figure.
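A quick consistency check along these lines is to sum the city rows per province and date and compare the result with the province-level figure reported alongside them. This is a sketch on made-up numbers; the column names `cityConfirmed` and `provinceConfirmed` are hypothetical and would need to be mapped to whatever the real frame uses.

```python
import pandas as pd

# Toy numbers, made up for illustration only.
df = pd.DataFrame({
    "provinceName": ["上海市"] * 3,
    "cityName": ["外地来沪人员", "待明确地区", "未知地区"],
    "updateDate": ["2020-02-04"] * 3,
    "cityConfirmed": [80, 30, 30],      # hypothetical city-level counts
    "provinceConfirmed": [120] * 3,     # hypothetical province-level count
})

keys = ["provinceName", "updateDate"]
check = df.groupby(keys).agg(
    citySum=("cityConfirmed", "sum"),
    provinceReported=("provinceConfirmed", "max"),
).reset_index()

# A non-zero gap means the city rows do not add up to the province figure.
check["gap"] = check["citySum"] - check["provinceReported"]
print(check[check["gap"] != 0])
```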

jianxu305 commented 4 years ago

The problem of double counting "南阳" and "南阳（含邓州）" in province aggregation is fixed. Please double check.

I will take a look at the 未知地区 problem.

jianxu305 commented 4 years ago

"未知地区" is no longer shared by different provinces. Thanks for pointing it out.