Closed YiranJing closed 4 years ago
Qinghai, Xinjiang, Inner Mongolia, Gansu, Chongqing, Hainan, Beijing, and Shanghai show a similar problem.
Thanks for spotting this. I found the problem: the CSV file provided by another GitHub project (https://github.com/BlankerL/DXY-2019-nCoV-Data) on Feb 3, 2020 contains many duplicated cityName values, such as "南阳" (Nanyang) and "南阳(含邓州)" (Nanyang, including Dengzhou), or "商丘" (Shangqiu) and "商丘(含永城)" (Shangqiu, including Yongcheng).
I will build a guard against this input ASAP. Stay tuned.
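One way such a guard could look is sketched below. This is only an illustration, not the package's actual fix: the helper `base_city_name`, the `cityBase` column, the `confirmed` column, and all numbers are made up for the example. It strips a trailing "(含…)" suffix and keeps one row per province, base city, and date before summing.

```python
import re
import pandas as pd

# Matches a trailing "(含...)" suffix, with ASCII or full-width parentheses.
_SUFFIX = re.compile(r"[((]含[^))]*[))]$")

def base_city_name(name: str) -> str:
    """Hypothetical helper: '南阳(含邓州)' -> '南阳'."""
    return _SUFFIX.sub("", name).strip()

# Toy data with invented counts; only the city names come from the thread.
df = pd.DataFrame({
    "provinceName": ["河南省"] * 4,
    "cityName": ["南阳", "南阳(含邓州)", "商丘", "商丘(含永城)"],
    "updateDate": ["2020-02-03"] * 4,
    "confirmed": [80, 83, 50, 51],
})

df["cityBase"] = df["cityName"].map(base_city_name)

# Assumption: when both variants exist, keep the row with the larger count.
deduped = (
    df.sort_values("confirmed")
      .drop_duplicates(subset=["provinceName", "cityBase", "updateDate"], keep="last")
)

naive_total = df["confirmed"].sum()  # double counts both variants
province_total = deduped.groupby(["provinceName", "updateDate"])["confirmed"].sum()
```

Keeping the larger count is a judgment call; if the "(含…)" row is known to be a superset of the plain row, preferring it explicitly would be safer.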
Hey Jian,
Thanks for sharing this package. I also noticed a similar issue: the cured count drops on some days for some provinces. The problem is that cityName is not a unique ID, especially for values like '待明确地区' (region to be determined), which is shared by many provinces. The fix would be to include provinceName in the groupby below:
for key, frm in df.drop(columns=drop_cols).sort_values(['updateDate']).groupby(['cityName', 'updateDate']):
Thanks, Xinkai
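A minimal sketch of the difference, assuming the column names used in the thread (provinceName, cityName, updateDate) plus a made-up `cured` column with invented values. Grouping by cityName alone merges rows from different provinces that share a placeholder name, while adding provinceName keeps them apart.

```python
import pandas as pd

# Toy data: two provinces share the same placeholder cityName.
df = pd.DataFrame({
    "provinceName": ["河南省", "湖北省", "河南省", "湖北省"],
    "cityName": ["待明确地区"] * 4,
    "updateDate": ["2020-02-03", "2020-02-03", "2020-02-04", "2020-02-04"],
    "cured": [5, 40, 7, 55],
})

# Grouping without provinceName: rows from both provinces land in one group.
by_city = df.groupby(["cityName", "updateDate"])["cured"].max()

# Grouping with provinceName: one group per province, city, and date.
by_province_city = df.groupby(["provinceName", "cityName", "updateDate"])["cured"].max()

print(len(by_city), len(by_province_city))
```

With the province in the key, the Henan series can no longer pick up values that belong to Hubei, which is the kind of mixing that makes a cumulative count appear to drop.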
Actually, summing over cities to compute province-level data needs to be handled better. For example, Shanghai has cityName values 外地来沪人员 (people arriving from outside Shanghai), 待明确地区 (region to be determined), and 未知地区 (unknown region), and these probably overlap. The city total doesn't add up to the province-level figure.
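A quick consistency check could flag provinces where this happens. The sketch below is an assumption-laden illustration: the column names `cityConfirmed` and `provinceConfirmed` and every number are invented, and the real data may name these fields differently.

```python
import pandas as pd

# Invented city-level counts for one date.
city = pd.DataFrame({
    "provinceName": ["上海市"] * 3 + ["北京市"] * 2,
    "cityName": ["外地来沪人员", "待明确地区", "浦东新区", "朝阳区", "海淀区"],
    "cityConfirmed": [60, 30, 40, 20, 25],
})

# Invented province-level counts reported for the same date.
province = pd.DataFrame({
    "provinceName": ["上海市", "北京市"],
    "provinceConfirmed": [110, 45],
})

# Sum cities per province and compare against the reported province figure.
city_sum = city.groupby("provinceName", as_index=False)["cityConfirmed"].sum()
check = province.merge(city_sum, on="provinceName", how="left")
check["gap"] = check["cityConfirmed"] - check["provinceConfirmed"]

# Provinces whose city rows do not add up to the province-level number.
mismatched = check[check["gap"] != 0]
print(mismatched)
```

A positive gap suggests overlapping categories among the city rows; a negative gap suggests cases not yet assigned to any city.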
The problem of double counting "南阳" and "南阳(含邓州)" in province aggregation is fixed. Please double check.
I will take a look at the 未知地区 problem.
“未知地区” (unknown region) is no longer shared across provinces. Thanks for pointing it out.
Hi @jianxu305,
First, thanks for your package!
I use your function below to generate daily data from Ding Xiang Yuan (DXY).
In the plot for Henan province, the confirmed cases decrease on 2020-02-04. Is there an error in the function utils.aggDaily? Thanks for the help :)