jianxu305 / nCov2019_analysis

Analysis of 2019-nCov coronavirus data
GNU General Public License v3.0

Calculation of cum_confirmed and new_confirmed #7

Open minxueric opened 4 years ago

minxueric commented 4 years ago

In coronavirus_demo_colab.ipynb, the example that computes the cum_confirmed and new_confirmed numbers is:

x = daily_frm[daily_frm['province_name_en'] == 'Hubei'].groupby('update_date').agg('sum')

The notebook shows only the last 7 rows, which look fine.

However, if you show all the rows, you will find an inconsistency between the cum_confirmed and new_confirmed columns: the relation cum_confirmed[t] = cum_confirmed[t-1] + new_confirmed[t] does not hold for the first few rows.
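A minimal sketch of how the relation could be checked on every row, rather than by eye on the last 7. The helper name `find_inconsistent_dates` and the numbers in `toy` are made up for illustration; they are not from the repository or the real Hubei data.

```python
import pandas as pd

def find_inconsistent_dates(frame, cum_col="cum_confirmed", new_col="new_confirmed"):
    """Return the rows where cum[t] != cum[t-1] + new[t] (hypothetical helper)."""
    expected = frame[cum_col].shift(1) + frame[new_col]
    mismatch = frame[cum_col] != expected
    # The first row has no previous day, so the relation cannot be tested there.
    mismatch.iloc[0] = False
    return frame[mismatch]

# Made-up numbers: 01-25 reports 3 new cases although the cumulative count rose by 5.
toy = pd.DataFrame(
    {"cum_confirmed": [10, 15, 25, 30], "new_confirmed": [10, 3, 10, 5]},
    index=pd.to_datetime(["2020-01-24", "2020-01-25", "2020-01-26", "2020-01-27"]),
)
bad = find_inconsistent_dates(toy)
print(bad)
```

With the real data, the same helper could presumably be applied to the aggregated frame `x` from the notebook, assuming it keeps the column names used here.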

Could you please help investigate this problem?

jianxu305 commented 4 years ago

The relation should not hold for the first row, because the data set only starts on 1/24. That does NOT mean 1/23 had zero confirmed or dead cases. Therefore new_confirmed, new_dead, etc. cannot be computed for 1/24, and the 1/24 values of those columns should not be used.

minxueric commented 4 years ago

Hi @jianxu305, thanks for your quick response! Yes, I totally agree that new_confirmed/cured/dead is meaningless on 01/24, because we don't have cum_confirmed/cured/dead for 01/23 to take the difference against.

However, the relation should hold exactly after 01/24. The key problem is that it does not hold for 01/25-01/28. Besides, if you widen the selection from daily_frm to a larger set, e.g., the whole nation, the problem becomes more serious.

Did I make it clear?

jianxu305 commented 4 years ago

Thanks for finding this out. The problem is that some cities don't have data until later dates. Pandas 'diff' always leaves the first row of each group empty (NaN), so the aggregated "new" cases are off whenever a new city appears. I have fixed this problem and added a doctest. Please pull.
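The effect can be reproduced on a toy frame. This is only an illustration with made-up numbers and a hypothetical `city` column; it is not the repository's code.

```python
import pandas as pd

# Made-up data: city B has no rows until 2020-01-26.
toy = pd.DataFrame({
    "update_date": pd.to_datetime(
        ["2020-01-24", "2020-01-25", "2020-01-26", "2020-01-26"]),
    "city": ["A", "A", "A", "B"],
    "cum_confirmed": [10, 15, 20, 7],
})

toy = toy.sort_values(["city", "update_date"])
# diff() within each city gives NaN on that city's first row,
# so B's 7 cases never show up as "new" cases.
toy["new_confirmed"] = toy.groupby("city")["cum_confirmed"].diff()

daily = toy.groupby("update_date")[["cum_confirmed", "new_confirmed"]].sum()
print(daily)
```

On 2020-01-26 the aggregated cum_confirmed rises from 15 to 27, but the summed new_confirmed is only 5, because B's first row contributes NaN (treated as 0 by `sum`).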

If you feel the problem is solved, please close this issue. Thanks.

minxueric commented 4 years ago

Thank you for this helpful solution. You are right: for cities whose cases start on a later date, we should add a 0 before the first case date when calculating new_confirmed/cured/dead. This really helps alleviate the problem.
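One way to sketch the zero-padding idea is to pivot to a date-by-city grid and fill the missing cells with 0 before differencing. This is an assumption about how such a fix might look, not the actual commit; the `city` column and the numbers are made up.

```python
import pandas as pd

# Made-up data: city B has no rows until 2020-01-26.
toy = pd.DataFrame({
    "update_date": pd.to_datetime(
        ["2020-01-24", "2020-01-25", "2020-01-26", "2020-01-26"]),
    "city": ["A", "A", "A", "B"],
    "cum_confirmed": [10, 15, 20, 7],
})

# One row per date, one column per city; dates before a city's first
# report become 0 instead of missing.
grid = toy.pivot(index="update_date", columns="city", values="cum_confirmed").fillna(0)

# Per-city daily increase; the first date stays NaN because nothing precedes it.
new_by_city = grid.diff()

daily = pd.DataFrame({
    "cum_confirmed": grid.sum(axis=1),
    # min_count=1 keeps the first date as NaN instead of a misleading 0.
    "new_confirmed": new_by_city.sum(axis=1, min_count=1),
})
print(daily)
```

Note that `fillna(0)` also zero-fills gaps after a city's first report, so this sketch only suits cities that appear late and then keep reporting.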

However, as I mentioned in my earlier comment, if we widen the selection from daily_frm to a larger range, e.g., the whole nation, the problem remains unsolved. For example, if I run

x = daily_frm.groupby('update_date').agg('sum')
x

the first several rows are still inconsistent between the columns.

Do you know why this happens?

jianxu305 commented 4 years ago

Thank you very much for finding the problem. This happens because some "cities" appear on some days and then disappear on later days. For example, "不明地区" ("unknown area") in Beijing appears only on 1/24 and on no later date, so the 8 confirmed cases in this category cannot be diffed correctly.
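A toy reproduction of the disappearing-category case, together with one possible workaround: difference the cumulative count after aggregating instead of per city. Only the 8 cases on 1/24 come from the comment above; "District X", its numbers, and the `city` column are made up, and whether aggregate-then-diff is the right fix depends on how the source data reassigns such cases.

```python
import pandas as pd

# "Unknown area" reports 8 cases on 01-24 only; District X is hypothetical.
toy = pd.DataFrame({
    "update_date": pd.to_datetime(
        ["2020-01-24", "2020-01-24", "2020-01-25", "2020-01-26"]),
    "city": ["Unknown area", "District X", "District X", "District X"],
    "cum_confirmed": [8, 20, 35, 40],
})

grid = toy.pivot(index="update_date", columns="city", values="cum_confirmed")

# Aggregated cumulative count per date (missing cells are skipped).
cum_total = grid.sum(axis=1)

# Per-city diff, then sum: the vanished category never contributes a change,
# so the result disagrees with the change in cum_total.
new_per_city_sum = grid.diff().sum(axis=1, min_count=1)

# Aggregate first, then diff: consistent with cum_total by construction.
new_from_total = cum_total.diff()

print(pd.DataFrame({
    "cum_confirmed": cum_total,
    "new_per_city_sum": new_per_city_sum,
    "new_from_total": new_from_total,
}))
```

The trade-off is that aggregate-then-diff can report a small or even negative increase on the day a category vanishes, which reflects the inconsistency in the source data rather than hiding it.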

If you can fix this, you are more than welcome to send me a pull request.

Otherwise, I will leave it as it is for now, and revisit this problem when I have time. Thanks.