lisphilar / covid19-sir

CovsirPhy: Python library for COVID-19 analysis with phase-dependent SIR-derived ODE models.
https://lisphilar.github.io/covid19-sir/
Apache License 2.0

Calculate recovery period with linelist to complement recovery data in China during initial phase (Jan-Mar) #357

Closed Inglezos closed 3 years ago

Inglezos commented 3 years ago

Summary of question

I wonder whether we can run a substantial analysis for the initial major pandemic wave in China and its provinces during February-May, i.e. analyze the data and fit them with the SIR-F model. Is that possible, considering that the initial phase is problematic during analysis and has a large error? Could you share the code for such a province-level analysis for that period?

lisphilar commented 3 years ago

Sorry, could you please explain your question in a different way? I could not understand it. Please share the code and outputs that led to the question.

Inglezos commented 3 years ago
lisphilar commented 3 years ago

First of all, subset() fails to load correct recovered data for China. Please run this country analysis to double-check this with me.

JHUData.subset("China") raises ValueError, but JHUData.subset_complement("China") returns records. This is because recovered data is not registered in the raw dataset, just as with the Netherlands (#334). https://gist.github.com/lisphilar/e5e61a8a9b5acab62bc635d46d200323

But the first phase probably lasts only until late February, so the time window of that pandemic wave's data is too narrow to run a substantial analysis. Is this correct?

What is the problem, and why do you think so? Please share the code or figures so we can discuss them.

Note for the first one: a long phase does not mean a useless analysis. In the initial phase, parameter values could be stable because we did not know how to prevent the outbreak. A long phase only means that the parameter values did not change.

Note for the second one: to cover the weak point of conventional SIR model analysis, which fixes parameter values over a long period, I introduced "phases". Phase-dependent SIR analysis allows parameter values to change.

Inglezos commented 3 years ago

This is because recovered data is not registered in the raw dataset as well as Netherlands

Oh yes, you're right, the actual recovered data are missing for China as well.

The problem for China is that the analysis/simulation starts after 23 Feb 2020, because the initial phase is skipped. The thing is that the first month was very critical for the China pandemic, and if I want to fit the first month (10 Jan - 28 Feb) to the model, I cannot because of that. Is this correct?

What I want to do for China's provinces is something like this (see plot images) arxiv.org/pdf/2002.07572.pdf https://github.com/benmaier/COVID19CaseNumberModel

Can I run a full analysis for these provinces, for example (i) for Jan-Feb and (ii) for after February? Provinces are supported, right?

Also, I notice that for the initial phase in China, Rt = 0.22. How is this possible, since the pandemic began there and such chaos was created by the high contagiousness of the virus? I would expect Rt to be at least 2.5-4 in the beginning, when no measures were in effect.

On the other hand, it now says Rt = 1.5, which seems impossible, considering that the daily new cases in China nowadays are ~10-20. How can that be valid and representative of the current situation? Does it perhaps mean something else?

Also, I see from the graphs comparing actual and simulated cases that they differ only a little after the first phases. Does this mean that the model is very good, or that we have something like over-fitting of the data to the model?

lisphilar commented 3 years ago

The problem for China is that the analysis/simulation starts after 23 Feb 2020, because the initial phase is skipped. The thing is that the first month was very critical for the China pandemic, and if I want to fit the first month (10 Jan - 28 Feb) to the model, I cannot because of that. Is this correct?

It may be necessary to revise the complement algorithm. The first recovered date in China is estimated as 23 Feb 2020 for the following reasons.

We can

Do you have time to investigate the JHUData._complement_recovered_full() method?
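For context, the shift-based complement that this method implements can be sketched with plain pandas; the numbers below are toy values, not real data, and the column names are illustrative rather than CovsirPhy's internals:

```python
import pandas as pd

# Toy cumulative series (hypothetical numbers).
dates = pd.date_range("2020-01-20", periods=10, freq="D")
df = pd.DataFrame({
    "Confirmed": [10, 30, 60, 100, 200, 350, 500, 700, 900, 1100],
    "Fatal":     [0, 1, 2, 4, 8, 15, 25, 35, 45, 55],
}, index=dates)

closing_period = 5  # days; the library estimates this value from data

# Recovered on day t is approximated by the confirmed cases registered
# `closing_period` days earlier, minus cumulative deaths.
shifted = df["Confirmed"].shift(periods=closing_period, freq="D")
recovered = (shifted - df["Fatal"]).dropna()

# Two artifacts follow directly from the shift: the first `closing_period`
# days have no estimate at all (hence a late start date), and early
# estimates can even be negative (10 - 15 = -5 on the first usable day),
# which the current code rejects.
print(recovered.index.min(), recovered.iloc[0])
```

This is why a long closing period pushes the first usable recovered date well past the true start of the outbreak.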

Can I run a full analysis for these provinces, for example (i) for Jan-Feb and (ii) for after February? Provinces are supported, right?

Yes, provinces are supported, e.g. cs.JHUData.subset(country="China", province="Hubei") and cs.Scenario(jhu_data, population_data, country="China", province="Hubei"). The country name is required, but the province name is optional in these methods.

https://gist.github.com/lisphilar/f1ec31e4a5078846612a4f6ce29ee1d5 (With the Hubei data, estimation failed in the last phases because the number of infected cases is small. This should be fixed in a new issue.)

Also I notice that for the initial phase in China, the Rt = 0.22.....Does this mean that the model is very good or that we have something like over-fitting of the data to the model?

The root cause would be the complement algorithm.

Inglezos commented 3 years ago

When will those linelist data be utilized? I thought we would use them for generating the missing recovered cases.

I checked the _complement_recovered_full() method as well as calculate_closing_period(). The only improvement I tried was to use, inside calculate_closing_period(), the std() or quantile() instead of the mode value for the closing day, together with a 3-day rolling mean of the confirmed cases:

    df[self.C] = df[self.C].rolling(window=3).mean()
    return df["Elapsed"].std().astype(np.int64)

But for the China analysis this unfortunately results in a low infected-cases peak and high recovered numbers; it is just as if every new confirmed case goes directly to the recovered compartment.
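As a side note, the 3-day rolling mean mentioned here behaves as below on a toy cumulative series (made-up numbers); the first window-minus-one values become NaN, which itself can shift downstream statistics:

```python
import pandas as pd

# Toy cumulative confirmed series (hypothetical numbers).
confirmed = pd.Series([10, 30, 60, 100, 200, 350])

# A trailing 3-day window: each value is the mean of the current day
# and the two preceding days; the first two entries become NaN.
smoothed = confirmed.rolling(window=3).mean()

print(smoothed.tolist())
```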

***See update below.

Inglezos commented 3 years ago

[UPDATE:] I tried something else (without any of the changes mentioned in my previous comment). In _complement_recovered_full(), the current code rejects all negative calculated recovered values. What if we used their absolute values and sorted them? I modified the code as follows:

        # Closing period
        self._closing_period = self._closing_period or self.calculate_closing_period()
        # Estimate recovered records by shifting confirmed cases
        shifted = df[self.C].shift(periods=self._closing_period, freq="D")
        df[self.R] = shifted - df[self.F]
        # Keep negative estimates as absolute values instead of rejecting them
        df.loc[df[self.R] < 0, self.R] *= -1
        # Assign rather than interpolate in place, to avoid chained-assignment issues
        df[self.R] = df[self.R].interpolate(method="spline", order=1)
        # Sort so the cumulative series increases monotonically
        df.loc[:, self.R] = sorted(df[self.R])
        df[self.R] = df[self.R].fillna(0).astype(np.int64)

The result of this for China is that the start date becomes 28 Jan 2020, which is very good, and the simulation also gives good results. The only problem is some spikes and occasional non-monotonic increases of the simulated fatal and recovered cases. This is a problem we have already discussed in another issue.
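The effect of the modification above (absolute values plus sorting) can be illustrated on a toy estimated-recovered series; the numbers are made up for the demonstration:

```python
import pandas as pd

# Toy estimated-recovered values, with a negative early estimate and a dip.
est = pd.Series([-5, 5, 25, 20, 55, 145], dtype=float)

# Current behaviour: negative estimates are rejected, shortening the series.
kept = est[est >= 0]

# Proposed behaviour: keep negatives as absolute values, then sort so the
# cumulative series becomes monotonically non-decreasing.
proposed = est.abs().sort_values().reset_index(drop=True)

print(proposed.tolist())  # [5.0, 5.0, 20.0, 25.0, 55.0, 145.0]
```

The proposed series keeps all six data points instead of five, which is what recovers the earlier start date.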

But what if we simply sort the simulated cases? Then it would be resolved. I mean putting the following code in scenario.simulate():

        tracker = self._create_tracker(name)
        try:
            sim_df = tracker.simulate(y0_dict=y0_dict)
            # Sort each cumulative series so it increases monotonically
            sim_df.loc[:, self.C] = sorted(sim_df[self.C])
            sim_df.loc[:, self.F] = sorted(sim_df[self.F])
            sim_df.loc[:, self.R] = sorted(sim_df[self.R])
            # Recompute infected from the sorted series
            sim_df[self.CI] = sim_df[self.C] - sim_df[self.F] - sim_df[self.R]

Generally, I think it would be a good idea to sort the complemented recovered cases (either partially or fully), as well as the simulated confirmed, fatal and recovered cases. I do not think it causes any problems, because it affects specific values which would exist anyway; it is as if we smoothly redistribute these values over the total simulated cases. Wouldn't that solve the non-monotonic spikes we talked about and smooth the cases? I am running multiple country analyses and simulations with these code modifications and observe good results in general. Please double-check this whenever you have some time.

lisphilar commented 3 years ago

Thank you for your detailed investigation; simulation with a new argument sorted=True would improve the simulated number of cases greatly.

However, sorting in Scenario.simulate() may mask the root cause, inaccurate parameter estimation in some phases, without solving it. Sorting will move simulated numbers of cases across phases. The RMSLE scores of Scenario.score() and the figures of Scenario.simulate(), Scenario.history("Infected") etc. will be greatly improved, but the parameter values will still be inaccurate in some phases. The number of cases is the only observable variable, and the parameter values have physical meanings in SIR-derived models.

On the other hand, we can use sorting in JHUData._complement_recovered_full() because it is free from the ODE models. The purpose of this method is to reflect the real situation in the datasets. df.loc[df[self.R] < 0, self.R] *= -1 (or df[self.R] = df[self.R].abs()) and df[self.R] = sorted(df[self.R]) (or df[self.R].sort_values(inplace=True)) will be effective solutions.
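A quick check that the paired idioms above are indeed interchangeable, and that sorting only reorders existing values (toy numbers):

```python
import pandas as pd

s = pd.Series([3, -2, 7, -1, 4])

# The two negative-value fixes give identical results:
a = s.copy()
a.loc[a < 0] *= -1
b = s.abs()
assert a.equals(b)

# Sorting preserves the multiset of values while enforcing monotonicity:
sorted_s = pd.Series(sorted(b))
print(a.tolist(), sorted_s.tolist())  # [3, 2, 7, 1, 4] [1, 2, 3, 4, 7]
```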

Those linelist data when will be utilized? I thought that we would use them for generating the missing recovered cases.

We can use the linelist data as follows (#281):

linelist = data_loader.linelist()
all_df = linelist.cleaned()
chn_df = all_df.loc[all_df["Country"] == "China"]
Inglezos commented 3 years ago

Yes, I agree that sorting in Scenario.simulate() might camouflage the root cause; we need to eliminate the underlying problem. I will soon make a pull request for sorting the recovered cases (in the full complement as well as in subset_complement).

Regarding the linelist, I thought there was already a method using these linelist data to extract approximated recovered data, similar to the complement method but more accurate.

lisphilar commented 3 years ago

Thank you.

#360 and #362 will be closed so that we use the linelist data like this:

linelist = data_loader.linelist()
df = linelist.closed("China", outcome="Recovered")
Inglezos commented 3 years ago

So with this new method we will calculate the closing period in days from the linelist data, per country?

lisphilar commented 3 years ago

We could use the linelist data, but only 36 records were found with linelist.closed("China", outcome="Recovered"). It will be difficult to use them for complement.

Confirmation_date Recovered_date
2020/1/21 2020/1/24
2020/1/21 2020/1/29
2020/1/24 2020/2/3
2020/1/25 2020/2/12
2020/1/26 2020/2/5
2020/1/26 2020/2/5
2020/1/26 2020/2/5
2020/1/27 2020/2/7
2020/1/31 2020/2/10
2020/2/2 2020/2/14
2020/2/5 2020/2/13
2020/2/4 2020/2/13
2020/2/4 2020/2/18
2020/2/8 2020/2/17
2020/2/3 2020/2/22
2020/1/25 2020/2/2
2020/1/24 2020/2/2
2020/1/24 2020/1/28
2020/2/3 2020/2/22
2020/1/25 2020/2/2
2020/1/24 2020/2/2
2020/1/26 2020/2/5
2020/1/26 2020/2/5
2020/1/26 2020/2/5
2020/1/24 2020/1/28
2020/1/21 2020/1/24
2020/1/27 2020/2/7
2020/1/24 2020/2/3
2020/1/31 2020/2/10
2020/2/2 2020/2/14
2020/1/21 2020/1/29
2020/2/5 2020/2/13
2020/2/4 2020/2/13
2020/2/4 2020/2/18
2020/1/25 2020/2/12
2020/2/8 2020/2/17
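For reference, the mode and median of the elapsed period in the 36 records above can be computed directly; the dates are copied from the table:

```python
import pandas as pd

# The 36 (Confirmation_date, Recovered_date) pairs listed above.
pairs = [
    ("2020/1/21", "2020/1/24"), ("2020/1/21", "2020/1/29"),
    ("2020/1/24", "2020/2/3"), ("2020/1/25", "2020/2/12"),
    ("2020/1/26", "2020/2/5"), ("2020/1/26", "2020/2/5"),
    ("2020/1/26", "2020/2/5"), ("2020/1/27", "2020/2/7"),
    ("2020/1/31", "2020/2/10"), ("2020/2/2", "2020/2/14"),
    ("2020/2/5", "2020/2/13"), ("2020/2/4", "2020/2/13"),
    ("2020/2/4", "2020/2/18"), ("2020/2/8", "2020/2/17"),
    ("2020/2/3", "2020/2/22"), ("2020/1/25", "2020/2/2"),
    ("2020/1/24", "2020/2/2"), ("2020/1/24", "2020/1/28"),
    ("2020/2/3", "2020/2/22"), ("2020/1/25", "2020/2/2"),
    ("2020/1/24", "2020/2/2"), ("2020/1/26", "2020/2/5"),
    ("2020/1/26", "2020/2/5"), ("2020/1/26", "2020/2/5"),
    ("2020/1/24", "2020/1/28"), ("2020/1/21", "2020/1/24"),
    ("2020/1/27", "2020/2/7"), ("2020/1/24", "2020/2/3"),
    ("2020/1/31", "2020/2/10"), ("2020/2/2", "2020/2/14"),
    ("2020/1/21", "2020/1/29"), ("2020/2/5", "2020/2/13"),
    ("2020/2/4", "2020/2/13"), ("2020/2/4", "2020/2/18"),
    ("2020/1/25", "2020/2/12"), ("2020/2/8", "2020/2/17"),
]
df = pd.DataFrame(pairs, columns=["Confirmation_date", "Recovered_date"])
for col in df.columns:
    df[col] = pd.to_datetime(df[col])
elapsed = (df["Recovered_date"] - df["Confirmation_date"]).dt.days

print(elapsed.mode().iloc[0])  # 10 (most frequent elapsed period, in days)
print(elapsed.median())        # 10.0
```

Both the mode and the median of these 36 China records come out at 10 days.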
Inglezos commented 3 years ago

Yes, I think the mode is around 10 days. I edited my previous comment: the complement method shall be kept, but the closing period will be derived from the linelist data?

lisphilar commented 3 years ago

Because the number of records is small, it is not recommended to use the linelist data for complement.

With my notebook created on 19 Nov 2020, I found that the distribution of the closing period was bimodal (10 days and 30 days). https://gist.github.com/lisphilar/9c5f0a96c88a27be0c3e085ced89a2b1


I do not have time for the next few days, but it would be good if we had reasons to use 10 days as the representative value of the closing period. I wonder whether the closing period in the first months of the outbreak differs from the one in these days.

Inglezos commented 3 years ago

Hmm, could this mean one period for the fatal closing date and one for the actual recovery date? During the closing-period calculation we currently use the sum of fatal and recovered. Maybe that causes the bimodal behavior, since we would expect fatal and recovered cases to have different closing periods. Nevertheless, 30 days would be prohibitive to use in the complement method, since it ignores a complete month of analysis. And as we saw, for China at least, that first month is crucial.

How about using the linelist data for all countries instead of a specific one? That is how the density plot was made in the first place, over a global analysis, not for a specific country, right? Maybe it would then be preferable to use the linelist data of all countries to extract a single closing period for our complement method.

lisphilar commented 3 years ago

Thank you; we can use "Confirmed minus Fatal" and "Recovered" rather than "Confirmed" and "Recovered plus Fatal".
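A sketch of this pairing on toy cumulative series (the numbers and the helper function are illustrative only, not CovsirPhy's actual implementation): for each date, find how many days earlier "Confirmed minus Fatal" reached today's "Recovered" level.

```python
import pandas as pd

# Toy cumulative data (hypothetical numbers).
dates = pd.date_range("2020-02-01", periods=8, freq="D")
df = pd.DataFrame({
    "Confirmed": [100, 150, 220, 300, 400, 520, 650, 800],
    "Fatal":     [2, 4, 7, 10, 14, 19, 25, 32],
    "Recovered": [0, 0, 90, 140, 205, 280, 375, 485],
}, index=dates)

df["C_minus_F"] = df["Confirmed"] - df["Fatal"]

def elapsed_days(date: pd.Timestamp) -> int:
    """Days since C - F first reached today's Recovered level."""
    level = df.at[date, "Recovered"]
    reached = df.index[df["C_minus_F"] >= level]
    return (date - reached[0]).days

# Estimate one elapsed period per date with nonzero recovered cases;
# a representative value (mode/median) would then be the closing period.
periods = [elapsed_days(d) for d in df.index[df["Recovered"] > 0]]
print(periods)
```

The toy data are constructed so every date gives the same answer; on real data the list would be a distribution whose mode or median is taken.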

Inglezos commented 3 years ago

I don't think this solution leads to a better result either, though; it is more or less the same.

Maybe for calculating the closing period it would be better to find an intermediate solution between the current pivot-table method and the linelist data analysis. The final closing period could then be a mean value or something like that.

Perhaps the linelist data are more accurate than the publicly reported recovered cases per country, because they describe specific patients' histories.

Maybe the recovered cases are reported with some delay, a latency between the actual and reported recovery dates caused by the 14-day quarantine/confirmation period, which could camouflage the real recovery period between symptom onset/positive test and actual recovery. For example, an individual could be infected on day 0, test positive on day 4, develop symptoms on day 7, feel better on day 16 (no fever or coughing), and recover completely on day 20 (not a single symptom besides a little fatigue, maybe). Normally this example's recovery period would be 16 - 4 = 12 days, or 20 - 4 = 16 days, so 12 or 14 days is acceptable as the closing period in our analysis. But sometimes the officially reported recovery could come several days (say, an additional 10 days on average) after the symptoms have disappeared; that would mean 20 + 10 = 30 and thus 30 - 4 = 26 days, which is very close to the currently calculated mode value.

Another cause of such delay could be the interval between the tests that confirm that someone has recovered; maybe they have to wait a few days before they can go to a hospital, get the test, and then wait for the results.

Actually, the recovery date is a little deceiving, because the actual virus cycle in the human body depends on many factors, such as age, patient history and symptom severity. Especially when a person ends up in the ICU, the recovery period is artificially prolonged; otherwise, this would probably have become a fatal case in a shorter time. So maybe these long recovery periods of 30+ days indicate case severity and older age, with hospital assistance, and I don't know whether the closing period of these data is representative of the general recovery period. I would imagine the recovery period as follows: a person gets sick, becomes aware of it after a few days, stays at home, has symptoms for up to perhaps 10 days, and then gets better and recovers, surely in less than 20 days. It seems illogical to me for recovery to last longer, let alone 36+ days sometimes! Either we do something wrong in the calculations inside calculate_closing_period(), or there is indeed such a latency as I describe. If such a latency actually exists, we could simply subtract a predefined value (e.g. 10 days) from only the large values, those over 3 weeks.

Moreover, another thought is that these long recovery periods (over 3 weeks, I mean) correspond perhaps to secondary infections and a second virus cycle from the beginning.

This is not an easy topic. Of course, first we need to answer: what is actually considered a recovered case?

[UPDATE:] Of course, it now occurs to me that there could indeed exist two recovery periods simultaneously, depending on age, symptom severity and patient history, and I think this is generally known. Asymptomatic individuals, healthy young people and those with mild symptoms tend to have a shorter recovery period, around the first mode (9-12 days), while severe cases, older people and those with pre-existing illnesses usually have longer and more difficult recovery periods (over 3 weeks).

lisphilar commented 3 years ago

#364 was merged to sort the recovered values.

lisphilar commented 3 years ago

Note: with the linelist data, the recovery and fatal periods (not the closing period) were as follows, each with a unimodal distribution.

(Median) Recovery period: 12 days; fatal period: 4 days

from matplotlib import pyplot as plt
import covsirphy as cs
data_loader = cs.DataLoader()
linelist = data_loader.linelist()
# Recovery period
r_df = linelist.closed(outcome="Recovered")
r_diff_series = (r_df["Recovered_date"] - r_df["Confirmation_date"]).dt.days
r_diff_series.plot.kde()
plt.show()
r_diff_series.median()

https://gist.github.com/lisphilar/e41f58f7f6c9445a8523b6029bf202c6

Inglezos commented 3 years ago

Yes, I think 12 days is more representative of the recovery period for mild-symptom cases. What are your thoughts on my detailed comment above from one week ago? I think there are two recovery periods, but I don't know how we can differentiate between them and which one to choose. Also, we have to decide whether we should use the linelist data in the closing-period calculation.

lisphilar commented 3 years ago

With this analysis, we can add two arguments, recovery_period=12 and fatal_period=4, to JHUData._complement_recovered_full() and revise the code of this method.

Inglezos commented 3 years ago

Yes, the linelist data should be used for the closing period, but not as a hardcoded value; it should be calculated on the fly, since 12 is dynamic. Is the fatal period useful?

lisphilar commented 3 years ago

Sorry, fatal_period is unnecessary for complement.

We can use the linelist data for complement, but we would need to force users to download the linelist data along with DataLoader.jhu(). This may confuse users, and we may need to document it in the stdout of DataLoader.jhu().

Inglezos commented 3 years ago

Could we do this internally, without extra user action? Otherwise, yes, we should document this as a prerequisite.

lisphilar commented 3 years ago

With #384,

lisphilar commented 3 years ago

Can we close this issue?

Inglezos commented 3 years ago

Yes yes we can close this.

Inglezos commented 3 years ago

One question: calculate_closing_period() is actually no longer used in the calculations, right? Since self._recovery_period is always valid.

On second thought, I think it is a good idea to use self._recovery_period combined with calculate_closing_period() for the final recovery-period value. That is because the valid linelist data are too few (only 277 values) compared to the total cases, so they might not be very representative. That combination could mean, for example, averaging their mode values or their mean values, or something like that.
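A minimal sketch of such a combination, here as a plain unweighted mean of the two estimates quoted in this thread (the choice of averaging scheme is exactly what would need discussion):

```python
# Values taken from this thread: 12 days is the linelist median recovery
# period; 10 days is the first mode of the JHU closing-period distribution.
linelist_recovery = 12
jhu_closing = 10

combined = round((linelist_recovery + jhu_closing) / 2)
print(combined)  # 11
```

A weighted mean (e.g. by the number of underlying observations) would be an alternative, but with only 277 linelist values the weighting would dwarf the linelist contribution.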

lisphilar commented 3 years ago

When a JHUData instance is created by DataLoader, we always use the recovery period calculated from the linelist. When a JHUData instance is created by the class method JHUData.from_dataframe() with a local CSV file, we use .calculate_closing_period().

This should be discussed in a new issue, but combining the recovery period calculated from the linelist with the recovery period (not the closing period) from the JHU dataset would be a good idea.