Inglezos commented 3 years ago

Summary of question

Is there data on the number of tests performed daily in each country? For example, this would let us compute the daily percentage of positive tests.

lisphilar commented 3 years ago

Records are available in the raw dataset of COVID-19 Data Hub. A new data cleaning class and a new DataLoader method are required.

lisphilar commented 3 years ago

How do you calculate the positive rate? I assume Tested.diff() per Confirmed.diff(), with the assumption that the number of confirmed cases is equal to that of PCR-positive cases.

Inglezos commented 3 years ago

Yes, exactly that! The rate is the number of daily PCR-confirmed cases divided by the total number of daily PCR tests. Since we have no other way to find PCR-only confirmed cases, the only option is to use the daily announced confirmed data for the positive rate, assuming they come from PCR tests. Is this achievable for v2.13?

lisphilar commented 3 years ago

(Correction to my previous comment)
Incorrect: Tested.diff() per Confirmed.diff()
Correct: Confirmed.diff() per Tested.diff()
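
The corrected formula can be sketched with pandas as follows. This is only an illustration under stated assumptions: the figures are made up, the function name `positive_rate` is hypothetical here, and confirmed cases are assumed to equal PCR-positive cases as discussed above.

```python
import pandas as pd

# Hypothetical cumulative records; the numbers are illustrative only.
df = pd.DataFrame(
    {
        "Date": pd.date_range("2020-10-01", periods=5, freq="D"),
        "Tested": [1000, 1500, 2100, 2800, 3600],
        "Confirmed": [100, 125, 155, 190, 230],
    }
).set_index("Date")


def positive_rate(records: pd.DataFrame) -> pd.Series:
    """Return the daily positive rate (%): Confirmed.diff() / Tested.diff() * 100."""
    daily = records[["Tested", "Confirmed"]].diff()
    # Mask days without new tests so that we never divide by zero.
    tests = daily["Tested"].where(daily["Tested"] > 0)
    return (daily["Confirmed"] / tests * 100).rename("Positive_rate_%")


rate = positive_rate(df)
print(rate)
```

The first day has no previous record, so its rate is NaN; every later day is the ratio of the two daily increments.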

Yes this will be included in v2.13, but please create a pull request for this issue, if possible.

Create a new file covsirphy/cleaning/pcr_data.py with a PCRData class. (The file/class name can be changed.)
PCRData is a subclass of the CleaningBase class; it overrides the _cleaning method and adds a positive_rate method.
The raw dataset is the COVID-19 Data Hub dataset.
The COVID19DataHub class in covsirphy/cleaning/covid19datahub.py and the DataLoader class will be updated.
Test code will be written in tests/test_pcrdata.py, not tests/test_dataloader.py, because I'm editing test_dataloader.py with #391.

lisphilar commented 3 years ago

Additionally, please update covsirphy/__init__.py so that we can use the class with import covsirphy as cs; cs.PCRData.

Inglezos commented 3 years ago

I am trying to have a first implementation, but I have some questions:

What we need to do essentially is to keep the extra "tests" column from the covid19datahub dataset in order to use this for the positive rate.
One first question that occurred to me is why to write a new class and file for that, since we could simply adapt the current jhu implementation to keep the extra column. But okay, for simplicity reasons we should not mix these two, right?
The PCRData class will be similar to the JHUData? The dataset, to which the PCRData will correspond, will be the same to JHU, but with the extra column? So the final kept columns will be Date, Country, Province, C, CI, F, R and Tests?
Which methods are reasonable to be kept for PCRData? Because if this is similar to JHU data, then we must keep all the replace(), subset(), records(), closing_period(), subset_complement() etc methods. Of course plus a new positive_rate() method.
The way covid19dh.csv is downloaded has to change: we would have to force a download every time either JHU or PCR is called, because JHU uses raw=false while PCR needs raw=true. If JHU was executed first, the csv exists but doesn't contain the "tests" column. If we then call PCR, it will decide that downloading the dataset is unnecessary, but loading will fail due to the wrong columns. So I think we should change _download() in covid19datahub.py to always use raw=true, because I think this does not affect JHU after all.
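
One possible way to avoid the stale-cache failure described in the last point is to check the header of the cached CSV before reusing it. This is a minimal sketch, not the covsirphy implementation: the helper name `cache_is_usable` and the example column names other than "tests" are assumptions made for illustration.

```python
from pathlib import Path

import pandas as pd


def cache_is_usable(csv_path, required_columns):
    """Return True when the cached CSV exists and has every required column."""
    path = Path(csv_path)
    if not path.exists():
        return False
    # nrows=0 parses only the header row, so the check stays cheap.
    header = pd.read_csv(path, nrows=0).columns
    return set(required_columns).issubset(header)
```

A caller could download the raw dataset again only when `cache_is_usable(path, ["date", "confirmed", "tests"])` returns False, instead of forcing a download on every call.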

And a major issue: some countries have missing test records. How should we handle this? I thought we could fill these with interpolated values.

lisphilar commented 3 years ago

What we need to do essentially is to keep the extra "tests" column from the covid19datahub dataset in order to use this for the positive rate.

Yes. self._cleaned_df will have "Date"=self.DATE, "Country"=self.COUNTRY, "Province"=self.PROVINCE, "Tested" and "Confirmed" columns.

One first question that occurred to me is why to write a new class and file for that, since we could simply adapt the current jhu implementation to keep the extra column. But okay, for simplicity reasons we should not mix these two, right?

Just for refactoring. The JHUData class has too many methods. It is easier to find issues and to add new features when classes have simple code. (Because of this, I moved the complement code out of the JHUData class in #382.)

The PCRData class will be similar to the JHUData? The dataset, to which the PCRData will correspond, will be the same to JHU, but with the extra column? So the final kept columns will be Date, Country, Province, C, CI, F, R and Tests?

self._cleaned_df will have "Date"=self.DATE, "Country"=self.COUNTRY, "Province"=self.PROVINCE, "Tested" and "Confirmed" columns. CI, F, R will be ignored because we need only tested and confirmed data for the positive rate calculation.

Which methods are reasonable to be kept for PCRData? Because if this is similar to JHU data, then we must keep all the replace(), subset(), records(), closing_period(), subset_complement() etc methods. Of course plus a new positive_rate() method.

Because we use only tested and confirmed cases, most of the methods will not be included. Please do not start with JHUData. Please start with the CleaningBase class. The PCRData class is a direct child class of the CleaningBase class. __init__ and _cleaning will be overridden. subset_complement and records will be newly created to complement values, if necessary.
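
The class layout described above could look roughly like the sketch below. It is an illustration only: `CleaningBaseStandIn` is a hypothetical stand-in written for this example and is not the real covsirphy CleaningBase, and `PCRDataSketch` assumes the raw columns have already been renamed to the cleaned names.

```python
import pandas as pd


class CleaningBaseStandIn:
    """Hypothetical stand-in for CleaningBase; not the covsirphy implementation."""

    DATE = "Date"
    COUNTRY = "Country"
    PROVINCE = "Province"

    def __init__(self, raw):
        self._raw = raw
        self._cleaned_df = self._cleaning()

    def _cleaning(self):
        return self._raw.copy()

    def cleaned(self):
        return self._cleaned_df


class PCRDataSketch(CleaningBaseStandIn):
    """Direct child of the base class that keeps only tested and confirmed data."""

    TESTED = "Tested"
    CONFIRMED = "Confirmed"

    def _cleaning(self):
        cols = [self.DATE, self.COUNTRY, self.PROVINCE, self.TESTED, self.CONFIRMED]
        # Columns such as Fatal and Recovered are dropped here.
        df = self._raw.loc[:, cols].copy()
        df[self.DATE] = pd.to_datetime(df[self.DATE])
        keys = [self.COUNTRY, self.PROVINCE, self.DATE]
        return df.sort_values(keys).reset_index(drop=True)
```

Starting from the base class rather than from JHUData keeps the child class small: only the cleaning step is overridden, and further methods can be added one at a time.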

lisphilar commented 3 years ago

The way covid19dh.csv is downloaded has to change: we would have to force a download every time either JHU or PCR is called, because JHU uses raw=false while PCR needs raw=true. If JHU was executed first, the csv exists but doesn't contain the "tests" column. If we then call PCR, it will decide that downloading the dataset is unnecessary, but loading will fail due to the wrong columns. So I think we should change _download() in covid19datahub.py to always use raw=true, because I think this does not affect JHU after all.

In cleaning.covid19datahub.COVID19DataHub class, we need to update class variable OBJ_DICT and _retrieve method. In _retrieve() method, please add "tests" to col_dict variable.
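
As a rough illustration of adding "tests" to a column mapping, the sketch below renames raw columns and keeps only the mapped ones. The contents of `col_dict` here are assumptions for the example (only "tests", "Tested" and "Confirmed" are taken from this thread), and the real _retrieve() method may work differently.

```python
import pandas as pd

# Hypothetical raw-to-cleaned column mapping for illustration.
col_dict = {
    "date": "Date",
    "confirmed": "Confirmed",
    "tests": "Tested",
}

# Made-up raw records with one column that is not in the mapping.
raw = pd.DataFrame(
    {
        "date": ["2020-10-01", "2020-10-02"],
        "confirmed": [10, 12],
        "deaths": [1, 1],
        "tests": [200, 220],
    }
)

# Rename the mapped columns and drop everything else.
df = raw.rename(columns=col_dict).loc[:, list(col_dict.values())]
print(df)
```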

Some countries have missing tests records. How should we handle this? I thought we could fill these with interpolated values.

Yes, we can use interpolated values in the subset_complement and records methods. The positive_rate() method will call the records() method.

Inglezos commented 3 years ago

I have some issues with missing test records. The complement is problematic. The data format for example can be (denote with x any valid value):

x x x nan nan nan x x x x nan x x

What kind of complement solution could be applied? Do you have any code suggestions? I tried both partial complement and non monotonic but they both failed to produce what might seem valid values to me. Valid values for example would be (denote with v the new generated value, assuring monotonic increase):

x x x v v v x x x x v x x

The difficult part was interpolating the individual nan near the end (the same problem as when the last row for Japan was 0) while keeping all values strictly monotonically increasing (no repeated values) and keeping the generated number of tests far larger than the number of confirmed cases. At best, the values were interpolated, but they gave about 50% positivity because the difference from the previous value was small compared to the confirmed cases.

lisphilar commented 3 years ago

Did you try interpolating with spline, order=1?

Inglezos commented 3 years ago

Yes, both with spline order 1 and linear. That was the first problem, but I think I solved it. The second one was the high rate caused by small interpolated values. Perhaps it would be better for me to create a first pull request so that you can check what I have done, and leave the complement to be reworked last. I will do this this evening.

lisphilar commented 3 years ago

Thank you! I will check it.

lisphilar commented 3 years ago

(Memo: interpolate .diff values, not Tested itself)
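
A minimal sketch of that memo, assuming linear interpolation and made-up numbers that follow the "x x x nan nan nan x x x x nan x x" pattern from the earlier comment. It is one possible reading of the idea, not the code that was merged.

```python
import numpy as np
import pandas as pd

x = np.nan
# Hypothetical cumulative number of tests with gaps.
tested = pd.Series(
    [1000, 1500, 2000, x, x, x, 4000, 4500, 5000, 5500, x, 6500, 7000],
    index=pd.date_range("2020-10-01", periods=13, freq="D"),
    dtype="float64",
)

# Daily increments are NaN on missing days and on the first day after each gap.
daily = tested.diff()

# Interpolate the daily increments, not the cumulative values.
daily_filled = daily.interpolate(method="linear")
print(daily_filled)
```

Filling the increments directly keeps the daily number of tests on the same scale as neighbouring days, which is what the positive rate divides by. Note that the filled increments are not guaranteed to add up to the known cumulative values after a gap; they happen to do so in this example.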

Inglezos commented 3 years ago

Please have a look at the changes so we can discuss them, as well as how to handle complements. What should I do to fix these GitHub warnings/errors?

lisphilar commented 3 years ago

Thank you! I could not find a fatal error in the PCRData class, but the tests failed. I think this is because of c_res = covid19dh.covid19(country=None, level=1, verbose=False, raw=False), as I mentioned in the review comments. Is this required?

lisphilar commented 3 years ago

407 and follow-up #407 were merged.

Which countries need partial complement in the PCR dataset? This should be tested in "tests/test_datahub.py::TestPCRData::test_positive_rate()".

At this time, the figure produced by PCRdata.positive_rate(country="Greece") is as follows: [figure: pcr_positive_rate_Greece]

lisphilar commented 3 years ago

With the closing of this issue, I will release version 2.13.0.

lisphilar / covid19-sir

Records for daily tests executed #389
