lisphilar / covid19-sir

CovsirPhy: Python library for COVID-19 analysis with phase-dependent SIR-derived ODE models.
https://lisphilar.github.io/covid19-sir/
Apache License 2.0
110 stars 44 forks

[Fix] [outdated codes in DataLoader] COVID-19 Data Hub stops updating pre-processed data files #716

Closed eguidotti closed 3 years ago

eguidotti commented 3 years ago

Hi @lisphilar, I would like to notify you that we stopped updating the pre-processed data files at COVID-19 Data Hub, in order to improve the update and storage of the raw data.

I don't know if this may cause some issues in your package. Please switch to the raw data if you are still using the pre-processed files.

Many thanks! Emanuele

lisphilar commented 3 years ago

@eguidotti, thank you for kindly letting me know!

As of 2021-04-13 (today), I can download records up to 2021-04-13 with covsirphy.DataLoader.jhu().

However, CovsirPhy internally calls covid19dh.covid19(raw=False) from your Python package. Does raw=False mean we are using the pre-processed data files? If so, we will switch to raw=True (or access your raw data files and the list of primary sources directly) and perform the pre-processing ourselves.

Thank you always!

eguidotti commented 3 years ago

Yes, exactly. Setting raw=True (the default) uses the raw data we are maintaining. Basically you get the same data and format. The only difference is that missing values are not filled, unlike in the pre-processed data files, as described on the website: https://covid19datahub.io/articles/data.html

lisphilar commented 3 years ago

@eguidotti, Thank you for your confirmation. We will use raw=True and replace NA values with the previous non-NA value or 0 in the next stable version.
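For illustration, the fill rule mentioned above can be sketched in plain Python. This is only a minimal sketch, not the CovsirPhy implementation: the function name `fill_missing` is hypothetical, and it assumes that "or 0" means "use 0 when no earlier non-NA value exists". With pandas, the same idea is usually written as `series.ffill().fillna(0)`.

```python
from typing import List, Optional, Sequence


def fill_missing(values: Sequence[Optional[float]]) -> List[float]:
    """Replace None with the previous non-missing value.

    Assumption: 0 is used when no earlier value is available.
    """
    filled = []
    last = 0  # fallback until the first non-missing value is seen
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Made-up cumulative counts with gaps (not real data).
confirmed = [None, 10, None, None, 25, None]
print(fill_missing(confirmed))  # [0, 10, 10, 10, 25, 25]
```

Forward filling suits cumulative counts because a missing report most plausibly means "no new information", so the last known total is carried forward.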

@Inglezos, I will merge pull request #717 if the tests show no problems. I think it works, but could you double-check the changes with the next development version, 2.19.0-alpha? The new stable version 2.19.1 will be released just for this issue.

Inglezos commented 3 years ago

Generally it seems to be working normally. I noticed only a few findings for two countries. If you have some time, please check these:

Brazil PCR plot (totally wrong %, >100)

Sweden records plot (C, I, F, R) (many ending unupdated confirmed cases)

Note: These are probably not related to this source data change, but it would be good to find out why they are happening (especially the PCR one).

lisphilar commented 3 years ago

@Inglezos, Thank you for your reply. I compared the outputs of 2.19.0 and 2.19.0-alpha in Google Colab. Because there are no differences between the versions (except for the last date of the records, which is expected according to the COVID-19 Data Hub documentation), we can release 2.19.1 today.

2.19.0: https://gist.github.com/lisphilar/aa0b2ac71e3624712ea0d5f043d606fe

2.19.0-alpha: https://gist.github.com/lisphilar/31500800de7c9d16df1f5de705d26464

As you said, we have new issues for Brazil and Sweden. If I understand the problems correctly, can we discuss them in new separate issues?

Brazil PCR plot (totally wrong %, >100)

A spike of PCR positive rate was found around Feb2021. This may be an issue with complement methods of PCRData.

Sweden records plot (C, I, F, R) (many ending unupdated confirmed cases)

Yes, "Confirmed" and "Fatal" are stuck at constant values: 813,191 confirmed cases and 13,466 fatal cases. WHO data reports 876,506 confirmed cases and 13,660 fatal cases as of 14Apr2021.

Dear @eguidotti, Could you tell me the data source for Sweden in COVID-19 Data Hub? I found a link to the Public Health Agency of Sweden in the documentation, but it did not work today. Sorry for bothering you again.

With an issue in your repository, is it possible to change the data source to WHO data or other primary sources for Sweden?

Inglezos commented 3 years ago

A spike of PCR positive rate was found around Feb2021. This may be an issue with complement methods of PCRData.

Besides the weird spike, the main problem I notice is that many PCR values are above 70-80%, which seems improbable unless the tests are indeed few and the positivity is actually that high.

lisphilar commented 3 years ago

"Our World In Data" also reports rates from 50% to more than 100%. https://ourworldindata.org/coronavirus/country/brazil

It is certain that the situation is tense, and rates above 50% are a fact. Rates above 100% appear to be a data collection error. When the root cause is ambiguous, our data cleaning tool covsirphy.DataLoader is expected to show the data without complementing it.
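As a rough illustration of "show the data without complementing it", the sketch below computes a positive rate and only flags implausible values instead of altering them. It assumes the rate is daily new confirmed cases divided by daily new tests. The function name `positive_rates` and the sample numbers are hypothetical, and this is not how PCRData is implemented.

```python
from typing import List, Optional, Sequence, Tuple


def positive_rates(
    daily_tests: Sequence[int], daily_confirmed: Sequence[int]
) -> List[Tuple[Optional[float], bool]]:
    """Return (rate in %, suspicious flag) for each day.

    The raw rate is kept as-is. A day is flagged when the rate exceeds
    100% or cannot be computed because no tests were reported.
    """
    results = []
    for tests, cases in zip(daily_tests, daily_confirmed):
        if tests <= 0:
            results.append((None, True))  # rate undefined without tests
            continue
        rate = 100.0 * cases / tests
        results.append((rate, rate > 100.0))
    return results


# Made-up daily numbers (not real data).
tests = [1000, 800, 200, 0]
cases = [300, 480, 260, 50]
for rate, suspicious in positive_rates(tests, cases):
    print(rate, "suspicious" if suspicious else "ok")
```

Flagging rather than correcting leaves the decision to the user, which fits the case where the root cause (reporting delays, different counting units, or collection errors) is unknown.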

lisphilar commented 3 years ago

Released CovsirPhy version 2.19.1 on PyPI. This is the same as 2.19.0-alpha.

eguidotti commented 3 years ago

Hi @lisphilar, thanks for finding this out! They changed the URL structure for Sweden. The new URL is https://www.dataportal.se/en/datasets/525_1424/ and I have fixed this in the data sources (please allow some time for the automatic update to complete).

With an issue in your repository, is it possible to change the data source to WHO data or other primary sources for Sweden?

Is there some problem with our data for Sweden? I'd be happy to fix that!

lisphilar commented 3 years ago

Thank you @eguidotti for your support. I successfully downloaded a CSV file from the new URL to check the primary data today (2021-04-16). Unfortunately, the last date of the records was "2021-03-31" (column AB of the CSV file).

As @Inglezos reported, the numbers of confirmed and fatal cases for Sweden have not changed (813,191 cases and 13,466 cases respectively) since 2021-03-31 in the dataset retrieved from COVID-19 Data Hub using your covid19dh via our library. This may be due to the primary source not being updated.

WHO data reported 892,480 confirmed cases and 13,761 fatal cases on 16Apr2021.
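A stalled source like this can be spotted by counting how many of the most recent days repeat the final cumulative value. The following is a minimal sketch with hypothetical names (`trailing_constant_days`, `looks_stale`), a hypothetical threshold, and made-up numbers. It is not part of CovsirPhy or covid19dh.

```python
from typing import Sequence


def trailing_constant_days(cumulative: Sequence[int]) -> int:
    """Count how many days before the last one share the last value."""
    if not cumulative:
        return 0
    last = cumulative[-1]
    count = 0
    for value in reversed(cumulative[:-1]):
        if value != last:
            break
        count += 1
    return count


def looks_stale(cumulative: Sequence[int], max_days: int = 7) -> bool:
    """Flag a series whose final value has been repeated for more than max_days.

    Assumption: the 7-day default is an arbitrary illustrative threshold.
    """
    return trailing_constant_days(cumulative) > max_days


# Made-up cumulative counts (not real data).
series = [100, 120, 150, 150, 150, 150]
print(trailing_constant_days(series))  # 3
print(looks_stale(series, max_days=2))  # True
```

A threshold is needed because short flat runs are normal, for example when a country does not report on weekends.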

eguidotti commented 3 years ago

Thanks @lisphilar and @Inglezos for the information. I double-checked and it is actually due to the provider not updating the data. It seems they changed (once again) the data file they are maintaining. The official open governmental data for Sweden seem to be released and updated here now.

This should now be fixed. The data are (very) slightly different from the ones published previously. In a couple of hours you should be able to get the latest data for Sweden from the data hub as usual. Please let me know if you find any other issues with that.

lisphilar commented 3 years ago

@eguidotti, thank you very much for your help! I confirmed that the problem for Sweden was fixed.

There are now no remaining issues regarding the data. I will close this issue. Many thanks, as always, for updating and maintaining COVID-19 Data Hub!