ulklc / covid19-timeseries

Covid19 timeseries data store
MIT License

Australian data doesn't match official source #12

Closed chrisjbillington closed 4 years ago

chrisjbillington commented 4 years ago

AU/Australia

Confirmed: {2020/03/29} - 4093

Source

https://www.health.gov.au/news/health-alerts/novel-coronavirus-2019-ncov-health-alert/coronavirus-covid-19-current-situation-and-case-numbers

I'm not sure where your Australian numbers are coming from, since the link in the README (https://www.health.gov.au/news/coronavirus-update-at-a-glance) redirects to a page that does not have case numbers. Perhaps your numbers are more up-to-date (yours are 4163 at the moment, higher by 70), but if so I can't see them anywhere.

If you published your scraping scripts, I'd be better able to determine whether this is an actual bug or whether I just haven't found the latest data :).

chrisjbillington commented 4 years ago

Your data yesterday agreed with the UN Situation report at 3635 active cases, but disagrees with their number today, which is 3966 (also different to health.gov.au, presumably since it's less up-to-date...?).

chrisjbillington commented 4 years ago

I see the number 4163 on worldometers, but I can't find it in any official source, which all say 4093.

I wonder if it might be a typo: 4163 and 4093 are only two fat-finger errors on a numpad away from each other.
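The numpad hunch can be checked with a small sketch. It assumes the standard keypad layout (7-8-9 on top, a wide 0 key under 1 and 2) and treats two keys as a plausible slip when they share an edge; the function names are made up for illustration.

```python
# Standard numeric keypad as (row, column) cells; the wide 0 key covers two cells.
NUMPAD = {
    "7": [(0, 0)], "8": [(0, 1)], "9": [(0, 2)],
    "4": [(1, 0)], "5": [(1, 1)], "6": [(1, 2)],
    "1": [(2, 0)], "2": [(2, 1)], "3": [(2, 2)],
    "0": [(3, 0), (3, 1)],
}

def keys_adjacent(a, b):
    """True if keys a and b share an edge on the keypad."""
    return any(abs(r1 - r2) + abs(c1 - c2) == 1
               for r1, c1 in NUMPAD[a] for r2, c2 in NUMPAD[b])

def slip_count(x, y):
    """Number of differing digits if each one is an adjacent-key slip, else None."""
    sx, sy = str(x), str(y)
    if len(sx) != len(sy):
        return None
    diffs = [(a, b) for a, b in zip(sx, sy) if a != b]
    if all(keys_adjacent(a, b) for a, b in diffs):
        return len(diffs)
    return None

print(slip_count(4163, 4093))  # 2: 1->0 and 6->9 are both neighbouring keys
```

This only shows the typo is geometrically plausible, not that it happened.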

ulklc commented 4 years ago

Hi Chris, right now the scraping script reads 3 sources.

  1. DXY.cn. Pneumonia. 2020. http://3g.dxy.cn/newh5/view/pneumonia.
  2. BNO News: https://bnonews.com/index.php/2020/02/the-latest-coronavirus-cases/
  3. WorldoMeters: https://www.worldometers.info/coronavirus/

It saves the highest value found in these sources. The script works in the GMT time zone while Australia's time zones are GMT +8/+12, so it is possible that the script's values are off by a day. The sources I mentioned above do not publish historical data, so we build the history ourselves by taking a snapshot of these sources at the end of each day. If you think there is an error in the historical data, we can fix it quickly. However, as I mentioned above, some data is shifted due to the time zone.
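A minimal sketch of the behaviour described above, not the actual scraping script (which is not published in this thread). The function name, the source keys and the sample readings are illustrative assumptions. The second part shows why an end-of-day GMT snapshot lands on the next calendar day in Australia.

```python
from datetime import datetime, timedelta, timezone

def highest_reported(counts_by_source):
    """Return the largest count among the sources that reported a value."""
    reported = [c for c in counts_by_source.values() if c is not None]
    return max(reported) if reported else None

# Illustrative readings only; None means the source had no value for the country.
readings = {"dxy": None, "bno": 4093, "worldometers": 4163}
print(highest_reported(readings))

# A snapshot taken at the end of the GMT day...
snapshot_utc = datetime(2020, 3, 29, 23, 59, tzinfo=timezone.utc)

# ...already belongs to the following day at fixed offsets of +8 and +10 hours.
for hours in (8, 10):
    local = snapshot_utc.astimezone(timezone(timedelta(hours=hours)))
    print(f"GMT+{hours}: {local:%Y-%m-%d %H:%M}")
```

Taking the maximum means a single over-reporting source sets the stored value, which is consistent with the repository showing 4163 while the official page showed 4093.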

Thanks

chrisjbillington commented 4 years ago

OK, looks like worldometers must be summing more up-to-date data from the individual states to get a higher number than available from the national count on health.gov.au. So that's fine, I will just assume this is OK.

philipstarkey commented 4 years ago

I'm concerned about the Worldometer data for Australia. For example, it currently (7th Apr, 11:15am EST) lists the number of cases as 5895 with 2432 recovered, an increase of 145. Note that all of the sources are from yesterday.

But the official count as of 7th Apr, 6am EST (6th Apr, 20:00 GMT, taking EST to mean AEST, UTC+10) says:

As at 6:00am on 7 April 2020, there have been 5,844 confirmed cases of COVID-19 in Australia. There have been 100 new cases since 6:00am yesterday.

So somehow Worldometer, using sources from the 6th April (EST), ended up with a total higher than the official count on the 7th April EST. Either the official count is wrong compared to a sum of state counts, or Worldometer is relying on a cumulative total of cases as reported on Twitter/news sites, which is likely to be very prone to error (for example, if a number gets reported wrongly one day and Worldometer doesn't pick up on the correction). I note that none of the sources listed by Worldometer have the total count; they all refer to daily increases as far as I can see. Perhaps their listed sources are incomplete, but then that's also worrying for other reasons...

Note also that an official infographic (3pm EST, 6th Apr) lists 5795 cases and 2432 recovered. The recovered cases match Worldometer but the total cases do not. The previous day's infographic shows 5687 cases and 2315 recovered. That's an increase of 108 in 24 hours, significantly less than what Worldometer reports (145). Archived infographics:

It looks like worldometer daily increases do match this infographic (at least for the last few days) but that they are perhaps a day out of sync?

I don't know that there are any easy answers here, but there are enough odd things about the Worldometer data for me to distrust it, particularly the relationship between recovered cases and total cases, and whether the total has an offset due to the way they are counting cases. I'm not sure the timezone offsets between reports, scripts running, etc. can explain these discrepancies.
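The arithmetic behind this comparison can be laid out in a few lines. This is only a sketch using the figures quoted in this comment; the dictionary keys and the helper name are made up for illustration.

```python
# (total cases, reported daily increase), all figures as quoted in this thread
reports = {
    "worldometer_7apr_1115": (5895, 145),
    "official_7apr_0600": (5844, 100),
    "infographic_6apr_1500": (5795, 5795 - 5687),  # previous infographic: 5687
}

def implied_previous_total(total, increase):
    """Total the source must have shown one day earlier."""
    return total - increase

for name, (total, increase) in reports.items():
    prev = implied_previous_total(total, increase)
    print(f"{name}: total={total} increase={increase} implied previous={prev}")

# How far Worldometer's total sits above the official 6am count
offset = reports["worldometer_7apr_1115"][0] - reports["official_7apr_0600"][0]
print("offset vs official:", offset)
```

The three sources imply different totals for the previous day (5750, 5744 and 5687), so they are not simply the same series sampled at different times.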

chrisjbillington commented 4 years ago

My suspicion was that Worldometer is using data from the individual states rather than the national totals. I haven't checked, but it makes sense that they would be a bit ahead of the national totals.

philipstarkey commented 4 years ago

Perhaps, but I don't think the same applies for the recovered case number, which means the cases vs recovered relationship is wrong (at the very least).

chrisjbillington commented 4 years ago

One of the 'source' links from worldometers says:

Note that national daily and cumulative counts are likely to be lower earlier in the day, as some states/territories will not have reported their figures yet

Data on recoveries seems to often not be reported for days at a time anyway, so it might be the case that the recoveries numbers on worldometers agree with the 'older' national figures because no recoveries have been announced that day at any level.

Edit: Actually the quote doesn't say what I thought it did. Maybe disregard.

philipstarkey commented 4 years ago

Historically recoveries have not been reported well, but recoveries will now be reported daily at the Aus Federal level (as of Sunday). To me it makes sense to use the associated total cases reported at the same time if you are going to use the number of recoveries. Worldometer is not doing that; they're pulling in data from multiple sources for some reason.
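A small sketch of why pairing matters, using the numbers quoted earlier in this thread. The `Report` class is a hypothetical structure for illustration, and the subtraction ignores deaths for simplicity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Report:
    source: str
    confirmed: int
    recovered: int

    def unrecovered(self):
        """Confirmed minus recovered, both taken from the same report (deaths ignored)."""
        return self.confirmed - self.recovered

infographic = Report("official infographic, 6 Apr 3pm", confirmed=5795, recovered=2432)
worldometer = Report("worldometer, 7 Apr 11:15am", confirmed=5895, recovered=2432)

print(infographic.unrecovered())  # 3363
print(worldometer.unrecovered())  # 3463
```

With identical recovered counts, the 100-case difference in totals carries straight through to any derived figure, so mixing a newer total with an older recovery count overstates the unrecovered cases.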