tomwhite / covid-19-uk-data

Coronavirus (COVID-19) UK Historical Data
http://tom-e-white.com/covid-19-uk-data/
The Unlicense

No cases data for England for latest day? (c.f Wales and Scotland, which have it) #45

Closed timday closed 4 years ago

timday commented 4 years ago

I note the last commit to data/covid-19-cases-uk.csv is "Update for 2020-04-27 for England, using new process". However, there is no data for any England region for 2020-04-27 in the file, although Scotland and Wales do have data for that date, and England data for 2020-04-26 and before is still present.

tomwhite commented 4 years ago

I'm getting the data from the cases CSV file from the PHE dashboard (https://coronavirus.data.gov.uk/). The latest data in there is for 2020-04-26, so that's what appears in this repo. Before I was just getting the latest data and assigning it to the current date, which was not necessarily the correct thing to do.

Interestingly, the deaths CSV file does have data for 2020-04-27.

airallergy commented 4 years ago

@timday There are now divergences between the date types used by different statistics from varied data sources.

As for the positive case numbers for England and Wales, the term used is Specimen Date, which indicates the date of the first positive specimen of any tested individual in the lab. In contrast, other data, such as the death figures mentioned above by @tomwhite, use Reporting Date or something similar, indicating the date on which the government published the data after receiving them in batches from the lab; each batch contains results from many different specimen dates.

In this sense, there is apparently an inconsistency between the corresponding dates of the latest available data, unless they unify them some day. However, this inconsistency might be mitigated depending on what type of data you are looking at. For example, if the cumulative figures, either in total or in breakdown, concern you, the latest specimen date of the cumulative cases in England is essentially the same thing as the latest reporting date of those in Scotland, as far as I understand.

timday commented 4 years ago

Hmmm.... thanks, interesting.

For the purposes of looking at cases by region across England, Scotland and Wales (and using only days for which data is available for all nations) I'm now wondering whether the "best"/"most realistic" thing to do is either:

or

Not at all clear to me which is more "correct".

It's only the covid-19-cases-uk.csv file I'm looking at, not deaths.

airallergy commented 4 years ago

In terms of cases by region, if you look at Wales historical data csv file provided on this dashboard, England and Wales are actually on the same page (both using Specimen Date), compared to Scotland, as I mentioned above.

What England and Wales are doing on their dashboards is the first option (shifting the latest specimen date to match the latest reporting date of Scotland) you mentioned, which I think is the most realistic approach to check the latest cases breakdown. To justify, this approach can be regarded as the means to obtain the latest available cases breakdown. This also makes sense considering cases with unknown regions.

But do bear in mind that this approach only makes sense subject to 1. latest data and 2. cumulative data. If other types, either historical or daily, are involved, I see no sensible option of consistency across all nations, unless they unify the publishing standard.

timday commented 4 years ago

Another thing I notice from charting the England cases data:

Compare the output from grep 2020-04-27 data/covid-19-cases-uk.csv | grep England | head

2020-04-27,England,E09000003,Barnet,1176
2020-04-27,England,E08000016,Barnsley,607
2020-04-27,England,E09000004,Bexley,597
2020-04-27,England,E08000025,Birmingham,2782
2020-04-27,England,E06000009,Blackpool,413
2020-04-27,England,E08000032,Bradford,796
2020-04-27,England,E09000005,Brent,1330
2020-04-27,England,E06000023,"Bristol, City of",591
2020-04-27,England,E09000006,Bromley,1027
2020-04-27,England,E08000002,Bury,434

with grep 2020-04-28 data/covid-19-cases-uk.csv | grep England | head

2020-04-28,England,E09000002,Barking and Dagenham,448
2020-04-28,England,E09000003,Barnet,1176
2020-04-28,England,E08000016,Barnsley,608
2020-04-28,England,E06000022,Bath and North East Somerset,203
2020-04-28,England,E06000055,Bedford,424
2020-04-28,England,E09000004,Bexley,597
2020-04-28,England,E08000025,Birmingham,2782
2020-04-28,England,E06000008,Blackburn with Darwen,301
2020-04-28,England,E06000009,Blackpool,413
2020-04-28,England,E08000001,Bolton,732

the counts for the places listed in both haven't changed at all (or, in Barnsley's case, rose by just 1). This seems most unlikely given the general rate of increase previously; it looks more like the data from the 27th has simply been "reused" on the 28th.
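For anyone wanting to check this without eyeballing grep output, here is a minimal Python sketch of the same comparison. It assumes the column order seen in the rows above (date, country, area code, area name, cumulative cases) and uses only the areas that appear in both listings; the function name `day_over_day_change` is made up for illustration.

```python
import csv
import io

# Rows copied from the two grep outputs above, limited to areas present on both dates.
# Assumed column order: date, country, area code, area name, cumulative cases.
SAMPLE = """\
2020-04-27,England,E09000003,Barnet,1176
2020-04-27,England,E08000016,Barnsley,607
2020-04-27,England,E09000004,Bexley,597
2020-04-27,England,E08000025,Birmingham,2782
2020-04-27,England,E06000009,Blackpool,413
2020-04-28,England,E09000003,Barnet,1176
2020-04-28,England,E08000016,Barnsley,608
2020-04-28,England,E09000004,Bexley,597
2020-04-28,England,E08000025,Birmingham,2782
2020-04-28,England,E06000009,Blackpool,413
"""


def day_over_day_change(rows, earlier, later):
    """Return {area_code: later_total - earlier_total} for areas present on both dates."""
    totals = {(date, code): int(total) for date, _country, code, _area, total in rows}
    codes_earlier = {code for (date, code) in totals if date == earlier}
    codes_later = {code for (date, code) in totals if date == later}
    return {
        code: totals[(later, code)] - totals[(earlier, code)]
        for code in sorted(codes_earlier & codes_later)
    }


rows = list(csv.reader(io.StringIO(SAMPLE)))
changes = day_over_day_change(rows, "2020-04-27", "2020-04-28")
unchanged = [code for code, delta in changes.items() if delta == 0]
print(changes)
print(f"{len(unchanged)} of {len(changes)} areas unchanged")
```

To run it against the real file, `rows` could be read from data/covid-19-cases-uk.csv instead, skipping any header row first.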

There's also something new going on with some regions becoming more "gappy" (e.g. Isle of Wight); I'm sure I'd have noticed that before, as it results in gaps appearing in some of my charts which were continuous lines before.

airallergy commented 4 years ago

That is because among all the data of a range of specimen dates the gov is receiving every day, only a fairly small number come from yesterday. No expert here, but I guess this means very few tests can have results in merely one day. This is one of the reasons for the daily revision, so the data of 27/04/2020 would be more reasonable if you look at it on 29/04/2020 than on 28/04/2020. Btw, the revision can affect data of over a month ago, not for every region though.

However, I am not quite sure about tom's current daily update process. I mentioned it in #41, but it seems this revision issue hasn't been fully addressed, judging by the data you attached above. The way I deal with this is to simply overwrite the historical data for England and Wales with the new ones on a daily basis.
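A minimal sketch of that overwrite step in Python, assuming rows in the repo's column order (date, country, area code, area name, cumulative cases). The helper name `overwrite_history` is hypothetical, and the Birmingham figures are the ones quoted elsewhere in this thread; this is not a description of how the repo's own update script works.

```python
import csv
import io

# Figures from an older download (Birmingham, as quoted in this thread).
OLD = """\
2020-04-27,England,E08000025,Birmingham,2782
2020-04-28,England,E08000025,Birmingham,2782
"""

# Figures from a newer download, with revised history plus one extra day.
NEW = """\
2020-04-27,England,E08000025,Birmingham,2799
2020-04-28,England,E08000025,Birmingham,2801
2020-04-29,England,E08000025,Birmingham,2801
"""


def parse(text):
    """Parse header-less CSV text into a list of rows (lists of strings)."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def overwrite_history(old_rows, new_rows):
    """Merge rows keyed by (date, area code); rows from new_rows win on conflict."""
    merged = {(row[0], row[2]): row for row in old_rows}
    merged.update({(row[0], row[2]): row for row in new_rows})
    return [merged[key] for key in sorted(merged)]


merged_rows = overwrite_history(parse(OLD), parse(NEW))
for row in merged_rows:
    print(",".join(row))
```

Keying on (date, area code) means revised totals replace the stale ones, while dates that are missing from the newer download are left as they were.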

timday commented 4 years ago

Just looking at today's update. The pattern of the last 2 days' numbers often being the same (I've not actually done a comprehensive survey) continues, e.g. grep E08000025 data/covid-19-cases-uk.csv | tail

2020-04-20,England,E08000025,Birmingham,2494
2020-04-21,England,E08000025,Birmingham,2558
2020-04-22,England,E08000025,Birmingham,2621
2020-04-23,England,E08000025,Birmingham,2674
2020-04-24,England,E08000025,Birmingham,2719
2020-04-25,England,E08000025,Birmingham,2757
2020-04-26,England,E08000025,Birmingham,2789
2020-04-27,England,E08000025,Birmingham,2799
2020-04-28,England,E08000025,Birmingham,2801
2020-04-29,England,E08000025,Birmingham,2801

but comparing with the numbers in my previous comment, it can be seen that the numbers for the 27th and 28th have been bumped up from 2782 (both) to 2799 and 2801. So, yes, presumably each region's case-count curve can be thought of as converging on some "true" number as the data trickles in over time. But for the last day given, it seems nothing has arrived yet and the number given is just the previous day's. It does perhaps make charts look a bit misleading though... always looking like they've just reached the point of flattening off. Makes me wonder if I should just ditch that last day's datapoint, but it's already one day behind Scotland and Wales.
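One way to sketch the "ditch the last day" idea in Python: drop the final point of a cumulative series only when it merely repeats the previous day's total. The function name `drop_stale_last_point` is made up for illustration, and an identical final value is only a heuristic for "nothing has arrived yet"; a genuine day with no new cases would be dropped too.

```python
# Tail of the Birmingham series from the grep output above: (date, cumulative cases).
birmingham = [
    ("2020-04-26", 2789),
    ("2020-04-27", 2799),
    ("2020-04-28", 2801),
    ("2020-04-29", 2801),
]


def drop_stale_last_point(series):
    """Return the series without its final point if that point repeats the previous total.

    `series` is a list of (date, cumulative_total) tuples sorted by date.
    """
    if len(series) >= 2 and series[-1][1] == series[-2][1]:
        return series[:-1]
    return list(series)


trimmed = drop_stale_last_point(birmingham)
print(trimmed[-1])
```

The trade-off is the one noted above: the trimmed series ends a day earlier, so England falls a further day behind Scotland and Wales.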

airallergy commented 4 years ago

Yes, these latest data could be misleading. Ditching the last day's data could be useful to reveal the true trend, in a sense.

May I suggest another method if you want to keep the England and Wales historical data consistent with Scotland: concatenate the latest total numbers from each daily file, i.e. the 29/04/2020 cumulative data published on 30/04/2020, the 28/04/2020 data published on 29/04/2020, etc. Through this you can transform specimen-date data into reporting-date data. tom has archived all the old csv files, which makes it quite easy to do so.
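A rough Python sketch of that concatenation, assuming each archived daily file is available as header-less CSV text in the repo's column order (date, country, area code, area name, cumulative cases). The two snapshots below are reconstructed from the Birmingham figures quoted in this thread, and `latest_totals` is a hypothetical helper name.

```python
import csv
import io

# Two archived snapshots, oldest first, reconstructed from figures quoted in this thread.
SNAPSHOTS = [
    """\
2020-04-27,England,E08000025,Birmingham,2782
2020-04-28,England,E08000025,Birmingham,2782
""",
    """\
2020-04-27,England,E08000025,Birmingham,2799
2020-04-28,England,E08000025,Birmingham,2801
2020-04-29,England,E08000025,Birmingham,2801
""",
]


def latest_totals(snapshots, area_code):
    """Build {date: total} by taking only the most recent row for the area from each snapshot."""
    series = {}
    for text in snapshots:
        rows = [r for r in csv.reader(io.StringIO(text)) if r and r[2] == area_code]
        if not rows:
            continue  # area missing from this snapshot
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        date, _country, _code, _area, total = max(rows, key=lambda r: r[0])
        series[date] = int(total)
    return dict(sorted(series.items()))


as_reported = latest_totals(SNAPSHOTS, "E08000025")
print(as_reported)
```

Each value is the total as it stood when first published, so later revisions are deliberately ignored; that is what makes the series comparable with reporting-date figures.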