tomwhite / covid-19-uk-data

Coronavirus (COVID-19) UK Historical Data
http://tom-e-white.com/covid-19-uk-data/
The Unlicense

unplotted cases #23

Closed nanjizal closed 4 years ago

nanjizal commented 4 years ago

Tom, the cases listed at the end of this comment are currently unplotted on my map.

I am not sure what to do about the following:

I need to add some new locations to my map, but I don't yet have an outline for them: Derry and Strabane, and North Down and Ards.

Can you provide more details on "12 Mar, England 491 ill" and "20 Mar, England 3384 ill"? Are these summaries, or data in addition to the regions? If they are summaries, then surely they don't belong in the file?

"To be confirmed" and "Awaiting Confirmation" should be merged into a single name. Do we need to be more consistent than the gov?

Perhaps it would be best to filter the ugly data out of the main source and provide it as a supplemental source. Obviously I can filter out problem data myself, but I think it would be better to remove it from the main source so that aspects like changes of name can be focused on.

I was thinking of a 3D render with a simple extrusion per day, using my trilateral2 WebGL library. Do you think that would be useful?

Could we perhaps create a team of interested parties to release an app to gather more data directly from the public? I could definitely create mobile graphics applications with Haxe that would run on Android, iPhone and so on, with some 3D, given assistance on mobile integration. It all works in theory, but for Apple especially I don't have a phone to test on, and I don't have funds for servers, domain names, Apple developer licences, promotion, etc.

If you're free at the weekend, we could have a group Skype or similar tech meetup with interested parties. The gov data is very limited.

- unplotted
5 Mar, awaiting clarification 8 ill,
7 Mar, awaiting clarification 14 ill,
8 Mar, Awaiting confirmation 20 ill,
9 Mar, Awaiting confirmation 26 ill,
10 Mar, Awaiting confirmation 15 ill,
12 Mar, England 491 ill,
17 Mar, Resident outside Wales 2 ill,
20 Mar, England 3384 ill,
20 Mar, To be confirmed 1 ill,
20 Mar, Resident outside Wales 2 ill,
21 Mar, To be confirmed 1 ill,
21 Mar, Resident outside Wales 2 ill,
22 Mar, To be confirmed 1 ill,
22 Mar, Resident outside Wales 3 ill,
23 Mar, To be confirmed 1 ill,
23 Mar, Resident outside Wales 3 ill,
24 Mar, To be confirmed 2 ill,
24 Mar, Resident outside Wales 3 ill,
25 Mar, To be confirmed 4 ill,
25 Mar, Resident outside Wales 4 ill,
26 Mar, Unknown 7 ill,
26 Mar, Derry and Strabane 8 ill,
26 Mar, North Down and Ards 22 ill,
26 Mar, Unknown 7 ill,
26 Mar, Resident outside Wales 5 ill,
26 Mar, To be confirmed 7 ill,
27 Mar, Derry and Strabane 9 ill,
27 Mar, North Down and Ards 25 ill,
27 Mar, Unknown 7 ill,
27 Mar, Resident outside Wales 6 ill,
27 Mar, To be confirmed 13 ill,
28 Mar, Resident outside Wales 6 ill,
28 Mar, To be confirmed 18 ill,
29 Mar, Resident outside Wales 10 ill,
29 Mar, To be confirmed 22 ill,
30 Mar, Derry and Strabane 23 ill,
30 Mar, North Down and Ards 52 ill,
30 Mar, Unknown 22 ill,
30 Mar, Resident outside Wales 12 ill,
30 Mar, To be confirmed 24 ill,
31 Mar, Derry and Strabane 24 ill,
31 Mar, North Down and Ards 61 ill,
31 Mar, Unknown 29 ill,
31 Mar, Resident outside Wales 13 ill,

nanjizal commented 4 years ago

For the app, how about cross-platform Haxe for the client and a Python server for processing and collating the data?

nanjizal commented 4 years ago

With a team we could have people like me working on presentation and interaction, and others working on curve fitting and modelling, and others on structuring and collating and storage, and others on advertising to get more useful data.

timday commented 4 years ago

"Could we perhaps create a team between interested parties to release an app to gather more data direct from public"

Given the data presented by this repo is ultimately being sourced through long established channels and protocols for reporting of "notifiable diseases" in a reasonably robust and consistent way, I find the idea of polluting such "official" data with the anarchy which would inevitably result from "direct" data gathering somewhat perverse. You might as well try and figure out the virus' spread from surveying tweets or firing off Facebook polls or similar.

Don't know if you're aware of the https://covid.joinzoe.com/ app? There is a recent press release from King's College (the academic partner; ZOE were a spun-out startup, originally with a nutrition app, but they quickly got a covid-tracker app out) at https://www.kcl.ac.uk/news/symptom-tracker-app-hits-15-million-uk-users . Given their reach, they must be gathering some quite interesting "real time" data, but I'm not sure they've released anything yet. I'd certainly love to see a webpage with a statistical summary of what their users are reporting, and to see how it compares with the PHE numbers. (Oh, but I just noticed they put this out: https://youtu.be/VD1oNX_L6eU ... not watched it yet.)

"...get more useful data."

I'm puzzled why you'd think the data here isn't useful. What are you hoping to do with it? Having to deal with odd "Unknown" or "unconfirmed" type categories is a fact of life dealing with messy real world data; it's very rarely a showstopper issue though.

timday commented 4 years ago

... I did get around to watching the KC/ZOE team's video linked above. Well worth the time, gives some idea of the rapid science that can be done on such data, albeit perhaps tempered by patchy demographic and geographical coverage. Interestingly at some point they do mention that they're considering how to open up the data they're gathering, at least at "map level".

tomwhite commented 4 years ago

I agree with @timday - this repo is for collating the raw data from the UK Public Health bodies.

The point about "Unconfirmed" categories is worth thinking about. However, I'm wary of normalizing these categories since it's too easy to make assumptions that are not valid about the source data. Some changes to the source data are OK I think - e.g. change "Ayrshire & Arran" to "Ayrshire and Arran" for consistency, but lumping all "Unknown" categories into one is not so clearcut, so I'd rather err on the side of leaving them as they are, even though there is a minor inconvenience to downstream users of the data.
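As a rough illustration of the distinction drawn here, the sketch below applies only spelling-level fixes to an area name and deliberately leaves the placeholder categories as separate labels. The function name `normalize_area_name` is hypothetical and is not part of this repo; only the "Ayrshire & Arran" example comes from the comment above.

```python
import re


def normalize_area_name(name: str) -> str:
    """Apply spelling-only fixes; never merge one category into another."""
    # Trim the ends and collapse internal runs of whitespace.
    cleaned = re.sub(r"\s+", " ", name.strip())
    # Spell out ampersands so "Ayrshire & Arran" matches "Ayrshire and Arran".
    return re.sub(r"\s*&\s*", " and ", cleaned)


print(normalize_area_name("Ayrshire & Arran"))        # Ayrshire and Arran
print(normalize_area_name("Awaiting confirmation"))   # Awaiting confirmation
print(normalize_area_name("To be confirmed"))         # To be confirmed
```

The two "unconfirmed" labels stay distinct, so no assumption about the source data is baked in.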

Thanks for sharing the ZOE video Tim - I agree it's well worth a watch. And everyone sign up to the app if you haven't already - and tell your friends and family about it too.

nanjizal commented 4 years ago

"What are you hoping to do with it?"

@timday I am currently just rendering the cases to a map: https://nanjizal.github.io/covid19/bin/index.html?test3

I am most concerned about:

Can you provide more details on "12 Mar, England 491 ill" and "20 Mar, England 3384 ill"? Are these summaries, or data in addition to the regions? If they are summaries, then surely they don't belong in the file?

I really do not understand how they relate to the other data; they seem pretty random. I'm just not sure what to do with them. Are they a summary? In that case they surely don't belong in this CSV.

Do you understand the difference between "To be confirmed" and "Awaiting Confirmation"? Surely these are the same thing. I can code round it, but if they are the same then the CSV should standardise on one name to make it easier for new consumers. In fact, if the data is still under review then I don't think it's relevant anyway, especially given that it has no assigned location and that we know the data is only loosely related to reality. Discarding this data is not going to make any difference; it just makes the feed cleaner.

I think that while the gov wants to record all data, anyone wanting to plot the data is likely more interested in data that can be plotted and related, not data that is under review.

timday commented 4 years ago

Real world data is messy by default. Odd "unknown"/"unconfirmed"/"unassigned"-type categories are to be expected. How a data consumer deals with them depends a lot on what they're trying to achieve. If your map-plotting app can discard "unknown", that's one perfectly valid solution. But other consumers of the data may regard them as providing valuable information about the uncertainties. So IMHO these csv files are fine as they are and this isn't the right place to be cleaning them up. Of course ideally the upstream source would have such a tight operation that they'd never need to put any data out in oddball categories... but again the real world isn't always that organized.
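A minimal sketch of the consumer-side choice described above: a map can drop the placeholder rows while another consumer keeps them as a measure of uncertainty. The placeholder labels are the ones that appear in the unplotted list in the opening comment; the column names (`Date`, `Area`, `TotalCases`), the date format and the function name `split_rows` are assumptions for illustration only.

```python
import csv
import io

# Labels with no plottable location, matched case-insensitively.
UNPLOTTABLE = {
    "awaiting clarification",
    "awaiting confirmation",
    "resident outside wales",
    "to be confirmed",
    "unknown",
}


def split_rows(csv_text, area_column="Area"):
    """Return (plottable, unplottable) lists of row dicts from CSV text."""
    plottable, unplottable = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Trim and collapse whitespace before comparing names.
        area = " ".join(row[area_column].split())
        row[area_column] = area
        if area.lower() in UNPLOTTABLE:
            unplottable.append(row)
        else:
            plottable.append(row)
    return plottable, unplottable


# Two rows taken from the unplotted list, in an assumed column layout.
sample = (
    "Date,Area,TotalCases\n"
    "2020-03-26,Derry and Strabane,8\n"
    "2020-03-26,To be confirmed,7\n"
)
plottable, unplottable = split_rows(sample)
print([r["Area"] for r in plottable])    # ['Derry and Strabane']
print([r["Area"] for r in unplottable])  # ['To be confirmed']
```

Returning both lists, instead of silently discarding one, leaves the decision with each consumer and keeps the source CSV untouched.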

nanjizal commented 4 years ago

It's just ugly to filter in real time in a browser, since I consume the raw files directly from this repo. It would be ideal to have a streamlined version created at the same time as Tom creates the other data. Currently I ignore "awaiting clarification", "Awaiting confirmation", "Resident outside Wales", "To be confirmed" and "Unknown", and have to trim and scan the data for names, etc. If I was running a server, I guess I could create a service that injected the lat/long and eastNorth data and cleaned up names. I could write something in PHP/Python/Java/Haxe if it was useful, but I'm not really in a position to host anything. I just assumed other users would like cleaner data that is easier to parse. I am not really sure I got sensible lat/long positions for Wales; it's kind of tricky. The gov could be a lot more organised with a lot of the related data. It's like they have it, but someone forgot to explain that Excel is not how you store data for programmers. It's kind of crazy that I have to spend so long googling to match some places up to locations.
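The streamlined file described above could in principle be produced by a small offline step instead of a hosted service. A minimal sketch, assuming a cases CSV with an `Area` column and a separate, hand-maintained locations CSV with `Area`, `Lat` and `Long` columns. These file layouts and the function name `merge_locations` are hypothetical, not something this repo provides.

```python
import csv


def merge_locations(cases_path, locations_path, out_path):
    """Join case rows with coordinates by area name.

    Rows with no known location are left out of the output file, and
    their area names are returned so that new or placeholder areas
    can be reviewed by hand.
    """
    with open(locations_path, newline="", encoding="utf-8") as f:
        coords = {
            row["Area"].strip().lower(): (row["Lat"], row["Long"])
            for row in csv.DictReader(f)
        }

    unmatched = set()
    with open(cases_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=[*reader.fieldnames, "Lat", "Long"])
        writer.writeheader()
        for row in reader:
            key = row["Area"].strip().lower()
            if key not in coords:
                unmatched.add(row["Area"].strip())
                continue
            row["Lat"], row["Long"] = coords[key]
            writer.writerow(row)
    return sorted(unmatched)
```

Calling `merge_locations("cases.csv", "locations.csv", "cases-with-locations.csv")` (hypothetical file names) would write the merged file and return the area names still waiting for coordinates, which is one way to notice new areas such as Derry and Strabane as they appear.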

timday commented 4 years ago

"Running a server" would seem to be overkill for some simple reformatting/cleaning up/merging of CSV file data into something your browser finds a bit easier to digest. I've not played with them myself (beyond the way GitHub Pages uses them), but I get the impression GitHub Actions can be used to regularly (cron scheduler style) run some Python that takes some data from somewhere and puts some results somewhere else (pushing them to your own repo, for example).