ulklc / covid19-timeseries

Covid19 timeseries data store
MIT License

Recommended alternative? #28

Open iandoug opened 3 years ago

iandoug commented 3 years ago

Hi

Anyone able to recommend an alternative data feed?

Thanks, Ian

rvneil commented 3 years ago

https://covid.ourworldindata.org/data/owid-covid-data.csv has everything in one file. I guess I'm going to have to switch to this source, as this repo seems to be completely dead.

https://github.com/CSSEGISandData has some data split out by province/state (for Canada, Australia, China, etc), which I didn't want. I suppose I could code something to come up with totals for those countries.
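Coming up with country totals from province/state rows could look roughly like the sketch below. It assumes the wide layout with `Province/State`, `Country/Region`, `Lat`, `Long` columns followed by one column per date; the sample rows and numbers are made up for illustration, and `country_totals` is a hypothetical helper, not part of any of the repos mentioned.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample in the assumed wide layout; numbers are illustrative only.
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
Ontario,Canada,51.25,-85.32,1,2
Quebec,Canada,52.94,-73.55,3,5
,France,46.23,2.21,7,8
"""

def country_totals(csv_text):
    """Sum every date column across all rows that share a Country/Region."""
    reader = csv.DictReader(io.StringIO(csv_text))
    skip = {"Province/State", "Country/Region", "Lat", "Long"}
    date_cols = [c for c in reader.fieldnames if c not in skip]
    # One zero-filled dict of date columns per country, created on first use.
    totals = defaultdict(lambda: dict.fromkeys(date_cols, 0))
    for row in reader:
        for col in date_cols:
            # Treat an empty cell as zero (an assumption about missing values).
            totals[row["Country/Region"]][col] += int(row[col] or 0)
    return dict(totals)

totals = country_totals(SAMPLE)
print(totals["Canada"])  # {'1/22/20': 4, '1/23/20': 7}
```

Countries that already have a single row pass through unchanged, so the same function can be run over the whole file.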

iandoug commented 3 years ago

I think I'm going to use this source.. https://github.com/datasets/covid-19

In particular, probably https://datahub.io/core/covid-19/r/countries-aggregated.csv, since I don't need province/state data. It is linked from https://datahub.io/core/covid-19

I had issues with the Johns Hopkins data before, which is why I switched to this repo. My suggestion claims to have cleaned up the messy bits in the JH data.

Need to update my loader program and deal with possible country name issues today.

Cheers, Ian

kallewoof commented 3 years ago

FWIW, I made a (very simple) converter, https://github.com/kallewoof/covid19-csv-converter, between the old format (Johns Hopkins IIRC) and this one. I will probably add another mode for the covid.ourworldindata.org variant soon, since this one also seems to have gone under.

iandoug commented 3 years ago

> I think I'm going to use this source.. https://github.com/datasets/covid-19

Looks like even after their clean-up, there are still strange bumps in the data. Guess I will just have to live with it.

Cheers, Ian

kallewoof commented 3 years ago

@iandoug I'm not super happy with the owid dataset, so I am probably going to switch to the datasets one. Could you work around the strange bumps by using this dataset and append only the missing data?
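A minimal sketch of the "append only the missing data" idea, assuming both sources have already been loaded into rows with `country`, `date` and `confirmed` fields and that country names already match. The field names, the `append_missing` helper and the sample numbers are all hypothetical.

```python
from datetime import date

def append_missing(old_rows, new_rows):
    """Keep every old row; add new-source rows dated after the old data ends."""
    # Latest date the old data set has for each country.
    last_seen = {}
    for r in old_rows:
        if r["country"] not in last_seen or r["date"] > last_seen[r["country"]]:
            last_seen[r["country"]] = r["date"]
    merged = list(old_rows)
    for r in new_rows:
        # Countries absent from the old data are taken in full.
        if r["country"] not in last_seen or r["date"] > last_seen[r["country"]]:
            merged.append(r)
    merged.sort(key=lambda r: (r["country"], r["date"]))
    return merged

# Made-up numbers, for illustration only.
old = [
    {"country": "Myanmar", "date": date(2020, 12, 1), "confirmed": 100},
    {"country": "Myanmar", "date": date(2020, 12, 2), "confirmed": 110},
]
new = [
    {"country": "Myanmar", "date": date(2020, 12, 2), "confirmed": 112},
    {"country": "Myanmar", "date": date(2020, 12, 3), "confirmed": 125},
]
merged = append_missing(old, new)
print([(r["date"].isoformat(), r["confirmed"]) for r in merged])
# [('2020-12-01', 100), ('2020-12-02', 110), ('2020-12-03', 125)]
```

The old figure wins wherever both sources cover the same day, so only the tail comes from the new source. Any difference in how the two sources define a day would still show up as a step at the join.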

iandoug commented 3 years ago

> @iandoug I'm not super happy with the owid dataset, so I am probably going to switch to the datasets one. Could you work around the strange bumps by using this dataset and append only the missing data?

Mmmnnn.. that's an idea I didn't think of.

I'm a bit reluctant though, because THIS repo used end-of-day around midnight GMT (or maybe 2am, I never could figure it out; I fetched at 4am GMT) and datasets/Johns Hopkins uses (I think) midnight Eastern Standard Time as their cut-off point. So "cases on 2020-xx-yy" is going to differ between the two sets, making a merge tricky.

I see "datasets" has not updated since yesterday, and there are several closed tickets on their repo about it NOT updating in the past, so that's a bit worrying in terms of reliability. I switched from JH data long ago because they had so many issues and kept changing their file layouts etc.

Regarding the bumps, given the number of sites using datasets data, you'd think they would have sorted it out by now. :-(

Let me ponder your idea a bit more.

I had to fix these country names between this repo, datasets, and my own names; your fix list may or may not be similar.

    "Korea, South"  :  South Korea  (annoying, that one)
    Burma  :  Myanmar
    Cabo Verde  :  Cape Verde
    China  :  Mainland China
    Congo (Brazzaville)  :  Congo
    Congo (Kinshasa)  :  DRC
    Cote d'Ivoire  :  Cote d’Ivoire
    Eswatini  :  eSwatini
    Holy See  :  Vatican
    Kazakhstan  :  Kazakstan
    Kyrgyzstan  :  Kyrgystan
    Taiwan*  :  Taiwan
    US  :  United States
    West Bank and Gaza  :  Palestine
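In a loader, a fix list like this could be applied with a plain lookup table. The sketch below copies the pairs above verbatim (left-hand name in, right-hand name out); `NAME_FIXES` and `normalize_country` are hypothetical names, and reading the file with Python's `csv` module is an assumption about the loader.

```python
import csv

# Left-hand spelling from the list above -> right-hand spelling.
NAME_FIXES = {
    "Korea, South": "South Korea",
    "Burma": "Myanmar",
    "Cabo Verde": "Cape Verde",
    "China": "Mainland China",
    "Congo (Brazzaville)": "Congo",
    "Congo (Kinshasa)": "DRC",
    "Cote d'Ivoire": "Cote d’Ivoire",
    "Eswatini": "eSwatini",
    "Holy See": "Vatican",
    "Kazakhstan": "Kazakstan",
    "Kyrgyzstan": "Kyrgystan",
    "Taiwan*": "Taiwan",
    "US": "United States",
    "West Bank and Gaza": "Palestine",
}

def normalize_country(name):
    """Return the preferred spelling, or the trimmed name if no fix is listed."""
    name = name.strip()
    return NAME_FIXES.get(name, name)

# A CSV parser keeps the quoted "Korea, South" as one field despite the comma.
fields = next(csv.reader(['"Korea, South",123']))
print(fields)                        # ['Korea, South', '123']
print(normalize_country(fields[0]))  # South Korea
```

Names without an entry pass through untouched, so the table only needs the exceptions.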

Cheers, Ian

kallewoof commented 3 years ago

Hi Ian,

Yeah, I think I followed your exact footsteps. It's still a rough proof of concept, but I have a tool to convert between these here: https://github.com/kallewoof/csvman

To get the github.com/datasets/covid-19.git data set into the ulklc format, clone the above, then:

g++ -O3 -std=c++11 parser/*.cpp *.cpp -o compile
./compile formulas/covid-19/gds.cmf GDSDIR/time-series-19-covid-combined.csv -f formulas/covid-19/ulklc.cmf result.csv

It's still a WIP but yeah, it supports fixing names and such manually. I've got part of the ones you listed but will add the others.

Also, not sure what you mean by the dates being 1 off -- are the actual dates in the file showing for one day earlier/later depending on the set??

Edit: I don't see several of the country name differences that you are listing (e.g. both this repo and the datasets/covid-19 one use "China", "Kyrgyzstan", "Kazakhstan", ...).

iandoug commented 3 years ago

> Also, not sure what you mean by the dates being 1 off -- are the actual dates in the file showing for one day earlier/later depending on the set??

It depends on when countries release their figures, and when the various sites process the numbers. eg

site 1: day ends at midnight GMT
site 2: day ends at 6am GMT

so figures released at 2am GMT are going to be on different days in each data set.
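The effect can be reproduced with a few lines. This is only a sketch: `reporting_date` is a hypothetical helper, the release timestamp is arbitrary, and it assumes each site files a figure under the date on which its own "day" started.

```python
from datetime import datetime, timedelta, timezone

def reporting_date(released_utc, day_ends_at_hour):
    """Date a site files a figure under, given the GMT hour at which its day ends."""
    # Shifting back by the cut-off hour maps anything before the cut-off
    # onto the previous calendar date.
    return (released_utc - timedelta(hours=day_ends_at_hour)).date()

# An arbitrary figure released at 2am GMT.
released = datetime(2021, 1, 10, 2, 0, tzinfo=timezone.utc)

site1 = reporting_date(released, 0)  # day ends at midnight GMT
site2 = reporting_date(released, 6)  # day ends at 6am GMT
print(site1, site2)  # 2021-01-10 2021-01-09
```

The same release lands on two different dates, which is why a row-by-row comparison or merge of the two sets can be off by one day.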

The datasets data is a mess around 13-14 December because the Turkey figure is wrong. I did raise it as an issue, but it looks like a can't-fix/won't-fix because that's what they get from JH, which is exactly the kind of reason I stopped using JH in the first place.

What also bothers me is the huge discrepancy between their numbers and Worldometer, e.g. yesterday WoM had 89,343,183 and datasets had 88,860,500, about half a million fewer. There used to be a difference of around 10-40k before, which I accepted as end-of-day differences.

Still hoping ulklc will resurface.

Cheers, Ian