nevrome / covid19germany

R package - Load, visualise and analyse daily updated data on the COVID-19 outbreak in Germany

Data download is unreliable and sometimes (!) yields incomplete data #32

Open arne1921KF opened 3 years ago

arne1921KF commented 3 years ago

Today (2020-01-11), the time series data downloaded via the usual get_RKI_timeseries() with the default parameter url = "https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv" contains only some data from Hamburg, Schleswig-Holstein and Niedersachsen.

The page https://hub.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0 says they are currently changing the download options, and that https://www.arcgis.com/home/item.html?id=f10774f1c63e40168479a1feb6c7ca74 should be used for now.

The download link there is currently hidden on the page behind the links/buttons.

nevrome commented 3 years ago

@stschiff already observed a similar issue last week. It resolved itself overnight. Maybe we will have to switch to the alternative download option eventually, but for now I suggest waiting once more.

nevrome commented 3 years ago

So right now it seems to work again:

> rki_timeseries <- get_RKI_timeseries()
> unique(rki_timeseries$Bundesland)
 [1] "Brandenburg"            "Bayern"                
 [3] "Niedersachsen"          "Nordrhein-Westfalen"   
 [5] "Baden-Württemberg"      "Saarland"              
 [7] "Rheinland-Pfalz"        "Schleswig-Holstein"    
 [9] "Hessen"                 "Hamburg"               
[11] "Bremen"                 "Sachsen"               
[13] "Thüringen"              "Berlin"                
[15] "Mecklenburg-Vorpommern" "Sachsen-Anhalt"

arne1921KF commented 3 years ago

....and gone again. Now they changed something in the data itself, it seems. I get parsing failures. Looks like the date columns changed. That breaks your code.

I hate it when data providers do this.

nevrome commented 3 years ago

Hm - can't confirm right now. Seems to work again.

But I get the feeling this download feature breaks multiple times a day. Maybe it's because the file grew to more than 55 MB and the way we download it is just not suitable any more.

Maybe we should copy it automatically to an extra branch here on github once a day and point the default path of get_RKI_timeseries to our mirror.
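If the file size really is part of the problem (a guess, not something confirmed in this thread), one option is to fetch the CSV to a local file first, with a longer timeout and a few retries, before parsing it. The following is only a minimal sketch in base R: `download_with_retry()` and its parameters are hypothetical and not part of covid19germany.

```r
# Hypothetical helper, NOT part of covid19germany: download a file with a
# longer timeout and a few retries, and fail loudly if nothing usable arrives.
download_with_retry <- function(url, destfile, attempts = 3, timeout = 600,
                                min_bytes = 1) {
  # Base R's default timeout is 60 seconds; raise it for this call only.
  old <- options(timeout = timeout)
  on.exit(options(old), add = TRUE)

  for (i in seq_len(attempts)) {
    # download.file() returns 0 on success; treat errors and warnings as failure.
    status <- tryCatch(
      download.file(url, destfile, mode = "wb", quiet = TRUE),
      error = function(e) 1L,
      warning = function(w) 1L
    )
    ok <- identical(as.integer(status), 0L) &&
      file.exists(destfile) &&
      file.size(destfile) >= min_bytes
    if (ok) {
      return(invisible(destfile))
    }
    Sys.sleep(2^i)  # simple exponential back-off before the next attempt
  }
  stop("Download failed after ", attempts, " attempts: ", url)
}

# Usage (commented out because it needs network access):
# csv_file <- tempfile(fileext = ".csv")
# download_with_retry(
#   "https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv",
#   csv_file
# )
```

The `min_bytes` threshold is an assumption as well. It only catches empty or obviously tiny files, so a truncated file that is still large would pass this check.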

arne1921KF commented 3 years ago

Aaaaand dead again. Only Schleswig-Holstein is present in the time series. It was already like this at 5 am, when my bot tried to pull the current data, and it is still the case at 9 am.

A git of the data would be rad. I seriously would like to know why the RKI isn't doing this themselves: just pushing the data to github, as soon as it is in. Like that, the dataset would even be transparent for monitoring changes directly using versioning.
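For a bot that pulls the data on a schedule, a cheap guard against truncated downloads like the ones described above is to check that all 16 federal states show up in the Bundesland column before using the result. A minimal sketch in base R follows. `missing_states()` is a hypothetical helper, and the state names are assumed to be spelled as in the `unique(rki_timeseries$Bundesland)` output earlier in this thread.

```r
# The 16 German federal states, spelled as in the Bundesland column shown above.
expected_states <- c(
  "Baden-Württemberg", "Bayern", "Berlin", "Brandenburg", "Bremen", "Hamburg",
  "Hessen", "Mecklenburg-Vorpommern", "Niedersachsen", "Nordrhein-Westfalen",
  "Rheinland-Pfalz", "Saarland", "Sachsen", "Sachsen-Anhalt",
  "Schleswig-Holstein", "Thüringen"
)

# Hypothetical helper, NOT part of covid19germany: returns the states that are
# expected but absent from a downloaded time series.
missing_states <- function(timeseries, expected = expected_states) {
  setdiff(expected, unique(timeseries$Bundesland))
}

# Toy data frame standing in for a truncated download.
truncated <- data.frame(Bundesland = c("Schleswig-Holstein", "Schleswig-Holstein"))
missing <- missing_states(truncated)
if (length(missing) > 0) {
  message("Incomplete download, missing: ", paste(missing, collapse = ", "))
}
```

With real data the check would be `missing_states(get_RKI_timeseries())`. It only detects states that are absent entirely, not states with partial data.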

nevrome commented 3 years ago

I merged #34 now to permanently enable the download from the alternative source. This seems to be more reliable.