psteinb / covid19-curve-your-city

Extrapolation der COVID19 Fallzahlen
BSD 3-Clause "New" or "Revised" License
9 stars 5 forks source link

Check and validate npgeo-corona-npgeo-de.hub.arcgis.com #7

Open psteinb opened 4 years ago

psteinb commented 4 years ago

I've been pointed to this webseite multiple times: https://npgeo-corona-npgeo-de.hub.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0/data

I found that the numbers here don't match the numbers as described in #6 which I so far use by default. For example, going to the display by county https://npgeo-corona-npgeo-de.hub.arcgis.com/datasets/917fc37a709542548cc3be077a786c17_0 Downloading the spreadsheet gives a csv file with 36 columns!

$ csvstat -n RKI_Corona_Landkreise.csv
  1: OBJECTID
  2: ADE
  3: GF
  4: BSG
  5: RS
  6: AGS
  7: SDV_RS
  8: GEN
  9: BEZ
 10: IBZ
 11: BEM
 12: NBD
 13: SN_L
 14: SN_R
 15: SN_K
 16: SN_V1
 17: SN_V2
 18: SN_G
 19: FK_S3
 20: NUTS
 21: RS_0
 22: AGS_0
 23: WSK
 24: EWZ
 25: KFL
 26: DEBKG_ID
 27: Shape__Area
 28: Shape__Length
 29: death_rate
 30: cases
 31: deaths
 32: cases_per_100k
 33: cases_per_population
 34: BL
 35: BL_ID
 36: county

The metadata for this table is a bit lengthy and hard to digest: https://www.arcgis.com/sharing/rest/content/items/917fc37a709542548cc3be077a786c17/info/metadata/metadata.xml?format=default&output=html

Digging through this I could extract:

$ csvgrep -c36 -m 'Dresden' RKI_Corona_Landkreise.csv|csvcut -c1,8,9,30-36 |csvlook
| OBJECTID  | GEN     | BEZ              | cases | deaths | cases_per_100k | cases_per_population | BL      | BL_ID | county     |
| --------- | ------- | ---------------- | ----- | ------ | -------------- | -------------------- | ------- | ----- | ---------- |
|       357 | Dresden | Kreisfreie Stadt |    99 |  False |        17.849… |               0.018… | Sachsen |    14 | SK Dresden |

So today (Mar 23, 2020) this would be 99 cases which is the number between Mar 20 97 and Mar 21 115. So for me this gives:

psteinb commented 4 years ago

People pointed me to this variant of the same dataset https://npgeo-corona-npgeo-de.hub.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0/data this contains updated numbers by age group (which is nice). But I am unclear how to aggregate the number of cases today?

IdBundesland Bundesland Landkreis Altersgruppe Geschlecht AnzahlFall AnzahlTodesfall ObjectId Meldedatum IdLandkreis Datenstand
14 Sachsen SK Dresden A00-A04 M 1 False 182,113 2020-03-13 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A00-A04 M 2 False 182,114 2020-03-19 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A05-A14 M 1 False 182,115 2020-03-17 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 M 1 False 182,116 2020-03-09 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 M 1 False 182,117 2020-03-10 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 M 1 False 182,118 2020-03-12 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 M 1 False 182,119 2020-03-14 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 M 2 False 182,120 2020-03-16 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 M 3 False 182,121 2020-03-17 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 M 3 False 182,122 2020-03-18 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 M 3 False 182,123 2020-03-19 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 M 4 False 182,124 2020-03-20 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 M 1 False 182,125 2020-03-21 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 W 1 False 182,126 2020-03-11 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 W 1 False 182,127 2020-03-12 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 W 1 False 182,128 2020-03-16 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 W 2 False 182,129 2020-03-17 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 W 5 False 182,130 2020-03-19 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 W 2 False 182,131 2020-03-20 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A15-A34 W 3 False 182,132 2020-03-21 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 M 1 False 182,133 2020-03-11 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 M 3 False 182,134 2020-03-12 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 M 4 False 182,135 2020-03-13 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 M 1 False 182,136 2020-03-14 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 M 1 False 182,137 2020-03-16 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 M 2 False 182,138 2020-03-17 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 M 1 False 182,139 2020-03-18 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 M 9 False 182,140 2020-03-19 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 M 5 False 182,141 2020-03-20 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 M 5 False 182,142 2020-03-21 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 W 1 False 182,143 2020-03-13 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 W 4 False 182,144 2020-03-14 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 W 1 False 182,145 2020-03-16 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 W 2 False 182,146 2020-03-17 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 W 1 False 182,147 2020-03-18 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 W 3 False 182,148 2020-03-19 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 W 1 False 182,149 2020-03-20 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A35-A59 W 5 False 182,150 2020-03-21 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A60-A79 M 1 False 182,151 2020-03-07 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A60-A79 M 1 False 182,152 2020-03-13 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A60-A79 M 1 False 182,153 2020-03-18 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A60-A79 M 2 False 182,154 2020-03-20 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A60-A79 W 1 False 182,155 2020-03-07 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A60-A79 W 1 False 182,156 2020-03-21 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A80+ M 1 False 182,157 2020-03-17 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A80+ W 1 False 182,158 2020-03-19 00:00:00+00:00 14,612 23.03.2020 00:00
14 Sachsen SK Dresden A80+ W 1 False 182,159 2020-03-20 00:00:00+00:00 14,612 23.03.2020 00:00

At least this one yields more recent numbers with today's date. Ideas?

psteinb commented 4 years ago

how to filter this data https://npgeo-corona-npgeo-de.hub.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0/data

elcorto commented 4 years ago

There are some projects started in the wirvsvirus hackathon dealing with data. Thanks for the hint @frzb!

The ones I checked use the RKI data you mention. For example this provides Python and R code (github) or that one in R (github).

However the data and APIs are weird. Here you can configure a custom query URL, which is nice, but the returned data doesn't contain all federal states (e.g. not Sachsen, but Hamburg etc). If instead we use this and either download a csv or json file, the data contains all states. See update in this comment below.

The second R project from above looks decent, so you can check out how they do it. I don't speak R, but in Python I'd do something like this to aggregate the data:

>>> import pandas as pd
>>> df=pd.read_csv('https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv')
   IdBundesland          Bundesland     Landkreis Altersgruppe Geschlecht  AnzahlFall  AnzahlTodesfall  ObjectId                Meldedatum  IdLandkreis        Datenstand  NeuerFall  NeuerTodesfall
0             1  Schleswig-Holstein  SK Flensburg      A15-A34          M           1                0    207841  2020-03-14T00:00:00.000Z         1001  24.03.2020 00:00          0               0
1             1  Schleswig-Holstein  SK Flensburg      A15-A34          M           2                0    207842  2020-03-19T00:00:00.000Z         1001  24.03.2020 00:00          0               0
2             1  Schleswig-Holstein  SK Flensburg      A15-A34          M           1                0    207843  2020-03-21T00:00:00.000Z         1001  24.03.2020 00:00          0               0
3             1  Schleswig-Holstein  SK Flensburg      A15-A34          W           1                0    207844  2020-03-14T00:00:00.000Z         1001  24.03.2020 00:00          0               0
4             1  Schleswig-Holstein  SK Flensburg      A15-A34          W           2                0    207845  2020-03-18T00:00:00.000Z         1001  24.03.2020 00:00          0               0

>>> pd.to_datetime(df.Datenstand).unique()
array(['2020-03-24T00:00:00.000000000'], dtype='datetime64[ns]')

>>> df[df.Landkreis.str.contains('Dresden')].head()
      IdBundesland Bundesland   Landkreis Altersgruppe Geschlecht  AnzahlFall  AnzahlTodesfall  ObjectId                Meldedatum  IdLandkreis        Datenstand  NeuerFall  NeuerTodesfall
7041            14    Sachsen  SK Dresden      A00-A04          M           1                0    219882  2020-03-13T00:00:00.000Z        14612  24.03.2020 00:00          0               0
7042            14    Sachsen  SK Dresden      A00-A04          M           2                0    219883  2020-03-19T00:00:00.000Z        14612  24.03.2020 00:00          0               0
7043            14    Sachsen  SK Dresden      A05-A14          M           1                0    219884  2020-03-17T00:00:00.000Z        14612  24.03.2020 00:00          0               0
7044            14    Sachsen  SK Dresden      A15-A34          M           1                0    219885  2020-03-09T00:00:00.000Z        14612  24.03.2020 00:00          0               0
7045            14    Sachsen  SK Dresden      A15-A34          M           1                0    219886  2020-03-10T00:00:00.000Z        14612  24.03.2020 00:00          0               0

>>> cols=['Meldedatum', 'AnzahlFall']
>>> df=df[df.Landkreis.str.contains('Dresden')][cols]
>>> df.Meldedatum = pd.to_datetime(df.Meldedatum)
>>> df = df.sort_values('Meldedatum')
>>> df.head()
                    Meldedatum  AnzahlFall
7083 2020-03-07 00:00:00+00:00           1
7079 2020-03-07 00:00:00+00:00           1
7044 2020-03-09 00:00:00+00:00           1
7045 2020-03-10 00:00:00+00:00           1
7061 2020-03-11 00:00:00+00:00           1

>>> ts=df.groupby('Meldedatum')['AnzahlFall'].sum()
>>> ts
Meldedatum
2020-03-07 00:00:00+00:00     2
2020-03-09 00:00:00+00:00     1
2020-03-10 00:00:00+00:00     1
2020-03-11 00:00:00+00:00     2
2020-03-12 00:00:00+00:00     5
2020-03-13 00:00:00+00:00     7
2020-03-14 00:00:00+00:00     6
2020-03-16 00:00:00+00:00     5
2020-03-17 00:00:00+00:00    11
2020-03-18 00:00:00+00:00     6
2020-03-19 00:00:00+00:00    23
2020-03-20 00:00:00+00:00    15
2020-03-21 00:00:00+00:00    15
Name: AnzahlFall, dtype: int64

>>> ts.cumsum()
Meldedatum
2020-03-07 00:00:00+00:00     2
2020-03-09 00:00:00+00:00     3
2020-03-10 00:00:00+00:00     4
2020-03-11 00:00:00+00:00     6
2020-03-12 00:00:00+00:00    11
2020-03-13 00:00:00+00:00    18
2020-03-14 00:00:00+00:00    24
2020-03-16 00:00:00+00:00    29
2020-03-17 00:00:00+00:00    40
2020-03-18 00:00:00+00:00    46
2020-03-19 00:00:00+00:00    69
2020-03-20 00:00:00+00:00    84
2020-03-21 00:00:00+00:00    99
Name: AnzahlFall, dtype: int64

>>> ts.cumsum().plot()

That gives your plot. However, the last data point is from 2020-03-21, as in the data you show above, even though the column 'Datenstand' says 2020-03-24. Huh? Tried it just now, they updated the data, now including 2020-03-24. Seems like numbers might be dropping slowly .. fingers crossed!

psteinb commented 4 years ago

Thanks for reproducing this. I just pushed the code to do the same in R :-D I'll have a look at these wonderful repos tomorrow.

psteinb commented 4 years ago

Indeed https://github.com/nevrome/covid19germany looks very nice. I'll stick to this one for now in order to obtain the RKI data.

elcorto commented 4 years ago

Update: Custom query URLs seem to work correctly now. No need to download a 2 MB CSV file when the data we need is only 2 kB of json :)

# query only AnzahlFall,Meldedatum
url = "https://services7.arcgis.com/mOBPykOjAyBO2ZKk/arcgis/rest/services/RKI_COVID19/FeatureServer/0/query?where=UPPER(Landkreis)%20like%20'%25DRESDEN%25'&outFields=AnzahlFall,Meldedatum&outSR=4326&f=json"
d = requests.get(url).json()
df = pd.DataFrame([x['attributes'] for x in d['features']])
time_col = 'Meldedatum'
df[time_col] = pd.to_datetime(df[time_col], unit='ms')
ts = df.sort_values(time_col).groupby(time_col)['AnzahlFall'].sum()
ts_cs = ts.cumsum().resample('1d').ffill()
psteinb commented 4 years ago

Thanksf for this code snippet. If you want, feel free to send a PR with script that takes the name of the Landkreis and dumps a csv. We could use this to validate the R code or vice versa.