thegraphnetwork / epigraphhub_py

Epigraphhub Python package
GNU General Public License v3.0

fix(foph): add tests for foph and weekly data version #215

Closed luabida closed 1 year ago

luabida commented 1 year ago

Description

Add FOPH tests and a weekly data format feature.

Related Issue

https://github.com/thegraphnetwork/EpiGraphHub/issues/184

Checklist

Note: The title prefix can also be expanded with more information inside the parentheses. Some examples for a full title (prefix + message):

luabida commented 1 year ago

Example of metadata extraction. What do you think, @fccoelho?

In [1]: from epigraphhub.data.foph import extract

In [2]: extract.metadata()
Out[2]: 
                                                table                            filename                                        description  deprecated
0                                   DailyIncomingData              COVID19Cases_geoRegion     Daily record timelines by geoRegion for cases.        True
1                   DailyCasesVaccPersonsIncomingData            COVID19Cases_vaccpersons  Daily record timelines for cases of fully vacc...        True
2                                   DailyIncomingData               COVID19Hosp_geoRegion  Daily record timelines by geoRegion for hospit...        True
3                    DailyHospVaccPersonsIncomingData             COVID19Hosp_vaccpersons  Daily record timelines for hospitalisations of...        True
4                                   DailyIncomingData              COVID19Death_geoRegion    Daily record timelines by geoRegion for deaths.        True
..                                                ...                                 ...                                                ...         ...
64                WeeklyEpiAgeRangeSexRawIncomingData       COVID19EpiRawData_AKL10_sex_w  Combined Iso-Week record timelines for cases, ...        True
65                          PopulationAgeRangeSexData           PopulationAgeRangeSexData  Population data by geoRegion (cantons + FL), a...        True
66                       WasteWaterDailyViralLoadData                COVID19Wastewater_vl  Daily waste water viral load measurement recor...       False
67                        WasteWaterViralLoadOverview        COVID19Wastewater_vlOverview  Overview of the current percentile & differenc...       False
68  WeeklyHospBreakthroughVaccPersonsAgeRangeIncom...  COVID19Hosp_vaccpersons_AKL10_w_v2  Weekly record timeline for hospitalisations by...       False

[69 rows x 4 columns]

In [3]: extract.metadata(table='DailyIncomingData')
Out[3]: 
                                             DailyIncomingData
description                                  DailyIncomingData
properties   {'anteil_pos_14': {'description': 'Average per...
required     [datum, entries_diff_last_age, geoRegion, pop,...
title                                        DailyIncomingData
type                                                    object
enum                                                       NaN

luabida commented 1 year ago

In [10]: extract.metadata(table='DailyIncomingData').loc['properties'][0]
Out[10]: 
{'anteil_pos_14': {'description': 'Average percentage of positive tests for the latest 14 days. Only applies to records with type "COVID19Test"',
  'maximum': 100,
  'minimum': 0,
  'title': 'anteil_pos_14',
  'type': 'number'},
 'anteil_pos_28': {'description': 'Average percentage of positive tests for the latest 28 days. Only applies to records with type "COVID19Test"',
  'maximum': 100,
  'minimum': 0,
  'title': 'anteil_pos_28',
  'type': 'number'},
 'anteil_pos_all': {'description': 'Average percentage of positive tests since the start of data recording. Only applies to records with type "COVID19Test"',
  'maximum': 100,
  'minimum': 0,
  'title': 'anteil_pos_all',
  'type': 'number'},
 'datum': {'description': 'Date of case for cases, date of hospital entry for hospitalisations and the date of death for deaths.\nThe date of case is constructed taking the earliest available of:\nthe date of swab, the date of test, and the date the report arrived at the FOPH.\nGenerally, the date of case corresponds to the date the swab was taken.\n\nFormatted as ISO-Date 2020-10-12',
  'title': 'datum',
  'type': 'string'},
 'entries': {'description': 'Absolute number of occurrences for this day.',
  'title': 'entries',
  'type': 'number'},
 'entries_diff_last': {'description': 'Total difference of occurrences betweens the current and previously published data version.\n\nFrom 05.08 onwards: only the data of the last 28d is considered this calculation so it better reflects the current epidemiologic situation\n\nIs the same for every daily record of the same geoRegion.',
  'title': 'entries_diff_last',
  'type': 'number'},
 'entries_diff_last_age': {'description': 'Number of days since the previous data version had been published.\n\nOn mondays this will typically be 3, on all other days 1.',
  'title': 'entries_diff_last_age',
  'type': 'number'},
 'entries_letzter_stand': {'description': 'Absolute number of occurrences for this day as reported by the previous published data version.',
  'title': 'entries_letzter_stand',
  'type': 'number'},
 'entries_neg': {'description': 'Absolute number of occurences of negative tests. Only applies to records with type "COVID19Test"',
  'title': 'entries_neg',
  'type': 'number'},
 'entries_neu_gemeldet': {'description': 'Newly reported occurrences of the current data version (entries - entries_letzter_stand) for this day.',
  'title': 'entries_neu_gemeldet',
  'type': 'number'},
 'entries_pos': {'description': 'Absolute number of occurences of positive tests. Only applies to records with type "COVID19Test"',
  'title': 'entries_pos',
  'type': 'number'},
 'geoRegion': {'$ref': 'https://covid19.admin.ch#/definitions/GeoUnit',
  'title': 'geoRegion'},
 'inz_entries': {'description': 'Incidence value for this day.',
  'title': 'inz_entries',
  'type': 'number'},
 'inzdelta7d': {'description': 'Difference between the "inz_entries" value of this record compared to the value 7 days before.',
  'title': 'inzdelta7d',
  'type': 'number'},
 'inzmean7d': {'description': 'Rolling 7 day average (-3 days/ +3 days from the current day) of the "inz_entries" values.',
  'title': 'inzmean7d',
  'type': 'number'},
 'inzsum14d': {'description': 'Sum of the "inz_entries" values for the last 14 days (rolling).',
  'title': 'inzsum14d',
  'type': 'number'},
 'inzsum7d': {'description': 'Sum of the "inz_entries" values for the last 7 days (rolling).',
  'title': 'inzsum7d',
  'type': 'number'},
 'inzsumTotal': {'description': 'Sum of the "inz_entries" values since the start of data recording.',
  'title': 'inzsumTotal',
  'type': 'number'},
 'inzsumTotal_last14d': {'description': 'Sum of the "inz_entries" values since the beginning of the latest 14 days timeframe.\n\nStarts at 0 for the first day of the timeframe. Only available on records marked as belonging to the timeframe (timeframe_14d=true).',
  'title': 'inzsumTotal_last14d',
  'type': 'number'},
 'inzsumTotal_last28d': {'description': 'Sum of the "inz_entries" values since the beginning of the latest 28 days timeframe.\n\nStarts at 0 for the first day of the timeframe. Only available on records marked as belonging to the timeframe (timeframe_28d=true).',
  'title': 'inzsumTotal_last28d',
  'type': 'number'},
 'inzsumTotal_last7d': {'description': 'Sum of the "inz_entries" values since the beginning of the latest 7 days timeframe.\n\nStarts at 0 for the first day of the timeframe. Only available on records marked as belonging to the timeframe (timeframe_7d=true).',
  'title': 'inzsumTotal_last7d',
  'type': 'number'},
 'mean7d': {'description': 'Rolling 7 day average (-3 days/ +3 days from the current day) of the "entries" values.',
  'title': 'mean7d',
  'type': 'number'},
 'nachweismethode': {'description': 'Test type used. Only applies to records with type "Covid19Test".',
  'enum': ['Antigen_Schnelltest', 'PCR', 'all'],
  'title': 'nachweismethode',
  'type': 'string'},
 'offset_last14d': {'description': 'Sum of the "entries" values up until the start of the latest 14 day timeframe.',
  'title': 'offset_last14d',
  'type': 'number'},
 'offset_last28d': {'description': 'Sum of the "entries" values up until the start of the latest 28 day timeframe.',
  'title': 'offset_last28d',
  'type': 'number'},
 'offset_last7d': {'description': 'Sum of the "entries" values up until the start of the latest 7 day timeframe.',
  'title': 'offset_last7d',
  'type': 'number'},
 'pop': {'description': 'Population number for the given geoRegion.',
  'title': 'pop',
  'type': 'number'},
 'pos_anteil': {'description': 'Percentage of positive tests. Only applies to records with type "COVID19Test"',
  'maximum': 100,
  'minimum': 0,
  'title': 'pos_anteil',
  'type': 'number'},
 'pos_anteil_mean7d': {'description': 'Rolling 7 day average (-3 days/ +3 days from the current day) of the "pos_anteil" values.',
  'title': 'pos_anteil_mean7d',
  'type': 'number'},
 'sum14d': {'description': 'Sum of the "entries" values for the last 14 days (rolling).',
  'title': 'sum14d',
  'type': 'number'},
 'sum7d': {'description': 'Sum of the "entries" values for the last 7 days (rolling).',
  'title': 'sum7d',
  'type': 'number'},
 'sumTotal': {'description': 'Sum of the "entries" values since the start of data recording.',
  'title': 'sumTotal',
  'type': 'number'},
 'sumTotal_last14d': {'description': 'Sum of the "entries" values since the beginning of the latest 14 days timeframe.\n\nStarts at 0 for the first day of the timeframe. Only available on records marked as belonging to the timeframe (timeframe_14d=true).',
  'title': 'sumTotal_last14d',
  'type': 'number'},
 'sumTotal_last28d': {'description': 'Sum of the "entries" values since the beginning of the latest 28 days timeframe.\n\nStarts at 0 for the first day of the timeframe. Only available on records marked as belonging to the timeframe (timeframe_28d=true).',
  'title': 'sumTotal_last28d',
  'type': 'number'},
 'sumTotal_last7d': {'description': 'Sum of the "entries" values since the beginning of the latest 7 days timeframe.\n\nStarts at 0 for the first day of the timeframe. Only available on records marked as belonging to the timeframe (timeframe_7d=true).',
  'title': 'sumTotal_last7d',
  'type': 'number'},
 'sumdelta7d': {'description': 'Difference between the "entries" value of this record compared to the value 7 days before.',
  'title': 'sumdelta7d',
  'type': 'number'},
 'timeframe_2w': {'description': 'Indicator if the record belongs to the latest 2 weeks timeframe. Current day is usually excluded.',
  'title': 'timeframe_2w',
  'type': 'boolean'},
 'timeframe_4w': {'description': 'Indicator if the record belongs to the latest 4 weeks timeframe. Current day is usually excluded.',
  'title': 'timeframe_4w',
  'type': 'boolean'},
 'timeframe_all': {'description': 'Indicator if the record belongs to the timeframe for all available data\n\nStarts at 24.02.2020 (Iso-Week 2020-09) for COVID19Cases, COVID19Hosp and COVID19Death types\nStarts at 24.01.2020 for COVID19Test type for CH+FL data combined and from 23.05.2020 for data by canton/CH/FL individually\nStarts at 30.03.2020 for COVID19HospCapacity type',
  'title': 'timeframe_all',
  'type': 'boolean'},
 'type': {'$ref': 'https://covid19.admin.ch#/definitions/IncomingDataType',
  'title': 'type'},
 'version': {'description': 'Timestamp of data processing.\n\nformat YYYY-MM-DD_HH-mm-SS',
  'title': 'version',
  'type': 'string'}}

luabida commented 1 year ago

@fccoelho friendly ping about this PR

fccoelho commented 1 year ago

Not sure I get the table format. Is properties a JSON column?

luabida commented 1 year ago

> Not sure I get it. Is properties a JSON column?

Yes, these properties represent the SQL table schema in their docs. The normalized DataFrame version flattens every level of nesting into sub-columns. Do you think it is better to keep these table schemas as JSON or as a DataFrame?

In [32]: pd.json_normalize(data[d[17]])
Out[32]: 
                                            required                          title  ... properties.version.title properties.version.type
0  [data_expected, date, geoRegion, pop, primary_...  WeeklyDeathReasonIncomingData  ...                  version                  string

[1 rows x 47 columns]

In [33]: list(_)
Out[33]: 
['required',
 'title',
 'type',
 'properties.altersklasse_covid19.enum',
 'properties.altersklasse_covid19.title',
 'properties.altersklasse_covid19.type',
 'properties.data_expected.description',
 'properties.data_expected.title',
 'properties.data_expected.type',
 'properties.date.description',
 'properties.date.title',
 'properties.date.type',
 'properties.entries.description',
 'properties.entries.title',
 'properties.entries.type',
 'properties.geoRegion.$ref',
 'properties.geoRegion.title',
 'properties.inz_entries.description',
 'properties.inz_entries.title',
 'properties.inz_entries.type',
 'properties.inzsumTotal.description',
 'properties.inzsumTotal.title',
 'properties.inzsumTotal.type',
 'properties.pop.description',
 'properties.pop.title',
 'properties.pop.type',
 'properties.prct.description',
 'properties.prct.title',
 'properties.prct.type',
 'properties.prctSumTotal.description',
 'properties.prctSumTotal.maximum',
 'properties.prctSumTotal.minimum',
 'properties.prctSumTotal.title',
 'properties.prctSumTotal.type',
 'properties.primary_death_reason.description',
 'properties.primary_death_reason.enum',
 'properties.primary_death_reason.title',
 'properties.primary_death_reason.type',
 'properties.sumTotal.description',
 'properties.sumTotal.title',
 'properties.sumTotal.type',
 'properties.type.enum',
 'properties.type.title',
 'properties.type.type',
 'properties.version.description',
 'properties.version.title',
 'properties.version.type']

fccoelho commented 1 year ago

In the metadata format currently in use (you can see it on EGH), there are only three columns in the *_meta tables:

  1. name
  2. type
  3. description

What you are proposing is much more complex.
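
For comparison, the nested `properties` mapping printed earlier in this thread can be reduced to those three columns. The following is a minimal sketch in plain Python, assuming the schema shape seen in the `DailyIncomingData` output; `to_meta_rows` and the trimmed `sample` dict are hypothetical names used for illustration and are not part of epigraphhub.

```python
def to_meta_rows(properties):
    """Reduce a JSON-schema style ``properties`` mapping to name/type/description rows."""
    rows = []
    for name, spec in properties.items():
        rows.append(
            {
                "name": name,
                # Some columns (e.g. geoRegion) carry a "$ref" instead of a "type".
                "type": spec.get("type", spec.get("$ref")),
                # Columns without a description get None.
                "description": spec.get("description"),
            }
        )
    return rows


# Trimmed sample copied from the DailyIncomingData schema shown in this thread.
sample = {
    "entries": {
        "description": "Absolute number of occurrences for this day.",
        "title": "entries",
        "type": "number",
    },
    "geoRegion": {
        "$ref": "https://covid19.admin.ch#/definitions/GeoUnit",
        "title": "geoRegion",
    },
}

meta_rows = to_meta_rows(sample)
for row in meta_rows:
    print(row)
```

Falling back to `$ref` when `type` is missing is a design choice of this sketch; leaving the type empty for such columns would be equally defensible.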

luabida commented 1 year ago

@fccoelho sorry, I'm not sure I'm following here. The JSON has the title, type and description for each column in the table. The `properties` prefix in front of each column name in the normalized version appears because those fields are nested under the `properties` key in the JSON, and so on for deeper levels.
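
To illustrate where the `properties.` prefix comes from, here is a minimal standard-library sketch of the kind of flattening `pd.json_normalize` performs with its default `.` separator. The `flatten` helper and the trimmed `schema` dict are hypothetical and only mimic the shape of the FOPH schema; they are not the code used in this PR.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict whose keys are dot-joined paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the key path.
            flat.update(flatten(value, path))
        else:
            # Scalars and lists (e.g. "required", "enum") stay as leaf values.
            flat[path] = value
    return flat


# Hypothetical, heavily trimmed schema in the same shape as the FOPH tables.
schema = {
    "title": "WeeklyDeathReasonIncomingData",
    "required": ["date", "geoRegion"],
    "properties": {
        "version": {"title": "version", "type": "string"},
    },
}

flat_schema = flatten(schema)
print(sorted(flat_schema))
# ['properties.version.title', 'properties.version.type', 'required', 'title']
```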

fccoelho commented 1 year ago

OK, I get it now. The extraction seems fine, but I noticed that some columns carry more metadata, such as 'maximum' and 'minimum', besides the three basic descriptors.
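
One way to see which columns carry such extra keys is to subtract the three basic descriptors from each column's schema. This is a sketch under the assumption that the schema looks like the `DailyIncomingData` output above; `BASIC_KEYS`, `extra_metadata` and `sample` are hypothetical names.

```python
BASIC_KEYS = {"title", "type", "description"}


def extra_metadata(properties):
    """Return, per column, the schema keys beyond title/type/description."""
    extras = {}
    for name, spec in properties.items():
        extra = {key: value for key, value in spec.items() if key not in BASIC_KEYS}
        if extra:
            extras[name] = extra
    return extras


# Two columns copied from the DailyIncomingData schema in this thread.
sample = {
    "pos_anteil": {
        "description": 'Percentage of positive tests. Only applies to records with type "COVID19Test"',
        "maximum": 100,
        "minimum": 0,
        "title": "pos_anteil",
        "type": "number",
    },
    "pop": {
        "description": "Population number for the given geoRegion.",
        "title": "pop",
        "type": "number",
    },
}

extras = extra_metadata(sample)
print(extras)
```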

luabida commented 1 year ago

Should properties be a separate DataFrame, or is it OK to store it as JSON in a properties column?
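
As a rough illustration of the second option, the sketch below serializes a `properties` dict to JSON text, stores it in a single column, and reads it back. It uses an in-memory SQLite database purely for demonstration; the `table_meta` table and its column names are assumptions and do not reflect the actual EpiGraphHub database layout.

```python
import json
import sqlite3

# In-memory database used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_meta (table_name TEXT PRIMARY KEY, properties TEXT)"
)

# One column copied from the DailyIncomingData schema in this thread.
properties = {
    "pop": {
        "description": "Population number for the given geoRegion.",
        "title": "pop",
        "type": "number",
    }
}

# Store the whole nested schema as a JSON string in one column.
conn.execute(
    "INSERT INTO table_meta (table_name, properties) VALUES (?, ?)",
    ("DailyIncomingData", json.dumps(properties)),
)

# Read it back and decode it into the original nested dict.
stored = conn.execute(
    "SELECT properties FROM table_meta WHERE table_name = ?",
    ("DailyIncomingData",),
).fetchone()[0]
restored = json.loads(stored)
conn.close()

print(restored["pop"]["type"])
```

Keeping the schema as JSON preserves optional keys such as `maximum`, `minimum` and `enum` without adding sparse columns, at the cost of having to decode the column before querying individual fields.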

github-actions[bot] commented 1 year ago

:tada: This PR is included in version 2.0.5 :tada:

The release is available on:

Your semantic-release bot :package::rocket: