Closed luabida closed 1 year ago
Example of metadata extraction. What do you think, @fccoelho?
In [1]: from epigraphhub.data.foph import extract
In [2]: extract.metadata()
Out[2]:
    table                                              filename                            description                                        deprecated
 0  DailyIncomingData                                  COVID19Cases_geoRegion              Daily record timelines by geoRegion for cases.     True
 1  DailyCasesVaccPersonsIncomingData                  COVID19Cases_vaccpersons            Daily record timelines for cases of fully vacc...  True
 2  DailyIncomingData                                  COVID19Hosp_geoRegion               Daily record timelines by geoRegion for hospit...  True
 3  DailyHospVaccPersonsIncomingData                   COVID19Hosp_vaccpersons             Daily record timelines for hospitalisations of...  True
 4  DailyIncomingData                                  COVID19Death_geoRegion              Daily record timelines by geoRegion for deaths.    True
..  ...                                                ...                                 ...                                                ...
64  WeeklyEpiAgeRangeSexRawIncomingData                COVID19EpiRawData_AKL10_sex_w       Combined Iso-Week record timelines for cases, ...  True
65  PopulationAgeRangeSexData                          PopulationAgeRangeSexData           Population data by geoRegion (cantons + FL), a...  True
66  WasteWaterDailyViralLoadData                       COVID19Wastewater_vl                Daily waste water viral load measurement recor...  False
67  WasteWaterViralLoadOverview                        COVID19Wastewater_vlOverview        Overview of the current percentile & differenc...  False
68  WeeklyHospBreakthroughVaccPersonsAgeRangeIncom...  COVID19Hosp_vaccpersons_AKL10_w_v2  Weekly record timeline for hospitalisations by...  False
[69 rows x 4 columns]
In [3]: extract.metadata(table='DailyIncomingData')
Out[3]:
             DailyIncomingData
description  DailyIncomingData
properties   {'anteil_pos_14': {'description': 'Average per...
required     [datum, entries_diff_last_age, geoRegion, pop,...
title        DailyIncomingData
type         object
enum         NaN
In [10]: extract.metadata(table='DailyIncomingData').loc['properties'][0]
Out[10]:
{'anteil_pos_14': {'description': 'Average percentage of positive tests for the latest 14 days. Only applies to records with type "COVID19Test"',
'maximum': 100,
'minimum': 0,
'title': 'anteil_pos_14',
'type': 'number'},
'anteil_pos_28': {'description': 'Average percentage of positive tests for the latest 28 days. Only applies to records with type "COVID19Test"',
'maximum': 100,
'minimum': 0,
'title': 'anteil_pos_28',
'type': 'number'},
'anteil_pos_all': {'description': 'Average percentage of positive tests since the start of data recording. Only applies to records with type "COVID19Test"',
'maximum': 100,
'minimum': 0,
'title': 'anteil_pos_all',
'type': 'number'},
'datum': {'description': 'Date of case for cases, date of hospital entry for hospitalisations and the date of death for deaths.\nThe date of case is constructed taking the earliest available of:\nthe date of swab, the date of test, and the date the report arrived at the FOPH.\nGenerally, the date of case corresponds to the date the swab was taken.\n\nFormatted as ISO-Date 2020-10-12',
'title': 'datum',
'type': 'string'},
'entries': {'description': 'Absolute number of occurrences for this day.',
'title': 'entries',
'type': 'number'},
'entries_diff_last': {'description': 'Total difference of occurrences betweens the current and previously published data version.\n\nFrom 05.08 onwards: only the data of the last 28d is considered this calculation so it better reflects the current epidemiologic situation\n\nIs the same for every daily record of the same geoRegion.',
'title': 'entries_diff_last',
'type': 'number'},
'entries_diff_last_age': {'description': 'Number of days since the previous data version had been published.\n\nOn mondays this will typically be 3, on all other days 1.',
'title': 'entries_diff_last_age',
'type': 'number'},
'entries_letzter_stand': {'description': 'Absolute number of occurrences for this day as reported by the previous published data version.',
'title': 'entries_letzter_stand',
'type': 'number'},
'entries_neg': {'description': 'Absolute number of occurences of negative tests. Only applies to records with type "COVID19Test"',
'title': 'entries_neg',
'type': 'number'},
'entries_neu_gemeldet': {'description': 'Newly reported occurrences of the current data version (entries - entries_letzter_stand) for this day.',
'title': 'entries_neu_gemeldet',
'type': 'number'},
'entries_pos': {'description': 'Absolute number of occurences of positive tests. Only applies to records with type "COVID19Test"',
'title': 'entries_pos',
'type': 'number'},
'geoRegion': {'$ref': 'https://covid19.admin.ch#/definitions/GeoUnit',
'title': 'geoRegion'},
'inz_entries': {'description': 'Incidence value for this day.',
'title': 'inz_entries',
'type': 'number'},
'inzdelta7d': {'description': 'Difference between the "inz_entries" value of this record compared to the value 7 days before.',
'title': 'inzdelta7d',
'type': 'number'},
'inzmean7d': {'description': 'Rolling 7 day average (-3 days/ +3 days from the current day) of the "inz_entries" values.',
'title': 'inzmean7d',
'type': 'number'},
'inzsum14d': {'description': 'Sum of the "inz_entries" values for the last 14 days (rolling).',
'title': 'inzsum14d',
'type': 'number'},
'inzsum7d': {'description': 'Sum of the "inz_entries" values for the last 7 days (rolling).',
'title': 'inzsum7d',
'type': 'number'},
'inzsumTotal': {'description': 'Sum of the "inz_entries" values since the start of data recording.',
'title': 'inzsumTotal',
'type': 'number'},
'inzsumTotal_last14d': {'description': 'Sum of the "inz_entries" values since the beginning of the latest 14 days timeframe.\n\nStarts at 0 for the first day of the timeframe. Only available on records marked as belonging to the timeframe (timeframe_14d=true).',
'title': 'inzsumTotal_last14d',
'type': 'number'},
'inzsumTotal_last28d': {'description': 'Sum of the "inz_entries" values since the beginning of the latest 28 days timeframe.\n\nStarts at 0 for the first day of the timeframe. Only available on records marked as belonging to the timeframe (timeframe_28d=true).',
'title': 'inzsumTotal_last28d',
'type': 'number'},
'inzsumTotal_last7d': {'description': 'Sum of the "inz_entries" values since the beginning of the latest 7 days timeframe.\n\nStarts at 0 for the first day of the timeframe. Only available on records marked as belonging to the timeframe (timeframe_7d=true).',
'title': 'inzsumTotal_last7d',
'type': 'number'},
'mean7d': {'description': 'Rolling 7 day average (-3 days/ +3 days from the current day) of the "entries" values.',
'title': 'mean7d',
'type': 'number'},
'nachweismethode': {'description': 'Test type used. Only applies to records with type "Covid19Test".',
'enum': ['Antigen_Schnelltest', 'PCR', 'all'],
'title': 'nachweismethode',
'type': 'string'},
'offset_last14d': {'description': 'Sum of the "entries" values up until the start of the latest 14 day timeframe.',
'title': 'offset_last14d',
'type': 'number'},
'offset_last28d': {'description': 'Sum of the "entries" values up until the start of the latest 28 day timeframe.',
'title': 'offset_last28d',
'type': 'number'},
'offset_last7d': {'description': 'Sum of the "entries" values up until the start of the latest 7 day timeframe.',
'title': 'offset_last7d',
'type': 'number'},
'pop': {'description': 'Population number for the given geoRegion.',
'title': 'pop',
'type': 'number'},
'pos_anteil': {'description': 'Percentage of positive tests. Only applies to records with type "COVID19Test"',
'maximum': 100,
'minimum': 0,
'title': 'pos_anteil',
'type': 'number'},
'pos_anteil_mean7d': {'description': 'Rolling 7 day average (-3 days/ +3 days from the current day) of the "pos_anteil" values.',
'title': 'pos_anteil_mean7d',
'type': 'number'},
'sum14d': {'description': 'Sum of the "entries" values for the last 14 days (rolling).',
'title': 'sum14d',
'type': 'number'},
'sum7d': {'description': 'Sum of the "entries" values for the last 7 days (rolling).',
'title': 'sum7d',
'type': 'number'},
'sumTotal': {'description': 'Sum of the "entries" values since the start of data recording.',
'title': 'sumTotal',
'type': 'number'},
'sumTotal_last14d': {'description': 'Sum of the "entries" values since the beginning of the latest 14 days timeframe.\n\nStarts at 0 for the first day of the timeframe. Only available on records marked as belonging to the timeframe (timeframe_14d=true).',
'title': 'sumTotal_last14d',
'type': 'number'},
'sumTotal_last28d': {'description': 'Sum of the "entries" values since the beginning of the latest 28 days timeframe.\n\nStarts at 0 for the first day of the timeframe. Only available on records marked as belonging to the timeframe (timeframe_28d=true).',
'title': 'sumTotal_last28d',
'type': 'number'},
'sumTotal_last7d': {'description': 'Sum of the "entries" values since the beginning of the latest 7 days timeframe.\n\nStarts at 0 for the first day of the timeframe. Only available on records marked as belonging to the timeframe (timeframe_7d=true).',
'title': 'sumTotal_last7d',
'type': 'number'},
'sumdelta7d': {'description': 'Difference between the "entries" value of this record compared to the value 7 days before.',
'title': 'sumdelta7d',
'type': 'number'},
'timeframe_2w': {'description': 'Indicator if the record belongs to the latest 2 weeks timeframe. Current day is usually excluded.',
'title': 'timeframe_2w',
'type': 'boolean'},
'timeframe_4w': {'description': 'Indicator if the record belongs to the latest 4 weeks timeframe. Current day is usually excluded.',
'title': 'timeframe_4w',
'type': 'boolean'},
'timeframe_all': {'description': 'Indicator if the record belongs to the timeframe for all available data\n\nStarts at 24.02.2020 (Iso-Week 2020-09) for COVID19Cases, COVID19Hosp and COVID19Death types\nStarts at 24.01.2020 for COVID19Test type for CH+FL data combined and from 23.05.2020 for data by canton/CH/FL individually\nStarts at 30.03.2020 for COVID19HospCapacity type',
'title': 'timeframe_all',
'type': 'boolean'},
'type': {'$ref': 'https://covid19.admin.ch#/definitions/IncomingDataType',
'title': 'type'},
'version': {'description': 'Timestamp of data processing.\n\nformat YYYY-MM-DD_HH-mm-SS',
'title': 'version',
'type': 'string'}}
@fccoelho friendly ping about this PR
Not sure I get the table format. Is properties a JSON column?
Yes, these properties represent the SQL table schema in their docs. The normalized DataFrame version turns every nesting level into sub-columns. Do you think it is better to have these table schemas as JSON or as a DataFrame?
In [32]: pd.json_normalize(data[d[17]])
Out[32]:
   required                                           title                          ...  properties.version.title  properties.version.type
0  [data_expected, date, geoRegion, pop, primary_...  WeeklyDeathReasonIncomingData  ...  version                   string
[1 rows x 47 columns]
In [33]: list(_)
Out[33]:
['required',
'title',
'type',
'properties.altersklasse_covid19.enum',
'properties.altersklasse_covid19.title',
'properties.altersklasse_covid19.type',
'properties.data_expected.description',
'properties.data_expected.title',
'properties.data_expected.type',
'properties.date.description',
'properties.date.title',
'properties.date.type',
'properties.entries.description',
'properties.entries.title',
'properties.entries.type',
'properties.geoRegion.$ref',
'properties.geoRegion.title',
'properties.inz_entries.description',
'properties.inz_entries.title',
'properties.inz_entries.type',
'properties.inzsumTotal.description',
'properties.inzsumTotal.title',
'properties.inzsumTotal.type',
'properties.pop.description',
'properties.pop.title',
'properties.pop.type',
'properties.prct.description',
'properties.prct.title',
'properties.prct.type',
'properties.prctSumTotal.description',
'properties.prctSumTotal.maximum',
'properties.prctSumTotal.minimum',
'properties.prctSumTotal.title',
'properties.prctSumTotal.type',
'properties.primary_death_reason.description',
'properties.primary_death_reason.enum',
'properties.primary_death_reason.title',
'properties.primary_death_reason.type',
'properties.sumTotal.description',
'properties.sumTotal.title',
'properties.sumTotal.type',
'properties.type.enum',
'properties.type.title',
'properties.type.type',
'properties.version.description',
'properties.version.title',
'properties.version.type']
In the metadata format currently in use (you can see it on EGH), there are only three columns in the *_meta tables:
What you are proposing is much more complex.
@fccoelho sorry, I'm not sure I'm following here. The JSON has the 'title', 'type' and 'description' for each column in the table. The 'properties' prefix in front of the column names in the normalized version is there because those keys are nested under 'properties' in the JSON, and so on for deeper levels.
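To make the 'properties.' prefix concrete, here is a minimal sketch. The `schema` dict is a made-up, trimmed-down example shaped like the FOPH output above (it is an assumption for illustration, not extracted data). `pd.json_normalize` joins nested keys with a dot, so each key under 'properties' ends up as a 'properties.<column>.<key>' column:

```python
import pandas as pd

# Hypothetical schema for illustration only; shaped like the FOPH metadata above.
schema = {
    "title": "ExampleTable",
    "type": "object",
    "properties": {
        "datum": {
            "title": "datum",
            "type": "string",
            "description": "Date of case.",
        },
        "pos_anteil": {
            "title": "pos_anteil",
            "type": "number",
            "minimum": 0,
            "maximum": 100,
        },
    },
}

# A single dict becomes a one-row DataFrame; nested keys are joined with ".".
wide = pd.json_normalize(schema)
flat_columns = list(wide.columns)
print(flat_columns)
```

The wide layout keeps everything in one row per table, which is why the column count grows with every extra key such as 'minimum' or 'maximum'.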
Ok I got it now. The extraction seems to be fine. But I noticed that some of the columns do have more metadata, such as 'maximum' and 'minimum' besides the three basic descriptors.
Should properties be a separate DataFrame, or is it OK to store it as JSON in a properties column?
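One possible middle ground, sketched below under assumptions: keep one row per column with the three basic descriptors, and put anything else ('maximum', 'minimum', 'enum', '$ref') into a single JSON string column. The function name `properties_to_meta` and the column names 'column' and 'extra' are hypothetical, not part of epigraphhub; the sample entries are copied from the DailyIncomingData output above.

```python
import json

import pandas as pd

BASIC_KEYS = ("title", "type", "description")


def properties_to_meta(properties: dict) -> pd.DataFrame:
    """Build one row per column: basic descriptors plus a JSON 'extra' field."""
    rows = []
    for column_name, spec in properties.items():
        row = {"column": column_name}
        for key in BASIC_KEYS:
            row[key] = spec.get(key)  # None when the schema omits the key
        # Everything beyond the three basic descriptors is kept as JSON text.
        extra = {k: v for k, v in spec.items() if k not in BASIC_KEYS}
        row["extra"] = json.dumps(extra, sort_keys=True) if extra else None
        rows.append(row)
    return pd.DataFrame(rows, columns=["column", *BASIC_KEYS, "extra"])


# Sample entries taken from the 'properties' output shown earlier in this thread.
properties = {
    "entries": {
        "description": "Absolute number of occurrences for this day.",
        "title": "entries",
        "type": "number",
    },
    "pos_anteil": {
        "description": 'Percentage of positive tests. Only applies to records with type "COVID19Test"',
        "maximum": 100,
        "minimum": 0,
        "title": "pos_anteil",
        "type": "number",
    },
    "geoRegion": {
        "$ref": "https://covid19.admin.ch#/definitions/GeoUnit",
        "title": "geoRegion",
    },
}

meta = properties_to_meta(properties)
print(meta)
```

This keeps the table shape fixed at five columns regardless of which optional keys a schema uses, at the cost of needing `json.loads` to read the extras back.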
:tada: This PR is included in version 2.0.5 :tada:
The release is available on:
2.0.5
Your semantic-release bot :package::rocket:
Description
add foph tests and weekly data format feature.
Related Issue
https://github.com/thegraphnetwork/EpiGraphHub/issues/184
Checklist
CODE_OF_CONDUCT.md document.
CONTRIBUTING.md guide.
chore(docs): for Examples / docs / tutorials / dependencies update.
fix: for a bug fix (non-breaking change which fixes an issue).
chore(improvement): for an improvement (non-breaking change which improves an existing feature).
feat: for a new feature (non-breaking change which adds functionality).
BREAKING CHANGE: for a breaking change (fix or feature that would cause existing functionality to change).
chore(security): for a security fix.
Note: The title prefix can also be expanded with more information inside the parentheses. Some examples for a full title (prefix + message):
fix(login): Fix the login page.
feat(forecast): Add a weather forecast page.
chore(security-login): Add a weather forecast page.