Support vector population data as an option

carlhiggs commented 1 year ago

Currently, population data is configured as a raster grid (e.g. Global Human Settlements Layer population data). The raster is vectorised into a small area grid, and the sample point estimates within each grid cell are averaged. These small area estimates are then aggregated for the overall region as a population-weighted average.
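As a rough illustration of the aggregation described above, here is a minimal sketch. It is not the project's actual code: the 'GridCell' structure and function names are hypothetical, and only the 'pop_est' name is borrowed from later in this thread.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GridCell:
    """One small area cell (hypothetical structure, for illustration only)."""
    pop_est: float               # population estimate for the cell
    sample_values: list[float]   # indicator values at sample points in the cell


def cell_average(cell: GridCell) -> float:
    """Average the sample point estimates that fall within one cell."""
    return mean(cell.sample_values)


def population_weighted_average(cells: list[GridCell]) -> float:
    """Aggregate cell averages to a region-level, population-weighted average."""
    total_pop = sum(c.pop_est for c in cells)
    if total_pop == 0:
        raise ValueError("Total population is zero; weighted average is undefined")
    return sum(cell_average(c) * c.pop_est for c in cells) / total_pop


# Two made-up cells: the more populous cell pulls the regional estimate
# towards its own average.
cells = [
    GridCell(pop_est=100, sample_values=[0.2, 0.4]),
    GridCell(pop_est=300, sample_values=[0.8, 0.6]),
]
region_estimate = population_weighted_average(cells)
```

The same aggregation applies whether the cells come from a vectorised raster or from administrative areas, which is why a vector source could plausibly slot into the existing workflow.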

Many countries, including Australia and Spain, supply population data for administrative areas in vector file formats, and do so more commonly than as population grids. For example, if you want information on demographic sub-groups, or want to communicate using official statistical areas, the raster grid may not be the best option. Also, some raster data products (like GHS-POP) are modelled estimates rather than direct reflections of census counts, so in some instances official data may be preferred.

So, support for using administrative population data in a vector file format has been requested by some of our early adopters (including @xavidelclos).

In principle, modifying the software to allow optional configuration of this alternative data format should be doable. For example, if a raster format is configured, analysis proceeds as is currently the case to develop a vectorised small area grid; alternatively, if a vector format is configured, this is imported directly to serve as an equivalent small area vector grid (although not necessarily of equal areas).
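The branching described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name and returned descriptions are hypothetical placeholders for the real processing steps, and the 'data_type' and 'vector_population_data_field' keys follow the configuration posted later in this thread.

```python
def prepare_population_grid(population_config: dict) -> str:
    """Describe which preparation path would be taken for a population dataset.

    Hypothetical helper: the returned strings stand in for the real
    processing steps, which are not shown here.
    """
    # Assumption: a missing data_type is treated as the existing raster workflow.
    data_type = population_config.get("data_type", "raster")
    if data_type == "raster":
        # Existing behaviour: vectorise the raster into a small area grid.
        return "vectorise raster to small area grid"
    if data_type == "vector":
        # New behaviour: use the supplied areas directly as the small area grid.
        field = population_config.get("vector_population_data_field")
        if not field:
            raise ValueError(
                "vector population data requires 'vector_population_data_field'"
            )
        return f"import vector areas directly, using '{field}' as the population estimate"
    raise ValueError(f"Unsupported population data_type: {data_type!r}")
```

Failing early when no population field is configured keeps the vector path from silently running without a population estimate.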

There are currently some assumptions around field names and data structures that may have to be thought through with this modification, but in principle, it is do-able and may make it easier for creating population specific urban indicators, and for sensitivity analyses comparing gridded population data with official census data products distributed in vector formats.

carlhiggs commented 1 year ago

@xavidelclos @marcdmallafre I have just drafted the implementation for using custom vector data, tested using the 2021 Catalunya population data with demographic strata from Idescat (https://biblio.idescat.cat/publicacions/Record/21104 ; the download link was working again, so used this). I think it works quite well! There is a small caveat for population density indicators in the current implementation. Will be keen to hear your thoughts when you get a chance to read the details below.

Implementation of optional use of vector population data

So, as per #298 I've also allowed for all datasets to be defined directly in the region configuration yml, as an alternative to using datasets.yml to define shared datasets. The configuration files for a demographically stratified analysis of indicators for Tarragona, Catalunya are below. These specify both the source data and the field to be used for the population estimate. To allow for comparisons using the same source data, the relevant field is renamed to 'pop_est', and the population layer itself is renamed to reflect an alias for the vector population data and the configured field used for estimates (e.g. population_catalunya_2021_p_15_64).
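To make the renaming convention concrete, here is a small sketch. The helper names are hypothetical and this is not the code from the enhancements branch; the 'population_' prefix and lower-casing are inferred from the example layer name above.

```python
def population_layer_name(alias: str, field: str) -> str:
    """Build a layer name from the dataset alias and the configured field.

    Assumption: the convention is 'population_<alias>_<field>' in lower case,
    as suggested by the example population_catalunya_2021_p_15_64.
    """
    return f"population_{alias}_{field}".lower()


def standardise_population_field(record: dict, field: str) -> dict:
    """Return a copy of a record with the configured field renamed to 'pop_est'."""
    renamed = dict(record)
    renamed["pop_est"] = renamed.pop(field)
    return renamed


layer = population_layer_name("catalunya_2021", "P_15_64")
row = standardise_population_field({"TOTAL": 20, "P_15_64": 12}, "P_15_64")
```

Because every configured field ends up as 'pop_est', downstream steps can stay the same regardless of which demographic stratum is analysed.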

Tarragona 2021 population, 15 to 64 years

```
name: Tarragona
year: 2023
country: Spain
country_code: ES
continent: Europe
crs:
  name: ETRS89 / UTM zone 31N
  standard: EPSG
  srid: 25831
study_region_boundary:
  data: urban_query
  notes: Using a query of the Global Human Settlements layer to identify the urban region of Tarragona
population:
  alias: catalunya_2021
  name: "Població de Catalunya georeferenciada a 1 de gener de 2021"
  data_dir: population_grids/gridpoblacio01012021/gridpoblacio_01012021.shp
  vector_population_data_field: P_15_64
  crs_name: ETRS89 / UTM zone 31N
  crs_standard: ESRI
  crs_srid: 25831
  source_url: https://www.idescat.cat/serveis/biblioteca/docs/bib/publicacions/gridpoblacio01012021.zip
  provider: Institut d’Estadística de Catalunya
  year_published: 2023
  year_target: 2021
  date_acquired: 20230608
  licence: CC BY 4.0
  licence_url: https://creativecommons.org/licenses/by/4.0/deed.ast
  data_type: vector
  pop_min_threshold: 1 # urban sample points intersecting grid cells with estimated population less than this will be excluded from analysis
  citation: "Població De Catalunya Georeferenciada a 1 De Gener De .. Barcelona: Generalitat de Catalunya. Institut d'Estadística de Catalunya, 2016. https://biblio.idescat.cat/publicacions/Record/21104"
OpenStreetMap:
  data_dir: OpenStreetMap/cataluna-latest_2023-06-08.osm.pbf
  source: OpenStreetMap.fr
  publication_date: 20230221
  licence: ODbL
  licence_url: https://opendatacommons.org/licenses/odbl/
  url: https://download.geofabrik.de/europe/spain/cataluna-latest.osm.pbf
  note:
network:
  osmnx_retain_all: false
  buffered_region: true
  polygon_iteration: false
  connection_threshold:
  intersection_tolerance: 12
urban_region:
  name: "Global Human Settlements urban centres: 2015 (EU JRC, 2019)"
  data_dir: urban_regions/GHS_STAT_UCDB2015MT_GLOBE_R2019A/GHS_STAT_UCDB2015MT_GLOBE_R2019A_V1_2.gpkg
  data_type: vector
  epsg_name: WGS84
  epsg: 4326
  licence: CC BY 4.0
  licence_url: https://creativecommons.org/licenses/by/4.0/deed.ast
  citation: "Florczyk, A. et al. (2019): GHS Urban Centre Database 2015, multitemporal and multidimensional attributes, R2019A. European Commission, Joint Research Centre (JRC). https://data.jrc.ec.europa.eu/dataset/53473144-b88c-44bc-b4a3-4583ed1f547e"
  covariates:
    E_EC2E_T15:
      Units: tonnes per annum
      Unit description: tonnes per annum
      Description: Total emission of CO 2 from the transport sector, using non-short-cycle-organic fuels in 2015
    E_EC2O_T15:
      Units: tonnes per annum
      Unit description: tonnes per annum
      Description: Total emission of CO 2 from the energy sector, using short-cycle-organic fuels in 2015
    E_EPM2_T15:
      Units: tonnes per annum
      Unit description: tonnes per annum
      Description: Total emission of PM 2.5 from the transport sector in 2015
    E_CPM2_T14:
      Units: µg per cubic metre
      Unit description: micrograms per cubic meter
      Description: Total concertation of PM 2.5 for reference epoch 2014
urban_query: GHS:UC_NM_MN=='Tarragona' and CTR_MN_NM=='Spain'
country_gdp:
  classification: High-income
  reference: The World Bank. 2020. World Bank country and lending groups. https://datahelpdesk.worldbank.org/knowledgebase/articles/906519-world-bank-country-and-lending-groups
covariate_data: urban_query
gtfs_feeds:
policy_review:
notes: Example illustrating use of vector data for population comparions. This sub-analysis is configured to use population estimates for persons aged 15 to 64 years using official data from the Catalunya census at 1 January 2021.
reporting:
  publication_ready: False # Set 'publication_ready' to 'True' once you have checked results, updated the summary and are ready to publish
  doi: # It is recommended to register a DOI for your report, e.g. through figshare, zenodo or other repository
  images:
    # Store images in the process/configuration/assets folder.
    # Update file name, description and credit as required.
    1:
      file: Example image of a vibrant, walkable, urban neighbourhood - landscape.jpg
      description: Example image of a vibrant, walkable, urban neighbourhood with diverse people using active modes of transport and a tram (replace with a photograph, customised in region configuration)
      credit: Carl Higgs, Bing Image Creator, 2023
    2:
      file: Example image of a vibrant, walkable, urban neighbourhood - square.jpg
      description: Example image of a vibrant, walkable, urban neighbourhood with diverse people using active modes of transport and a tram (replace with a photograph, customised in region configuration)
      credit: Carl Higgs, Bing Image Creator, 2023
  languages:
    English:
      name: Tarragona (persons aged 15 to 64)
      country: Spain
      summary: |
        After reviewing the results, update this summary text to contextualise your findings, and relate to external text and documents (e.g. using website hyperlinks).
    Spanish - Spain:
      name: Tarragona (personas de edad 15 a 64)
      country: España
      summary: |
        Después de revisar los resultados, actualice este texto de resumen para contextualizar sus hallazgos y relacionarlo con textos y documentos externos (por ejemplo, utilizando hipervínculos de sitios web).
```

Tarragona 2021 population, 65 years and older

```
name: Tarragona
year: 2023
country: Spain
country_code: ES
continent: Europe
crs:
  name: ETRS89 / UTM zone 31N
  standard: EPSG
  srid: 25831
study_region_boundary:
  data: urban_query
  notes: Using a query of the Global Human Settlements layer to identify the urban region of Tarragona
population:
  alias: catalunya_2021
  name: "Població de Catalunya georeferenciada a 1 de gener de 2021"
  data_dir: population_grids/gridpoblacio01012021/gridpoblacio_01012021.shp
  vector_population_data_field: P_65_I_MES
  crs_name: ETRS89 / UTM zone 31N
  crs_standard: ESRI
  crs_srid: 25831
  source_url: https://www.idescat.cat/serveis/biblioteca/docs/bib/publicacions/gridpoblacio01012021.zip
  provider: Institut d’Estadística de Catalunya
  year_published: 2023
  year_target: 2021
  date_acquired: 20230608
  licence: CC BY 4.0
  licence_url: https://creativecommons.org/licenses/by/4.0/deed.ast
  data_type: vector
  pop_min_threshold: 1 # urban sample points intersecting grid cells with estimated population less than this will be excluded from analysis
  citation: "Població De Catalunya Georeferenciada a 1 De Gener De .. Barcelona: Generalitat de Catalunya. Institut d'Estadística de Catalunya, 2016. https://biblio.idescat.cat/publicacions/Record/21104"
OpenStreetMap:
  data_dir: OpenStreetMap/cataluna-latest_2023-06-08.osm.pbf
  source: OpenStreetMap.fr
  publication_date: 20230221
  licence: ODbL
  licence_url: https://opendatacommons.org/licenses/odbl/
  url: https://download.geofabrik.de/europe/spain/cataluna-latest.osm.pbf
  note:
network:
  osmnx_retain_all: false
  buffered_region: true
  polygon_iteration: false
  connection_threshold:
  intersection_tolerance: 12
urban_region:
  name: "Global Human Settlements urban centres: 2015 (EU JRC, 2019)"
  data_dir: urban_regions/GHS_STAT_UCDB2015MT_GLOBE_R2019A/GHS_STAT_UCDB2015MT_GLOBE_R2019A_V1_2.gpkg
  data_type: vector
  epsg_name: WGS84
  epsg: 4326
  licence: CC BY 4.0
  licence_url: https://creativecommons.org/licenses/by/4.0/deed.ast
  citation: "Florczyk, A. et al. (2019): GHS Urban Centre Database 2015, multitemporal and multidimensional attributes, R2019A. European Commission, Joint Research Centre (JRC). https://data.jrc.ec.europa.eu/dataset/53473144-b88c-44bc-b4a3-4583ed1f547e"
  covariates:
    E_EC2E_T15:
      Units: tonnes per annum
      Unit description: tonnes per annum
      Description: Total emission of CO 2 from the transport sector, using non-short-cycle-organic fuels in 2015
    E_EC2O_T15:
      Units: tonnes per annum
      Unit description: tonnes per annum
      Description: Total emission of CO 2 from the energy sector, using short-cycle-organic fuels in 2015
    E_EPM2_T15:
      Units: tonnes per annum
      Unit description: tonnes per annum
      Description: Total emission of PM 2.5 from the transport sector in 2015
    E_CPM2_T14:
      Units: µg per cubic metre
      Unit description: micrograms per cubic meter
      Description: Total concertation of PM 2.5 for reference epoch 2014
urban_query: GHS:UC_NM_MN=='Tarragona' and CTR_MN_NM=='Spain'
country_gdp:
  classification: High-income
  reference: The World Bank. 2020. World Bank country and lending groups. https://datahelpdesk.worldbank.org/knowledgebase/articles/906519-world-bank-country-and-lending-groups
covariate_data: urban_query
gtfs_feeds:
policy_review:
notes: Example illustrating use of vector data for population comparions. This sub-analysis is configured to use population estimates for persons aged 65 years and older using official data from the Catalunya census at 1 January 2021.
reporting:
  publication_ready: False # Set 'publication_ready' to 'True' once you have checked results, updated the summary and are ready to publish
  doi: # It is recommended to register a DOI for your report, e.g. through figshare, zenodo or other repository
  images:
    # Store images in the process/configuration/assets folder.
    # Update file name, description and credit as required.
    1:
      file: Example image of a vibrant, walkable, urban neighbourhood - landscape.jpg
      description: Example image of a vibrant, walkable, urban neighbourhood with diverse people using active modes of transport and a tram (replace with a photograph, customised in region configuration)
      credit: Carl Higgs, Bing Image Creator, 2023
    2:
      file: Example image of a vibrant, walkable, urban neighbourhood - square.jpg
      description: Example image of a vibrant, walkable, urban neighbourhood with diverse people using active modes of transport and a tram (replace with a photograph, customised in region configuration)
      credit: Carl Higgs, Bing Image Creator, 2023
  languages:
    English:
      name: Tarragona (persons aged 65 and older)
      country: Spain
      summary: |
        After reviewing the results, update this summary text to contextualise your findings, and relate to external text and documents (e.g. using website hyperlinks).
    Spanish - Spain:
      name: Tarragona (personas de edad 65 i mes)
      country: España
      summary: |
        Después de revisar los resultados, actualice este texto de resumen para contextualizar sus hallazgos y relacionarlo con textos y documentos externos (por ejemplo, utilizando hipervínculos de sitios web).
```

So, the key difference is that vector_population_data_field: P_15_64 is changed to vector_population_data_field: P_65_I_MES, although I also changed some of the notes and prose to reflect this when reporting.
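One way to confirm which settings differ between two such configurations is a small recursive comparison. This is a sketch only: the helper name is hypothetical, and the dictionaries below are abbreviated stand-ins for the full files once parsed.

```python
def config_differences(a: dict, b: dict, prefix: str = "") -> dict:
    """Recursively list the keys whose values differ between two nested dicts.

    Returns a mapping of dotted key paths to (value_in_a, value_in_b) pairs.
    """
    differences = {}
    for key in sorted(set(a) | set(b), key=str):
        path = f"{prefix}{key}"
        left, right = a.get(key), b.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            # Descend into nested sections such as 'population'.
            differences.update(config_differences(left, right, prefix=f"{path}."))
        elif left != right:
            differences[path] = (left, right)
    return differences


# Abbreviated stand-ins for the two Tarragona configurations.
working_age = {
    "name": "Tarragona",
    "population": {"alias": "catalunya_2021", "vector_population_data_field": "P_15_64"},
}
older = {
    "name": "Tarragona",
    "population": {"alias": "catalunya_2021", "vector_population_data_field": "P_65_I_MES"},
}
diff = config_differences(working_age, older)
```

Run against the full configurations, this would also list the notes and report names that were changed alongside the population field.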

Here's an example of the summary of differences when viewed using the new local browser web app:

... now, looking at this, I start to see the limitations of this particular implementation when applied to sub-population analyses such as this:

- yes, it uses the configured population to evaluate and weight statistics
- however, it does this very literally: when evaluating local neighbourhood population density, it currently uses only the sub-population statistic, not the broader 'total population', which really would be the relevant statistic
- this means that
  - the differences in overall estimates for population access to amenities will relate to differences in the spatial distribution of these populations and the average access to amenities for that cohort (this is as intended, and should be useful)
  - but the population density and walkability statistics only consider the cohort itself, hence the lower local neighbourhood population density for persons aged 65 and older (this probably is not intended, and really there should be an option to use total population for the denominator)

Perhaps this is useful enough for now to allow for stratified analyses, while noting that some population-specific indicators need care in their interpretation, as they relate to the population sub-group itself, not the broader population.

This is currently on the enhancements branch, if you want to give it a go.

carlhiggs commented 1 year ago

In the commit linked above I added in optional specification of a population_denominator variable that can be used when evaluating neighbourhood population density for stratified population sub-group analyses. This means that the overall population density is used, while the population of interest is used for weighting indicators for cohort sub-group specific estimates.
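The effect of a separate denominator can be illustrated with made-up numbers. This is a sketch under stated assumptions, not the committed implementation: the function names, the 'area_sqkm' field and the cell values are all hypothetical, while 'pop_est' and 'TOTAL' follow the field names used in this thread.

```python
from typing import Optional


def neighbourhood_density(cell: dict, denominator_field: Optional[str] = None) -> float:
    """Persons per square km, using the denominator field when one is given.

    Without a denominator, density falls back to the cohort estimate 'pop_est'.
    """
    field = denominator_field or "pop_est"
    return cell[field] / cell["area_sqkm"]


def cohort_weighted_density(cells: list, denominator_field: Optional[str] = None) -> float:
    """Weight each cell's density by the cohort population living in it."""
    total_weight = sum(c["pop_est"] for c in cells)
    return sum(
        neighbourhood_density(c, denominator_field) * c["pop_est"] for c in cells
    ) / total_weight


# Made-up cells: 'pop_est' is the cohort of interest, 'TOTAL' is everyone.
cells = [
    {"area_sqkm": 1.0, "TOTAL": 1000, "pop_est": 100},
    {"area_sqkm": 1.0, "TOTAL": 3000, "pop_est": 900},
]
cohort_only = cohort_weighted_density(cells)
with_total = cohort_weighted_density(cells, denominator_field="TOTAL")
```

In this toy case the cohort-only figure (820) understates the density that members of the cohort actually experience (2800), because it ignores everyone else living in the same neighbourhoods.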

In the case of Tarragona for persons 15 to 64 vs 65 and older, this now appears more sensible, with similar neighbourhood population density estimates overall. The difference reflects spatial variation in the sub-group population, while both groups used the overall population for evaluating the density. So, on average, persons aged 15 to 64 live in neighbourhoods that appear slightly less walkable than those of persons aged 65 years and older, although the differences are smaller, and both groups tend to live in the more walkable parts of the city. The older cohort have better access to large public open space, convenience stores and fresh food markets, based on the data used to identify these locations using OpenStreetMap. Consequently, the score for access to daily living amenities is also higher for the older group. Overall, the combined effect of older persons tending to live in more densely populated neighbourhoods with better access to amenities results in the difference in population-weighted walkability scores. Without weighting for sub-group population, the spatial average of walkability is basically equivalent (which makes sense, given it's the same city and the same denominator population is used to evaluate density).

I think that's useful, now that analyses can be configured to use population_denominator: TOTAL.

carlhiggs commented 1 year ago

oh, here's an example of the spatial distribution of walkability estimates for persons 15 to 64, using the vector grid official statistics data instead of raster modelled population estimates data:

xavidelclos commented 1 year ago

Carl, this looks great! It will be very useful. In the following weeks we could try to run it with other vector layers such as the census tracts. @marcdmallafre could try to run it for some of the Spanish cities if the current version of the software allows for it.

carlhiggs commented 1 year ago

Hi @xavidelclos, the current main branch zip file should work with this now. I'm waiting to make a few more changes before doing the next release, so it's not in the list of formal releases yet. If you give it a go @marcdmallafre, let me know how it goes!

healthysustainablecities / global-indicators

Support vector population data as an option #295
