
gnaf-loader

A quick way to load the complete Geocoded National Address File of Australia (GNAF) and Australian Administrative Boundaries into Postgres, simplified and ready to use as reference data for geocoding, analysis, visualisation and aggregation.

What's GNAF?

Have a look at these intro slides (PDF), as well as the data.gov.au page.

There are 4 options for loading the data

  1. Run the load-gnaf Python script and build the database yourself in a single step
  2. Pull the database from Docker Hub and run it in a container
  3. Download the GNAF and/or Admin Bdys Postgres dump files & restore them in your Postgres 14+ database
  4. Use or download the Geoparquet and Parquet files in S3 for your data & analytics workflows, either in AWS or on your own platform.

Option 1 - Run load-gnaf.py

Running the Python script takes 30-120 minutes on a Postgres server configured to take advantage of the RAM available.

You can process the GDA94 or GDA2020 version of the data - just ensure that you download the same version for both GNAF and the Administrative Boundaries. If you don't know what GDA94 or GDA2020 is, download the GDA94 versions (FYI: they're different coordinate reference systems).

Performance

To get a good load time you'll need to configure your Postgres server for performance. There's a good guide here; note that it's a few years old and some of the memory parameters can be increased if you have the RAM.
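As a rough starting point, the settings below are the sort of postgresql.conf parameters worth reviewing before a load; the values are illustrative assumptions for a dedicated server with around 16GB of RAM, not recommendations from this repository.

shared_buffers = 4GB
work_mem = 256MB
maintenance_work_mem = 1GB
max_parallel_workers = 8
max_wal_size = 4GB
checkpoint_timeout = 30min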

Pre-requisites

Process

  1. Download Geoscape GNAF from data.gov.au (GDA94 or GDA2020)
  2. Download Geoscape Administrative Boundaries from data.gov.au (download the ESRI Shapefile (GDA94 or GDA2020) version)
  3. Unzip GNAF to a directory on your Postgres server
  4. Unzip Admin Bdys to a local directory
  5. Alter security on those directories to grant Postgres read access (see the example after this list)
  6. Create the target database (if required)
  7. Add PostGIS to the database (if required) by running the following SQL: CREATE EXTENSION postgis
  8. Check the available and required arguments by running load-gnaf.py with the -h argument (see command line examples below)
  9. Run the script, come back in 30-120 minutes and enjoy!
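For step 5 on a Linux server, something like the line below is enough to give the postgres OS user read access; the paths are placeholders for wherever you unzipped the data, so adjust for your OS and security policy.

sudo chmod -R a+rX /data/gnaf /data/admin-bdys  # placeholder paths; capital X only adds execute (traverse) on directories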

Command Line Options

The behaviour of gnaf-loader can be controlled by specifying various command line options to the script. Supported arguments are:

Required Arguments

Postgres Parameters

Optional Arguments

Example Command Line Arguments
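A typical invocation looks something like the line below; the argument names and values here are assumptions for illustration, so confirm them by running load-gnaf.py with -h first.

python load-gnaf.py --pgdb=geo --pguser=postgres --pgpassword=password --gnaf-tables-path="/data/gnaf" --admin-bdys-path="/data/admin-bdys"  # argument names are assumptions - check with -h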

Advanced

You can load the Admin Boundaries without GNAF. To do this, comment out steps 1, 3 and 4 in def main.

Note: you can't load GNAF without the Admin Bdys due to dependencies required to split Melbourne and to fix non-boundary locality_pids on addresses.

Attribution

When using the data resulting from this process, you will need to adhere to the attribution requirements on the data.gov.au pages for GNAF and the Admin Bdys, as part of the open data licensing requirements.

WARNING:

IMPORTANT:

Option 2 - Run the database in a docker container

GNAF and the Admin Boundaries are ready to use in Postgres in an image on Docker Hub.

Process

  1. In your docker environment pull the image using docker pull minus34/gnafloader:latest
  2. Run using docker run --publish=5433:5432 minus34/gnafloader:latest
  3. Access Postgres in the container via port 5433. The default login is user: postgres, password: password
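For example, if psql is installed on the Docker host you can connect with the line below; the database and schema names inside the image may differ from a self-built database, so run \l and \dn once connected to see what's there.

psql -h localhost -p 5433 -U postgres  # connects to the default postgres database on the published port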

Note: the compressed Docker image is 8GB; uncompressed it is 25GB

WARNING: The default postgres superuser password is insecure and should be changed using:

ALTER USER postgres PASSWORD '<something a lot more secure>'

Option 3 - Load PG_DUMP Files

Download Postgres dump files and restore them in your database.

Should take 15-60 minutes.

Pre-requisites

Process

  1. Download the GNAF dump file or GNAF GDA2020 dump file (~2.0GB)
  2. Download the Admin Bdys dump file or Admin Bdys GDA2020 dump file (~2.8GB)
  3. Edit the restore-gnaf-admin-bdys.bat or .sh script in the supporting-files folder to set your dump file names, database parameters and the location of pg_restore
  4. Run the script, come back in 15-60 minutes and enjoy!
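If you'd rather not use the supplied scripts, a restore boils down to one pg_restore call per dump file; the sketch below is a minimal example in which the file name, target database and connection details are all placeholders.

pg_restore -h localhost -p 5432 -U postgres -d geo --no-owner gnaf.dmp  # placeholders; repeat for the admin bdys dump file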

Option 4 - Geoparquet Files in S3

Geoparquet versions of the spatial tables, as well as parquet versions of the non-spatial tables, are in a public S3 bucket for use directly in an application or service. They can also be downloaded using the AWS CLI.

Geometries have WGS84 lat/long coordinates (SRID/EPSG:4326). A sample query for analysing the data using Apache Sedona, the spatial extension to Apache Spark, is in the spark folder.
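As a rough illustration (separate from the sample in the spark folder), the sketch below reads one of the GeoParquet tables from the bucket listed below using PySpark and Apache Sedona; the table name, column names and Spark/S3 configuration are assumptions, so treat it as a starting point rather than a tested query.

from sedona.spark import SedonaContext

# build a Spark session with Sedona's SQL functions registered
# (assumes the Sedona packages/jars and S3 credentials are already configured)
config = SedonaContext.builder().appName("gnaf-geoparquet-example").getOrCreate()
sedona = SedonaContext.create(config)

# read an assumed table name from the public bucket and register it for SQL
df = sedona.read.format("geoparquet").load(
    "s3a://minus34.com/opendata/geoscape-202408/geoparquet/address_principals"
)
df.createOrReplaceTempView("address_principals")

# count addresses by state (the state column is an assumption)
sedona.sql("""
    SELECT state, count(*) AS address_count
    FROM address_principals
    GROUP BY state
    ORDER BY address_count DESC
""").show()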

The files are here: s3://minus34.com/opendata/geoscape-202408/geoparquet/

AWS CLI Examples:
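For instance, the two lines below list the available tables and copy one of them locally; the table name is an assumption, and you may need to add --no-sign-request if you are accessing the public bucket without AWS credentials.

aws s3 ls s3://minus34.com/opendata/geoscape-202408/geoparquet/
aws s3 sync s3://minus34.com/opendata/geoscape-202408/geoparquet/address_principals ./address_principals  # table name is an assumption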

DATA LICENSES

Incorporates or developed using G-NAF © Geoscape Australia licensed by the Commonwealth of Australia under the Open Geo-coded National Address File (G-NAF) End User Licence Agreement.

Incorporates or developed using Administrative Boundaries © Geoscape Australia licensed by the Commonwealth of Australia under Creative Commons Attribution 4.0 International licence (CC BY 4.0).

DATA CUSTOMISATION

GNAF and the Admin Bdys have been customised to remove some of the known, minor limitations with the data. The most notable are: