Central storage, management and access solution for important geospatial datasets. Developed by Toitū Te Whenua Land Information New Zealand.
A Geostore VPC must exist in your AWS account before deploying this application. At Toitū Te Whenua LINZ, VPCs are managed internally by the IT team. If you are deploying this application outside Toitū Te Whenua LINZ, you will need to create a VPC with the following tags:
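The required tag keys and values aren't reproduced here, so the following only sketches the mechanism: tagging an existing VPC via the AWS CLI. Every value below (VPC ID, tag key, tag value) is a placeholder; substitute the tags your deployment requires.

```shell
aws ec2 create-tags \
    --resources vpc-0123456789abcdef0 \
    --tags Key=TAG_KEY,Value=TAG_VALUE
```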
You can achieve this by adding the networking stack (`infrastructure/networking_stack.py`) into `app.py` before deployment as a dependency of the application stack (`infrastructure/application_stack.py`).
This infrastructure by default includes some Toitū Te Whenua LINZ-specific parts, controlled by settings in `cdk.json`. To disable these, simply remove the context entries or set them to `false`. The settings are:
- `enableLDSAccess`: if true, gives Toitū Te Whenua LINZ Data Service/Koordinates read access to the storage bucket.
- `enableOpenTopographyAccess`: if true, gives OpenTopography read access to the storage bucket.

The following is a one-time setup, which generally assumes that you're in the project directory.
Add yourself to the `docker` group so you can run Docker without `sudo`:

```shell
sudo usermod --append --groups=docker "$USER"
```
Set up an AWS Azure login shortcut like this in your `.bashrc`:

```shell
aws-azure-login() {
    docker run --interactive --rm --tty --volume="${HOME}/.aws:/root/.aws" sportradar/aws-azure-login:2021062807125386530a "$@"
}
```
Install nvm:

```shell
cd "$(mktemp --directory)"
wget https://raw.githubusercontent.com/nvm-sh/nvm/master/install.sh
echo 'b674516f001d331c517be63c1baeaf71de6cbb6d68a44112bf2cff39a6bc246a install.sh' | sha256sum --check && bash install.sh
```
Install Poetry:

```shell
cd "$(mktemp --directory)"
wget https://raw.githubusercontent.com/python-poetry/poetry/master/install-poetry.py
echo 'b35d059be6f343ac1f05ae56e8eaaaebb34da8c92424ee00133821d7f11e3a9c install-poetry.py' | sha256sum --check && python3 install-poetry.py
```
Install Pyenv:

```shell
sudo apt-get update
sudo apt-get install --no-install-recommends build-essential curl libbz2-dev libffi-dev liblzma-dev libncurses5-dev libreadline-dev libsqlite3-dev libssl-dev libxml2-dev libxmlsec1-dev llvm make tk-dev wget xz-utils zlib1g-dev
cd "$(mktemp --directory)"
wget https://github.com/pyenv/pyenv-installer/raw/master/bin/pyenv-installer
echo '3aa49f2b3b77556272a80a01fe44d46733f4862dbbbc956002dc944c428bebd8 pyenv-installer' | sha256sum --check && bash pyenv-installer
```
Enable the above by adding the following to your `~/.bashrc`:
```shell
if [[ -e "${HOME}/.local/bin" ]]
then
    PATH="${HOME}/.local/bin:${PATH}"
fi

# nvm <https://github.com/nvm-sh/nvm>
if [[ -d "${HOME}/.nvm" ]]
then
    export NVM_DIR="${HOME}/.nvm"
    # shellcheck source=/dev/null
    [[ -s "${NVM_DIR}/nvm.sh" ]] && . "${NVM_DIR}/nvm.sh"
    # shellcheck source=/dev/null
    [[ -s "${NVM_DIR}/bash_completion" ]] && . "${NVM_DIR}/bash_completion"
fi

# Pyenv <https://github.com/pyenv/pyenv>
if [[ -e "${HOME}/.pyenv" ]]
then
    PATH="${HOME}/.pyenv/bin:${PATH}"
    eval "$(pyenv init --path)"
    eval "$(pyenv init -)"
    eval "$(pyenv virtualenv-init -)"
fi
```
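A quick sanity check after editing `~/.bashrc` (the `check_tool` helper is hypothetical, not part of the project): open a fresh shell and verify that each tool resolves to a command, function, or alias.

```shell
# Report whether a tool is available in the current shell.
# `type` also finds shell functions such as nvm, which `which` would miss.
check_tool() {
    if type "$1" > /dev/null 2>&1; then
        echo "$1: OK"
    else
        echo "$1: MISSING"
    fi
}

check_tool nvm
check_tool pyenv
check_tool poetry
```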
Install the project's Node.js version: `nvm install`
Install Go. This is required for the pre-commit `shfmt` hook.
Run `./reset-dev-env.bash --all` to install packages.
Enable the dev environment: `. activate-dev-env.bash`.
Optional: enable Dependabot alerts by email. (This can't currently be set per repository or organisation, so it affects any repos where you have access to Dependabot alerts.)
Re-run `./reset-dev-env.bash` when packages change. One easy way to use it pretty much seamlessly is to run it before every workday, with a crontab entry like this template:

```
HOME='/home/USERNAME'
0 2 * * 1-5 export PATH="${HOME}/.pyenv/shims:${HOME}/.pyenv/bin:${HOME}/.poetry/bin:/root/bin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/run/current-system/sw/bin" && cd "PATH_TO_GEOSTORE" && ./reset-dev-env.bash --all
```
Replace `USERNAME` and `PATH_TO_GEOSTORE` with your values, resulting in something like this:

```
HOME='/home/jdoe'
0 2 * * 1-5 export PATH="${HOME}/.pyenv/shims:${HOME}/.pyenv/bin:${HOME}/.poetry/bin:/root/bin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/run/current-system/sw/bin" && cd "${HOME}/dev/geostore" && ./reset-dev-env.bash --all
```
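To install the entry, one hedged sketch (this simplified line omits the PATH setup shown in the template above, and the checkout path is a placeholder):

```shell
# Build the crontab line from your own checkout path.
geostore_dir="${HOME}/dev/geostore"   # adjust to your checkout
entry="0 2 * * 1-5 cd '${geostore_dir}' && ./reset-dev-env.bash --all"
echo "$entry"
# To append it alongside any existing entries (requires cron):
# { crontab -l 2> /dev/null; echo "$entry"; } | crontab -
```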
Re-run `. activate-dev-env.bash` in each shell.
Run `nix-shell`. Set up lorri and run `direnv allow .` to load the Nix shell whenever you `cd` into the project. Restart your `nix-shell` when packages change.

When setting up the project SDK, point it to `.venv/bin/python`, which is a symlink to the latest Nix shell Python executable.
Configure a named AWS profile with permission to deploy stacks
Environment variables:

- `GEOSTORE_ENV_NAME`: sets the deployment environment. For your personal development stack, set `GEOSTORE_ENV_NAME` to your username:

  ```shell
  export GEOSTORE_ENV_NAME="$USER"
  ```

  Other values used by CI pipelines include `prod`, `nonprod`, `ci`, `dev` or any string without spaces. Default: `test`.
- `AWS_DEFAULT_REGION`: the region to deploy to. For practical reasons this is the nearest region:

  ```shell
  export AWS_DEFAULT_REGION=ap-southeast-2
  ```
- `RESOURCE_REMOVAL_POLICY`: determines whether resources containing user content, like the Geostore storage S3 bucket or application database tables, will be preserved even if they are removed from the stack or the stack is deleted. Supported values:
- `GEOSTORE_SAML_IDENTITY_PROVIDER_ARN`: SAML identity provider AWS ARN.
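Putting these together, a personal development setup might export something like the following (the ARN is a placeholder value; substitute your own identity provider's ARN):

```shell
export GEOSTORE_ENV_NAME="$USER"
export AWS_DEFAULT_REGION=ap-southeast-2
export GEOSTORE_SAML_IDENTITY_PROVIDER_ARN='arn:aws:iam::ACCOUNT_ID:saml-provider/PROVIDER_NAME'
```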
Bootstrap CDK (only once per profile):

```shell
cdk --profile=<AWS-PROFILE-NAME> bootstrap aws://unknown-account/ap-southeast-2
```
Deploy the CDK stack:

```shell
cdk --profile=<AWS-PROFILE-NAME> deploy --all
```

Once comfortable with CDK you can add `--require-approval=never` above to deploy non-interactively.
If you `export AWS_PROFILE=<AWS-PROFILE-NAME>` you won't need the `--profile=<AWS-PROFILE-NAME>` arguments above.
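Putting the above together, a first deployment might look like this sketch (the profile name and account are placeholders):

```shell
export AWS_PROFILE=<AWS-PROFILE-NAME>
cdk bootstrap aws://unknown-account/ap-southeast-2   # only once per profile
cdk deploy --all --require-approval=never            # non-interactive deploy
```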
When Dependabot updates any Python dependencies in pip requirements files (`*.txt`), make sure to run `./generate-requirements-files.bash` with the relevant path to update the versions of all its dependencies. Sometimes this will revert the file to the previous state, which means that specific dependency update is not compatible with the rest of the packages in the same file. For example, say `geostore/pip.txt` lists a package `foo`, which depends on `bar~=1.0`. This information is not part of the requirements file, so Dependabot might update `bar` to version 2.0, not being aware that it's incompatible with the current version of `foo`. `generate-requirements-files.bash` effectively re-checks this, creating a file with a compatible set of dependencies, which may mean reverting the update done by Dependabot. In this case, simply close the Dependabot PR.
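As a sketch, a Dependabot PR touching `geostore/pip.txt` might be checked like this (passing the requirements file path as the argument is an assumption about the script's interface; check the script itself):

```shell
./generate-requirements-files.bash geostore/pip.txt
git diff geostore/pip.txt   # if the diff reverts Dependabot's change, close the PR
```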
We're using poetry2nix to generate a Nix derivation from the `poetry.lock` file, to allow people to develop this project with either Nix or Poetry[^1]. Sometimes package updates will break the Nix shell, usually because Python packages don't list all their build dependencies. These need to be set up as a poetry2nix override. First try upgrading nixpkgs using `niv update` and re-running `nix-shell`; maybe the latest stable poetry2nix already has an override for this package. If not, you either have to work one out yourself (see upstream overrides) or report it.
To add a development-only package: `poetry add --dev --lock PACKAGE='*'`

To add a production package: `poetry add --lock --optional PACKAGE='*'`, and add the package to `[tool.poetry.extras]`.
.Make sure to update packages separately from adding packages. Basically, follow this process
before running poetry add
, and do the equivalent when updating Node.js packages or changing
Docker base images:
git checkout -b update-python-packages origin/master
.poetry update --lock
. The rest of the steps are only necessary
if this step changes poetry.lock. Otherwise you can just change back to the original branch
and delete "update-python-packages".poetry add
.git rebase update-python-packages
.At this point any poetry add
commands should not result in any package updates other than those
necessary to fulfil the new packages' dependencies.
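The steps above might look like this in practice (a sketch only; `NEW_PACKAGE` is a placeholder, and the commit/merge handling of the update branch is elided here, so adapt it to your branch layout):

```shell
git checkout -b update-python-packages origin/master
poetry update --lock
# ...commit and merge the update if poetry.lock changed...
poetry add --dev --lock NEW_PACKAGE='*'
git rebase update-python-packages
```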
Rationale: keeping upgrades and other package changes apart is useful when reading or bisecting history. It also makes code review easier.
When there's a merge conflict in `poetry.lock`, first check whether either or both commits contain a package upgrade:

- If neither does, run `git checkout --ours -- poetry.lock && poetry lock --no-update`.
- If one of them does, check out that version of the file (`git checkout --ours -- poetry.lock` or `git checkout --theirs -- poetry.lock`) and run `poetry lock --no-update` to regenerate `poetry.lock` with the current package versions.
- If both do, merge `poetry.lock` manually and run `poetry lock --no-update`.

Rationale: this should avoid accidentally down- or upgrading when resolving a merge conflict.
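For example, if only the incoming branch ("theirs") contains a package upgrade, the resolution might look like:

```shell
git checkout --theirs -- poetry.lock
poetry lock --no-update   # regenerate against the merged pyproject.toml
git add poetry.lock
```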
Update the code coverage minimum in pyproject.toml and the badge above on branches which increase it.
Rationale: By updating this continuously we avoid missing test regressions in new branches.
To minimise the chance of discrepancies between environments it is important to run the same (or as close as possible) version of Python in the development environment, in the pipeline, and in deployed instances. At the moment the available versions are constrained by the following:
When updating Python versions you have to check that all of the above can be kept at the same minor version, and ideally at the same patch level.
Prerequisites:
To launch the full test suite, run `pytest`.

To start debugging at a specific line, insert `import ipdb; ipdb.set_trace()`.

To debug a test run, add `--capture=no` to the `pytest` arguments. You can also automatically start debugging at a test failure point with `--pdb --pdbcls=IPython.terminal.debugger:Pdb`.
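For example, to debug a single failing test module (the test path here is illustrative):

```shell
pytest --capture=no --pdb --pdbcls=IPython.terminal.debugger:Pdb tests/test_example.py
```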
`jobs.<job_id>.runs-on` in the `.github` workflow files sets the runner type per job. We should make sure all of these use the latest specific Ubuntu LTS version ("ubuntu-YY.MM" as opposed to "ubuntu-latest"), so that the version changes only when we're ready for it.
To throw away the current cache (for example in case of cache corruption), simply change the `CACHE_SEED` repository "secret", for example to the current timestamp (`date +%s`). Subsequent jobs will then ignore the existing cache.
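One way to do this, assuming you have the GitHub CLI installed and authenticated against this repository:

```shell
gh secret set CACHE_SEED --body="$(date +%s)"
```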
To delete a dataset you'll need its title and ID. Once a dataset has some files in it, it's much harder to delete. This is intentional, to avoid accidental loss of important and costly data. The following should be a complete set of actions to delete a dataset, with template values in UPPERCASE. Note the trailing slashes, to make sure we limit the commands to the specific dataset!
Update the root `catalog.json` (download it, remove the dataset's entry, and upload it again):

```shell
aws s3 cp s3://linz-geostore/catalog.json .
# Edit catalog.json to remove the dataset's link, then:
aws s3 cp catalog.json s3://linz-geostore/catalog.json
```
```shell
geostore dataset delete --id=DATASET_ID
aws s3 rm --recursive s3://linz-geostore/DATASET_TITLE/
```
```shell
aws s3api list-object-versions --bucket=linz-geostore --prefix=DATASET_TITLE/ | jq .DeleteMarkers
```
We aim to release at the end of each agile sprint (fortnightly), or whenever required (e.g. a bugfix or feature rollout). Each release triggers a production deployment via GitHub Actions.
Geostore follows semantic versioning. The release is tagged with `release-major.minor.patch` (e.g. `release-0.11.0`).
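If tagging manually rather than through the GitHub release UI, the equivalent might be the following (the version is illustrative, and this assumes the deployment pipeline is triggered by the pushed release tag):

```shell
git tag release-0.11.0
git push origin release-0.11.0
```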
The simplest way to deploy a release is to follow the process recommended by GitHub.
Release notes can be automatically generated from GitHub. This is optional and provides a list of commit titles since the last release. Commits from Dependabot are excluded from automatically generated release notes, as specified in `.github/release.yml`. You should always check the release notes and update them as needed.
Note: Geostore has no rollback process. Any fixes will need to be carried out on a roll-forward basis.
[^1]: When using Nix, make sure to remove the `.venv` directory. Mixing Nix and Poetry leads to weird behaviour.