This project aims to perform an analysis of global rivers using the SRTM 30 dataset and the LSDTopoTools package. The code presented here could be adapted to other projects which require the bulk downloading and processing of SRTM data based on polygon areas of interest.
This workflow has been designed to run on the QMUL Apocrita supercomputer (which uses Univa Grid Engine as its scheduler) so some modification of the scripts will most likely be needed if this is to be run in a different environment.
This section provides an overview of the files contained within this repository and how to use them.
requirements.txt
The Python package requirements for the scripts in this repository.
climate_zones/
The files within this directory are generated using some of the preprocessing scripts. The initial input data, taken from this paper, is provided as a geotiff. This raster is split into a series of shapefiles, which are subdivided until each climate sub zone is small enough to be processed in a reasonable amount of time and memory. The files which should be used are those in singlepart_files_split/; the other folders contain intermediate data which is preserved for debugging purposes.
An additional folder, singlepart_files_split_original/, contains the climate zones prior to the lake processing step.
The full code for these preprocessing steps can be found at https://zenodo.org/record/3257656.
preprocessing/
srtm_filenames.lst
is the list of every SRTM 30 tile stored on the OpenTopography servers.
parse_srtm_filenames.py
This script processes the list of SRTM tiles into a json file, srtm_coords.json, which stores the corner coordinates of each SRTM tile, keyed by the filename of the tile on the OpenTopography server.
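As a rough illustration of what this conversion involves, the sketch below assumes the standard SRTM naming convention, in which the filename encodes the latitude and longitude of the tile's south-west corner of a 1 x 1 degree tile; the actual script may differ in detail.

```python
import json
import re

def tile_corners(filename):
    """Return (min_lon, min_lat, max_lon, max_lat) for an SRTM tile name
    such as N34W119.hgt, assuming the name encodes the south-west corner
    of a 1 x 1 degree tile."""
    m = re.search(r"([NS])(\d{2})([EW])(\d{3})", filename)
    if m is None:
        raise ValueError(f"Not an SRTM tile name: {filename}")
    lat = int(m.group(2)) * (1 if m.group(1) == "N" else -1)
    lon = int(m.group(4)) * (1 if m.group(3) == "E" else -1)
    return (lon, lat, lon + 1, lat + 1)

# Hypothetical example: build a dict keyed by filename, as in srtm_coords.json
coords = {name: tile_corners(name) for name in ["N34W119.hgt", "S01E036.hgt"]}
print(json.dumps(coords, indent=2))
```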
get_bbox.py
This script identifies the bounding box coordinates, the UTM zone, and whether the bottom left corner of the bounding box is in the Northern or Southern hemisphere for each shapefile generated using quadpoly.py. The results are stored in ../processing/bboxes.json.
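For reference, the UTM zone and hemisphere of a corner point can be derived as in the following sketch, using the standard zone formula (ignoring the Norway and Svalbard exceptions); the actual script may compute this differently.

```python
def utm_zone_and_hemisphere(lon, lat):
    """Standard UTM zone formula plus a simple hemisphere test on the latitude sign."""
    zone = int((lon + 180) // 6) + 1
    zone = min(zone, 60)            # lon = 180 falls back into zone 60
    hemisphere = "North" if lat >= 0 else "South"
    return zone, hemisphere

# Hypothetical bottom-left corner of a bounding box
print(utm_zone_and_hemisphere(-118.3, 34.1))   # -> (11, 'North')
```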
get_dl_list.py
The final preprocessing script. This loads the bounding box of each shapefile from ../processing/bboxes.json and calculates all of the SRTM tiles which intersect this bounding box. From this intersection, a json containing a list of download urls, keyed by the climate sub zone name, is written to ../processing/download_links.json.
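The intersection test itself reduces to an axis-aligned bounding box overlap check. A minimal sketch, assuming the hypothetical corner format (min_lon, min_lat, max_lon, max_lat) used above:

```python
def bboxes_intersect(a, b):
    """Axis-aligned overlap test for (min_lon, min_lat, max_lon, max_lat) boxes."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def tiles_for_bbox(bbox, srtm_coords):
    """Return the tile filenames whose footprints intersect the bounding box.
    srtm_coords maps tile filename -> (min_lon, min_lat, max_lon, max_lat)."""
    return [name for name, tile in srtm_coords.items() if bboxes_intersect(bbox, tile)]

# Hypothetical usage with two tiles
srtm_coords = {"N34W119.hgt": (-119, 34, -118, 35), "N35W120.hgt": (-120, 35, -119, 36)}
print(tiles_for_bbox((-118.6, 33.8, -118.2, 34.4), srtm_coords))  # -> ['N34W119.hgt']
```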
processing/
get_urls.py
Simple command line script to return a list of the download urls for a given input climate sub zone filename.
runner.sh
The main script: it downloads, merges, clips and reprojects the SRTM tiles, runs the resulting DEM through the LSD code and postprocesses the outputs. Takes six input arguments:
$1 - shapefile name without .shp
$2 - utm zone
$3 - north or south
$4 - drainage threshold in pixels
$5 - min basin size in pixels
$6 - max basin size in pixels
res_runner.sh
Script used to get the resolution data for each tile prior to the full analysis being run. Takes three input arguments:
$1 - shapefile name without .shp
$2 - utm zone
$3 - north or south
build_array_params.py
Use this script to generate a file containing the matrix of parameters needed to deploy an array job. Takes 2 input arguments, the minimum and maximum number of SRTM tiles to be included in the job. This allows multiple array jobs to be created with different memory requirements.
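For example, a parameter file for sub zones needing between 1 and 50 SRTM tiles (hypothetical values) could be built with:
$ python build_array_params.py 1 50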
Array_{1..5}.sh
Example UGE job scripts used to deploy an array job composed of multiple instances of runner.sh.
Array_res.sh
Example UGE job script used to deploy an array job to get the resolution data. Calls res_runner.sh.
array_params_*.txt
Output parameter files generated by build_array_params.py.
bboxes.json
JSON containing the bounding boxes of each climate zone tile.
download_links.json
JSON containing the download links needed to build the topographic data for each climate zone tile.
threshold_areas.json
JSON containing the basin min and max areas, and the channel extraction threshold for each climate zone tile, ensuring a consistent channel extraction threshold across all of the analysis.
LSD_code/
This project provides a driver file, gdbm_analysis.cpp, an associated makefile, gdbm_analysis.make, and lightly modified versions of LSDChiTools.cpp, LSDChiTools.hpp and LSDJunctionNetwork.cpp to generate the required output data from the clipped SRTM tiles. These patches should be applied to this 2018 release of LSDTopoTools by copying the files into the appropriate folders of the LSDTopoTools installation and building the driver file with the command: make -f gdbm_analysis.make. If you have run this makefile before, be sure to clean things up before applying the patch: make -f gdbm_analysis.make clean.
If you need guidance on getting started with LSDTopoTools, see this user guide.
postprocessing/
Following the execution of the LSD code, export_rivers.py is used to identify the longest river in each drainage basin and export it to its own csv file.
secondary_analysis.py
Must be run with one command line argument to generate secondary statistics, including NCI, for all of the river files generated by the processing steps:
$ python secondary_analysis.py <climate zone>
The output file will contain the NCI value for each river, alongside its relief, flow length, overall gradient, Aridity Index statistics, a straightness metric and a pit fill metric. This code can be run on a UGE HPC using the secondary_job.sh job script, which reads the climate zone names from params.txt.
crossed_zone_count.py is used to generate statistics about the number of basins excluded from the analysis due to crossing tile boundaries.
lakes/
We have used the Global Lakes and Wetlands Database to clip out lakes from our topographic data, to ensure we are only analysing real channels. A paper describing the data can be found here. We are only using the level 1 data, as the level 2 data is not at an appropriate resolution for this global study; details of this process can be seen in this issue.
clipper.py is the code used to process the data downloaded from the above links into the appropriate format.
summary_figure/ and zone_figure/
Files used to generate some figures.
This section outlines the steps required to go from the Koppen climate zone raster to the final processed files, via a series of preprocessing steps, an automated processing workflow and some postprocessing.
1. Run reclassify.py to reclassify the Koppen climate zone raster into the merged numerical codes described below.
2. Run multi_to_single.py to convert the climate zone shapefiles from multipart to singlepart geometries.
3. Run quadpoly.py to subdivide any climate sub zones which are too large to be processed efficiently.
4. Get the list of SRTM 30 tiles stored on the OpenTopography servers:
aws s3 ls s3://raster/SRTM_GL1/ --recursive --endpoint-url https://opentopography.s3.sdsc.edu --no-sign-request
5. Run parse_srtm_filenames.py to generate srtm_coords.json.
6. Run get_bbox.py to generate bboxes.json.
7. Run get_dl_list.py to generate download_links.json.
8. We need to get the data resolution for each climate zone tile, so that we can set a globally consistent set of channel extraction parameters: build the parameter file with build_array_params_resolution.py and deploy the resolution array job. See the example for UGE: array_res.sh.
9. Run area_thresholds.py to generate the final JSON file needed to run the main HPC processing steps.
10. Build the parameter files with build_array_params.py and deploy the main processing array jobs. See the examples for UGE: array_{1..5}.sh.
11. Run secondary_analysis.py, which can be run as an HPC job using secondary_job.sh.
The Koppen climate zones are described by letter codes in the original paper. We have merged some of the similar zones to allow more general trends in the data to emerge.
This table contains the mappings between the letter codes used in the paper and our numerical codes.
Letter Code | Classification | Numerical Code (original values merged) |
---|---|---|
Af | Tropical-Rainforest | 1 |
Am | Tropical-Monsoon | 2 |
Aw | Tropical-Savannah | 3 |
BWh | Arid-Desert-Hot | 4 |
BWk | Arid-Desert-Cold | 5 |
BSh | Arid-Steppe-Hot | 6 |
BSk | Arid-Steppe-Cold | 7 |
Cs | Temperate-Dry summer | 8 (8, 9) |
Cw | Temperate-Dry winter | 11 (11, 12, 13) |
Cf | Temperate-Without dry season | 14 (14, 15, 16) |
Ds | Cold-Dry summer | 17 (17, 18, 19, 20) |
Dw | Cold-Dry Winter | 21 (21, 22, 23, 24) |
Df | Cold-Without dry season | 25 (25, 26, 27, 28) |
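The merging above amounts to a simple remapping of the original raster values. The sketch below is built directly from the table (any original value not listed there is left unmapped); reclassify.py may implement this differently.

```python
# Mapping from the original Koppen raster values to the merged numerical codes.
KOPPEN_REMAP = {
    1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7,   # A and B zones keep their codes
    8: 8, 9: 8,                                  # Cs
    11: 11, 12: 11, 13: 11,                      # Cw
    14: 14, 15: 14, 16: 14,                      # Cf
    17: 17, 18: 17, 19: 17, 20: 17,              # Ds
    21: 21, 22: 21, 23: 21, 24: 21,              # Dw
    25: 25, 26: 25, 27: 25, 28: 25,              # Df
}

def reclassify_value(original):
    """Return the merged code for an original raster value, or None if it is not analysed."""
    return KOPPEN_REMAP.get(original)
```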
As the climate zones are not contiguous, we need to be able to split each climate zone into a series of individual polygons. To achieve this, a sub zone ID is added to each climate zone, so that the 5th polygon of zone 4 would be referred to as 4_5. Note that no information is contained within these sub zone IDs; we cannot assume any spatial relationship between sub zones based on their numerical value.
In some cases the sub zones are still too big to be processed efficiently. These are further divided using a quadtree-like algorithm. In order to ensure files are never overwritten whilst running this recursive algorithm, the output subsets of a given climate sub zone are given an additional unique id: 4_5_a6a60415_78a8_4ed7_8f74_f9fbfceb09f5. Again, these additional IDs confer no other information and serve solely to ensure the uniqueness of each climate sub zone.
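One way to generate IDs of this form, matching the example above (a sketch only; the repository code may construct them differently):

```python
import uuid

def subdivision_id(parent_id):
    """Append a unique suffix to a climate sub zone ID, e.g. '4_5' ->
    '4_5_a6a60415_78a8_4ed7_8f74_f9fbfceb09f5'. The suffix carries no
    spatial meaning; it only guarantees the output filename is unique."""
    return f"{parent_id}_{str(uuid.uuid4()).replace('-', '_')}"

print(subdivision_id("4_5"))
```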
row,col,lat,long,elevation,flow length,drainage area,basin key,flowdir,aridity index,pit flag,perimeter pixel count,area pixel count,easting,northing
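Assuming the header line above lists the columns of the per-river csv files written by export_rivers.py, and that elevations and flow lengths are in metres, a minimal sketch of loading one such file and recomputing a few of the secondary statistics might look like this (the filename is hypothetical):

```python
import pandas as pd

# Hypothetical filename; assumes the column names listed above.
river = pd.read_csv("4_5_example_river.csv")

relief = river["elevation"].max() - river["elevation"].min()        # assumed metres
length = river["flow length"].max() - river["flow length"].min()    # assumed metres
overall_gradient = relief / length if length > 0 else float("nan")

print(f"relief={relief:.1f} m, flow length={length:.1f} m, gradient={overall_gradient:.4f}")
```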