Automatic generation of renewable energy input data for energy models in arbitrary world regions using public datasets. Written in Julia.
The work here has been described in a scientific paper submitted to Energy Strategy Reviews along with its supplementary material.
This package uses several large datasets and requires a lot of disk space: roughly 10 GB + 29 GB/year of reanalysis data stored. Also, at least 50 GB of additional disk space will be required temporarily. Please ensure that you have enough space available (perhaps on a secondary hard drive) before proceeding with the data download. You also need a minimum of 8 GB of RAM memory.
Make sure you are on Julia v1.6 or higher. Type ]
to enter Julia's package mode, then:
(@v1.6) pkg> add https://github.com/niclasmattsson/GlobalEnergyGIS
Grab some coffee, because installing and compiling dependencies can take quite some time to run. If you don't yet have a Copernicus account, you can create one while you wait for the compilation to complete.
The GlobalEnergyGIS package makes use of the following datasets. By using this package, you agree to abide by their terms of use. Please click the links to open the terms of use in your browser.
GlobalEnergyGIS is based on several public datasets, most notably the ERA5 reanalysis by the ECMWF (European Centre for Medium-Range Weather Forecasts). The reanalysis data is used to calculate hourly solar insolation and wind speeds for any location on earth. To download ERA5 data you need to create a free account at the Copernicus Data Service (CDS).
Now we will create two config files that will keep track of your preferred data storage path and your Copernicus ID.
Since the datasets will require a lot of disk space (see hardware requirements above), the package allows you to store all data on a secondary disk drive if you have one available. For example, if you have a fast SSD as a boot drive and a larger HDD, then consider storing the data on the HDD. It may also be a good idea to choose a storage location which doesn't get automatically backuped if your backup storage or bandwidth is limited.
Login to your Copernicus account and go to your profile page (click your name in the upper right), then scroll down to the API key section. Here you will find your user ID (UID) and the long API key string that needs to be copy/pasted to the command below.
Run saveconfig(folder_path, Copernicus_UID, "your API key", agree_terms=true)
and substitute your own
Copernicus data to create a small configuration file (.cdsapirc) in your home directory. Use forward slash
/
or double backslashes \\
as folder delimiters in the path string. For example:
julia> using GlobalEnergyGIS
julia> saveconfig("D:/GISdata", 12345, "abcdef123-ab12-cd34-ef56-abcdef123456", agree_terms=true)
The argument agree_terms=true
is required to continue. By including it you agree to the terms of use of all
datasets listed above. The first time you run using GlobalEnergyGIS
there will be another delay (a minute
or two) while Julia precompiles the dependencies.
To agree to Copernicus terms, you must download a small amount of test data once using the web interface (otherwise you will get errors in step 5 below). Visit the CDS web interface, select one checkbox in each parameter section (e.g. Product type, Variable, Pressure level, etc.), and finally click the "Agree terms" and "Submit form" buttons as in the screenshot below. If you have any problem with this, follow step 2 in these detailed instructions.
Now we will download the auxiliary public datasets listed above. The following command will download them all to the data folder you supplied when you created the configuration file.
julia> download_datasets()
A total of 17 files will be downloaded and unpacked. This will probably take 30 minutes or so depending on your internet connection.
Several of the datasets are vector data (shapefiles). To speed up and simplify the subsequent processing, we will rasterize them all to a common resolution (0.01 degrees, or roughly 1 km at the equator).
julia> rasterize_datasets(cleanup=:all)
This command will automatically delete the original datasets to save disk space. Use the argument
cleanup=:limited
to keep the original files, or cleanup=:none
to also keep intermediate raster files.
Run download_and_convert_era5(data_year)
to begin downloading the ERA5 wind, solar and temperature data for a given year. About 26 GB of raw data must be downloaded (split into 24 files of 1.1 GB) per variable, per year. So 72 downloads in total. It will likely take several hours to download, depending on your internet connection and whether or not there is congestion in the CDS download queue.
julia> download_and_convert_era5(2018)
After the raw data for a variable has been downloaded, it is immediately aggregated and converted into more suitable file format (HDF5) and recompressed to save disk space. To minimize disk usage we also throw away data we won't need (by default we discard wind direction, far offshore wind speeds and solar insolation over oceans). This will reduce disk usage to about 31 GB per year of ERA5 data in total (on average 14.7 GB for solar, 8.6 GB for wind and 7.4 GB for temperatures - solar needs roughly twice the space since we need to store both direct and diffuse solar insolation).
There are three main steps:
The output of steps 2 and 3 are directly read by Supergridmodel (a companion energy system model designed to work with this GIS package).
Run create_scenario_datasets(SSPscenario, target_year)
, where SSPscenario
is one of "SSP1"
, "SSP2"
or "SSP3"
and target_year is one of 2020, 2030, ..., 2100.
julia> create_scenario_datasets("SSP2", 2050)
Run saveregions(regionset_name, region_definitions)
, where regionset_name
is a string and
region_definitions
is a matrix as specified below. Then run makedistances(regionset_name)
to determine
which regions are connected onshore and offshore and calculate distances between population-weighted region
centers.
julia> saveregions("Europe13", europe13)
julia> makedistances("Europe13")
Here europe13
is a region matrix defined in regiondefinitions.jl
, but you can refer to your own region
matrices defined elsewhere. See the next section for syntax examples. To get visual confirmation of the
results, run createmaps(regionset_name)
to create images of onshore and offshore region territories
(in /GISdata_folder_path/output
).
julia> createmaps("Europe13")
Regions are specified using an n x 2 matrix, where the first column contains names of model regions, and the
second column contains information on which subregions are included in each main region using GADM or NUTS
subregion names. This is facilitated using the GADM()
and NUTS()
helper functions. Here's a simple
example of making a 4-region definition matrix of Scandiavian countries using both administrative border
datasets:
scandinavia4 = [
"SWE" GADM("Sweden")
"NOR" GADM("Norway")
"DEN" GADM("Denmark")
"FIN" GADM("Finland", "Åland") # in GADM, the island of Åland is a separate level-0 region
]
scandinavia4_nuts = [
"SWE" NUTS("SE")
"NOR" NUTS("NO")
"DEN" NUTS("DK")
"FIN" NUTS("FI")
]
Subregions of the same level can be concatenated by listing multiple arguments to the GADM()
or NUTS()
call. For example, mainland Portugal can be defined by NUTS("PT11","PT15","PT16","PT17","PT18")
. This will
exclude islands that are otherwise included in NUTS("PT")
.
Subregion names (codes) are unique in NUTS but not always in GADM. For this reason, GADM subregions must indicate their parent regions. Here are two examples:
gadm_subregions = [
"India_E" GADM(["India"], "Odisha", "Jharkhand", "West Bengal", "Bihar")
"Öland" GADM(["Sweden","Kalmar"], "Borgholm", "Mörbylånga")
]
If the first argument to a GADM()
call is a vector, then the remaining arguments are subregions to the last
vector element. Here India_E (Eastern India) is defined using four level-1 subregions of the "India" level-0
region. Next we define the Swedish island of Öland using its two municipalities that are level-2 subregions
of Kalmar, which is itself a level-1 subregion of Sweden. If the first argument to a GADM()
call is not a
vector, then all arguments are assumed to be level-0 regions.
This vector syntax is not used in NUTS()
calls since all subregion code names are unique.
To concatenate different levels of subregions or to mix NUTS and GADM calls in the same region, use a tuple of GADM and NUTS calls by enclosing them in parentheses:
concatenation_examples = [
"China_SC" (GADM(["China"], "Henan","Hubei","Hunan","Guangdong","Guangxi","Hainan"), GADM("Hong Kong","Macao"))
"France" (NUTS("FR"), GADM("Monaco"))
]
Here China_SC (southcentral) is defined using six level-1 subregions of China, in addition to Hong Kong and Macao which are level-0 regions in GADM. Next, we define a France NUTS region that includes Monaco (which is not included in NUTS) by concatenating the GADM definition of Monaco.
For more syntax examples, see regiondefinitions.jl
(in the GlobalEnergyGIS /src folder).
Maps of NUTS regions can be found here..
There is a simple helper function subregion()
to facilitate finding the names of subregions. Note that the
GADM version takes multiple subregion arguments while the NUTS version only takes a single argument (and matches
the beginning of the subregion name). The function will return a vector of subregion names.
julia> subregions(GADM)
256-element Array{String,1}:
"Afghanistan"
"Akrotiri and Dhekelia"
"Albania"
"Algeria"
[...]
julia> subregions(GADM, "France")
13-element Array{String,1}:
"Auvergne-Rhône-Alpes"
"Bourgogne-Franche-Comté"
"Bretagne"
"Centre-Val de Loire"
[...]
julia> subregions(GADM, "France", "Bretagne")
4-element Array{String,1}:
"Côtes-d'Armor"
"Finistère"
"Ille-et-Vilaine"
"Morbihan"
julia> subregions(NUTS)
37-element Array{String,1}:
"AL"
"AT"
"BE"
"BG"
[...]
julia> subregions(NUTS, "UK")
179-element Array{String,1}:
"UKC11"
"UKC12"
"UKC13"
"UKC14"
[...]
julia> subregions(NUTS, "UKN")
11-element Array{String,1}:
"UKN06"
"UKN07"
"UKN08"
"UKN09"
[...]
Finally we have everything we need to actually calculate potential capacities and hourly capacity factors for solar-, wind- and hydropower. This is the basic syntax, which assumes default values for all unlisted GIS parameters.
julia> GISsolar(gisregion="Europe13")
julia> GISwind(gisregion="Europe13")
julia> GIShydro(gisregion="Europe13")
Here is a call that changes some parameters:
julia> GISwind(gisregion="Europe13", scenarioyear="ssp2_2020", era_year=2016, persons_per_km2=100,
max_depth=60, min_shore_distance=2, area_onshore=0.05, area_offshore=0.20)
The parameters that can be changed are listed in the section "GIS options" below. GISwind()
and
GISsolar()
also take an optional boolean parameter plotmasks=true
that will generate .png images
of the dataset masks resulting from the other parameters. This will increase run times by a minute
or two. Images will be placed in /GISdata_folder_path/output
.
julia> GISwind(gisregion="Europe13", ..., plotmasks=true)
julia> GISsolar(gisregion="Europe13", ..., plotmasks=true)
The synthetic demand module estimates the profile of hourly electricity demand in each model region given the total annual demand (determined by current national electricity demand per capita extrapolated using the SSP background scenario and target year). This is done using machine learning, specifically a method called gradient boosting tree regression. This is similar to ordinary regression, except that underlying mathematical relationships between variables are determined automatically using a black box approach.
We train the model based on real electricity demand in 44 countries for the year 2015. Regression variables include calendar effects (e.g. hour of day and weekday/weekend indicators), temperature variables (e.g. hourly temperature series in the most populated areas of each model region, or monthly averages as seasonality indicators) and economic indicators, e.g. local GDP per capita or electricity demand per capita (using the latter variable is not "cheating", since we are merely interested in predicting hourly profiles of normalized demand, not the demand level).
Assuming you have already downloaded the requisite temperature data for the year you want to study
(see era5download()
and maketempera5()
), and created population and GDP datasets for the SSP scenario
(see create_scenario_datasets()
):
julia> predictdemand(gisregion="Europe8", sspscenario="ssp2-26", sspyear=2050, era_year=2018)
This will create a matrix (size 8760xnumber_of_regions
) with the predicted electricity demand for each
model region and hour of the year. This data is saved in a new JLD file in /GISdata_folder_path/output
. Here the full SSP scenario variant must be specified including the 2-digit code representing radiative forcing target (e.g. 19, 26, 34, 45).
These are the default nine variables (so this will produce the exact same result as the previous example).
julia> selectedvars = [:hour, :weekend01, :temp_monthly, :ranked_month, :temp_top3,
:temp1_mean, :temp1_qlow, :temp1_qhigh, :demandpercapita]
julia> predictdemand(variables=selectedvars, gisregion="Europe8", sspscenario="ssp2-26", sspyear=2050, era_year=2018)
And here is a simpler example using seven variables:
julia> selectedvars = [:hour, :weekend01, :ranked_month, :temp_top3, :temp1_qlow, :temp1_qhigh, :gdppercapita]
julia> predictdemand(variables=selectedvars, gisregion="Europe8", sspscenario="ssp2-26", sspyear=2050, era_year=2018)
Currently we calculate data for 12 different variables. Any combination of these can be used to train on. The full list along with brief explanations appears below near the bottom of this README.
These are the default parameters:
julia> predictdemand(variables=defaultvariables, gisregion="Europe8", sspscenario="ssp2-34", sspyear=2050, era_year=2018,
nrounds=100, max_depth=7, eta=0.05, subsample=0.75, metrics=["mae"])
And here we modify them:
julia> predictdemand(variables=defaultvariables, gisregion="Europe8", sspscenario="ssp1-45", sspyear=2030, era_year=2018,
nrounds=40, max_depth=8, eta=0.30, subsample=0.85, metrics=["rmse"])
These parameters are explained briefly below. Additionally, any other XGBoost parameters can be specified.
Cross-validation of the training dataset can help determine which variables to train on and what values of learning parameters to use. This will predict the demand for all 44 countries in the demand dataset, but the model built for each country will only use data from the other 43 countries.
Iterations appear significantly slower than predictdemand()
because it trains 44 models in parallel.
julia> crossvalidate(variables=defaultvariables, nrounds=100, max_depth=7, eta=0.05, subsample=0.75, metrics=["mae"])
Note that there is no gisregion
argument since we are both training and predicting the same 44 country
dataset. The log will show two columns of training errors. The right column cv-train-mae
will have lower
errors, but this results from evaluating the model on the same data it trained on (i.e "cheating"). The
column cv-test-mae
on the left is the real (non-cheating) result. Resist the temptation of adapting
parameters to the right column.
solaroptions() = Dict(
:gisregion => "Europe8", # "Europe8", "Eurasia38", "Scand3"
:filenamesuffix => "", # e.g. "_landx2" to save high land availability data as "GISdata_solar2018_Europe8_landx2.mat"
:pv_density => 45, # Solar PV land use 45 Wp/m2 = 45 MWp/km2 (includes PV efficiency & module spacing, add latitude dependency later)
:csp_density => 35, # CSP land use 35 W/m2
:pvroof_area => .05, # area available for rooftop PV after the masks have been applied
:plant_area => .05, # area available for PV or CSP plants after the masks have been applied
:distance_elec_access => 300, # max distance to grid [km] (for solar classes of category B)
:plant_persons_per_km2 => 150, # not too crowded, max X persons/km2 (both PV and CSP plants)
:pvroof_persons_per_km2 => 200, # only in populated areas, so AT LEAST x persons/km2
# US census bureau requires 1000 ppl/mile^2 = 386 ppl/km2 for "urban" (half in Australia)
# roughly half the people of the world live at density > 300 ppl/km2
:exclude_landtypes => [0,1,2,3,4,5,8,12], # exclude water, forests and croplands. See codes in table below.
:protected_codes => [1,2,3,4,5,8], # IUCN codes to be excluded as protected areas. See codes in table below.
:scenarioyear => "ssp2_2050", # default scenario and year for population and grid access datasets
:era_year => 2018, # which year of the ERA5 time series to use
:res => 0.01, # resolution of auxiliary datasets [degrees per pixel]
:erares => 0.28125, # resolution of ERA5 datasets [degrees per pixel]
:pvclasses_min => [0.08,0.14,0.18,0.22,0.26], # lower bound on annual PV capacity factor for class X [0:0.01:0.49;]
:pvclasses_max => [0.14,0.18,0.22,0.26,1.00], # upper bound on annual PV capacity factor for class X [0.01:0.01:0.50;]
:cspclasses_min => [0.10,0.18,0.24,0.28,0.32], # lower bound on annual CSP capacity factor for class X
:cspclasses_max => [0.18,0.24,0.28,0.32,1.00] # upper bound on annual CSP capacity factor for class X
)
windoptions() = Dict(
:gisregion => "Europe8", # "Europe8", "Eurasia38", "Scand3"
:filenamesuffix => "", # e.g. "_landx2" to save high land availability data as "GISdata_solar2018_Europe8_landx2.mat"
:onshore_density => 5, # about 30% of existing farms have at least 5 W/m2, will become more common
:offshore_density => 8, # varies a lot in existing parks (4-18 W/m2)
# For reference: 10D x 5D spacing of 3 MW turbines (with 1D = 100m) is approximately 6 MW/km2 = 6 W/m2
:area_onshore => .08, # area available for onshore wind power after the masks have been applied
:area_offshore => .33, # area available for offshore wind power after the masks have been applied
:distance_elec_access => 300, # max distance to grid [km] (for wind classes of category B and offshore)
:persons_per_km2 => 150, # not too crowded, max X persons/km2
# US census bureau requires 1000 ppl/mile^2 = 386 ppl/km2 for "urban" (half in Australia)
# roughly half the people of the world live at density > 300 ppl/km2
:max_depth => 40, # max depth for offshore wind [m]
:min_shore_distance => 5, # minimum distance to shore for offshore wind [km]
:exclude_landtypes => [0,11,13], # exclude water, wetlands and urban areas. See codes in table below.
:protected_codes => [1,2,3,4,5,8], # IUCN codes to be excluded as protected areas. See codes in table below.
:scenarioyear => "ssp2_2050", # default scenario and year for population and grid access datasets
:era_year => 2018, # which year of the ERA5 time series to use
:rescale_to_wind_atlas => true, # rescale the ERA5 time series to fit annual wind speed averages from the Global Wind Atlas
:res => 0.01, # resolution of auxiliary datasets [degrees per pixel]
:erares => 0.28125, # resolution of ERA5 datasets [degrees per pixel]
:onshoreclasses_min => [2,5,6,7,8], # lower bound on annual onshore wind speeds for class X [0:0.25:12.25;]
:onshoreclasses_max => [5,6,7,8,99], # upper bound on annual onshore wind speeds for class X [0.25:0.25:12.5;]
:offshoreclasses_min => [3,6,7,8,9], # lower bound on annual offshore wind speeds for class X
:offshoreclasses_max => [6,7,8,9,99] # upper bound on annual offshore wind speeds for class X
)
hydrooptions() = Dict(
:gisregion => "Europe8", # "Europe8", "Eurasia38", "Scand3"
:costclasses_min => [ 0, 50, 100], # US $/MWh
:costclasses_max => [50, 100, 999],
:storageclasses_min => [ 0, 1e-6, 12], # weeks (discharge time)
:storageclasses_max => [1e-6, 12, 9e9]
)
0 'Water'
1 'Evergreen Needleleaf Forests'
2 'Evergreen Broadleaf Forests'
3 'Deciduous Needleleaf Forests'
4 'Deciduous Broadleaf Forests'
5 'Mixed Forests'
6 'Closed Shrublands'
7 'Open Shrublands'
8 'Woody Savannas'
9 'Savannas'
10 'Grasslands'
11 'Permanent Wetlands'
12 'Croplands'
13 'Urban'
14 'Cropland/Natural'
15 'Snow/Ice'
16 'Barren'
1 'Ia' 'Strict Nature Reserve'
2 'Ib' 'Wilderness Area'
3 'II' 'National Park'
4 'III' 'Natural Monument'
5 'IV' 'Habitat/Species Management'
6 'V' 'Protected Landscape/Seascape'
7 'VI' 'Managed Resource Protected Area'
8 'Not Reported' 'Not Reported'
9 'Not Applicable' 'Not Applicable'
10 'Not Assigned' 'Not Assigned'
List of variables in the training dataset that can be use for the regression:
Calendar variables
:hour hour of day
:month month of year
:weekend01 weekend indicator
Hourly temperatures:
:temp1 temperature in the largest population center of each region
:temp_top3 average temperature in the three largest population centers of each region
Monthly temperatures (season indicators):
:temp_monthly average monthly temperature in the largest population center of each region
:ranked_month rank of the average monthly temperature of each month (1-12)
Annual temperature levels and variability:
:temp1_mean average annual temperature in the largest population center of each region
:temp1_qlow low annual temperature - 5% quantile of hourly temperatures
:temp1_qhigh high annual temperature - 95% quantile of hourly temperatures
Economic indicators:
:demandpercapita level of annual average electricity demand [MWh/year/capita] in each region
:gdppercapita level of annual average GDP per capita [USD(2010)/capita] in each region
Our default values are adapted to our data and differ from the default XGBoost parameters. The parameters
also need to be adapted to each other. For example, a lower eta
value may require a higher nrounds
value
to reach full benefit.
nrounds=100 # number of rounds of learning improvements
max_depth=7 # tree depth, i.e. complexity of the underlying black box model. Increasing this may lead to overfitting.
eta=0.05 # learning rate. Higher values will improve faster, but may ultimately lead to a less efficient model.
subsample=0.75 # how much of the training data to use in each iteration. Same tradeoff as 'eta' parameter.
metrics=["mae"] # "mae" = mean absolute error, "rmse" = root mean square error, or both. Note the brackets (it's a vector).
A slightly better but much more computationally demanding set of parameters: nnrounds=1000
, max_depth=7
, eta=0.005
, and subsample=0.05
. These were the parameters used to produce figures 1 and 2 in the paper above.
For more information on these and other selectable parameters, see https://xgboost.readthedocs.io/en/latest/parameter.html.