Our vision for the future is to create a “Global Energy Data Commons” that will inform and shape a global research agenda. We will create actionable, open, and accessible information on the costs and benefits of potential solutions for an affordable, reliable, and sustainable energy system.
Our current scope is focused on an open database of power plants; the GitHub repository is available here: https://github.com/wri/global-power-plant-database
Core Challenge
We see the following categories of challenges that the open-source community could help address:
Smaller programming and data collection challenges that require Python skills (approx. 1-4 weeks each)
Integrate additional data sources (e.g., Egypt, Chile)
Write data extractors in Python and match newly extracted power plants against existing records in the database using Elasticsearch or other methods.
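As a rough illustration of the matching step, here is a minimal sketch that pairs newly extracted records with existing database entries by fuzzy name similarity and a coarse location check. The field names and thresholds are assumptions, and an Elasticsearch fuzzy query could replace the string comparison.

```python
# Hypothetical sketch: match extracted plant records against existing database
# entries by fuzzy name similarity plus a coarse location check.
# Field names (name, lat, lon, country) are illustrative, not the actual schema.
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Return a 0-1 similarity score between two plant names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_plant(candidate, existing_plants, name_threshold=0.85, max_deg=0.1):
    """Return the best existing record for a newly extracted plant, or None."""
    best, best_score = None, 0.0
    for plant in existing_plants:
        if plant["country"] != candidate["country"]:
            continue
        # Require the coordinates to be roughly co-located (~10 km at the equator).
        if abs(plant["lat"] - candidate["lat"]) > max_deg or \
           abs(plant["lon"] - candidate["lon"]) > max_deg:
            continue
        score = name_similarity(plant["name"], candidate["name"])
        if score > best_score:
            best, best_score = plant, score
    return best if best_score >= name_threshold else None
```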
Update existing data sources
Examine data sources that have not been updated for an extended period and adjust the existing Python code to refresh the data.
Validate data with satellite images or across data sources
Write Python code that extracts satellite tiles for the geolocations of power plants in the database, so that those geolocations can be manually confirmed as accurate.
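A minimal sketch of the tile lookup follows, assuming a standard Web-Mercator (slippy map) tile scheme; the tile server URL is a placeholder and the zoom level is an assumption.

```python
# Hypothetical sketch: compute the Web-Mercator (slippy map) tile that contains
# a plant's geolocation, so the corresponding imagery can be fetched and
# reviewed manually. The tile URL template is a placeholder, not a real endpoint.
import math

def latlon_to_tile(lat, lon, zoom):
    """Convert WGS84 lat/lon to slippy-map tile indices at the given zoom."""
    lat_rad = math.radians(lat)
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def tile_url(lat, lon, zoom=15,
             template="https://example-tile-server/{z}/{x}/{y}.png"):
    x, y = latlon_to_tile(lat, lon, zoom)
    return template.format(z=zoom, x=x, y=y)

# Example: a plant near 51.5 N, 0.1 W
print(tile_url(51.5, -0.1))
```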
Identify and collect new data manually
Where automated extraction is not possible, collect, standardize and match power plant data manually from publicly available sources
Technical infrastructure improvements that require Python, database, data processing, or data science skills (approx. 1-3 months each)
Create a flexible relational data model
The current data model cannot represent one-to-many relationships, which are necessary to store more granular data. A new data model needs to be planned and implemented.
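One possible shape for such a model, sketched with illustrative table and column names only (a plant owning many generating units and many annual generation records):

```python
# Hypothetical sketch of a relational model with one-to-many relationships:
# a plant can have many generating units and many annual generation records.
# Table and column names are illustrative, not an agreed schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE plant (
    plant_id       TEXT PRIMARY KEY,
    name           TEXT NOT NULL,
    country        TEXT NOT NULL,
    latitude       REAL,
    longitude      REAL
);

CREATE TABLE generating_unit (            -- one plant -> many units
    unit_id            INTEGER PRIMARY KEY,
    plant_id           TEXT NOT NULL REFERENCES plant(plant_id),
    fuel               TEXT,
    capacity_mw        REAL,
    commissioning_year INTEGER
);

CREATE TABLE annual_generation (          -- one plant -> many yearly records
    plant_id       TEXT NOT NULL REFERENCES plant(plant_id),
    year           INTEGER NOT NULL,
    generation_gwh REAL,
    source         TEXT,
    PRIMARY KEY (plant_id, year)
);
""")
```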
Split up the data processing chain into steps
The current code runs as one combined processing step. An improved system would separate (modularize) the steps into source-specific data extraction, intermediate data storage, data matching, and final database creation.
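A minimal sketch of such a modular pipeline, with illustrative stage names and file layout (each stage persists its output so later stages can be rerun per source):

```python
# Hypothetical sketch of a modular pipeline: each stage writes its output to
# disk so stages can be rerun independently per source. Stage bodies are
# placeholders; paths and function names are illustrative only.
import json
from pathlib import Path

EXTRACTED = Path("extracted")   # per-source standardized records
MATCHED = Path("matched")       # records linked to existing plant IDs
FINAL = Path("output")          # merged database

def extract(source: str, raw_records: list[dict]) -> Path:
    """Stage 1: standardize one source's raw records and store them."""
    out = EXTRACTED / f"{source}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(raw_records))
    return out

def match(source: str) -> Path:
    """Stage 2: link a source's extracted records to existing plant IDs."""
    records = json.loads((EXTRACTED / f"{source}.json").read_text())
    # Placeholder: real matching would call the Elasticsearch/fuzzy logic.
    linked = [{**r, "gppd_id": None} for r in records]
    out = MATCHED / f"{source}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(linked))
    return out

def build_database() -> Path:
    """Stage 3: merge all matched sources into the final database file."""
    merged = []
    for path in sorted(MATCHED.glob("*.json")):
        merged.extend(json.loads(path.read_text()))
    FINAL.mkdir(parents=True, exist_ok=True)
    out = FINAL / "merged_database.json"
    out.write_text(json.dumps(merged))
    return out
```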
Research projects that require 1-3 people with modeling skills, GIS, and some energy expertise (approx. 3-9 months or longer each)
Predict other emissions like NO2 from satellite measurements
TROPOMI data, released to the public in July 2018, allows for precise measurement of air pollution and determination of its chemical components (including NOx emissions).
Combining this data with the power plant database can enable the creation of a global dataset of daily NOx emissions for large power plants. A reliable dataset of plant locations, overlaid with granular information on NOx emissions and other key data layers (topography, wind speed), would allow the first attribution of NOx emissions to individual power plants at the global level. This would be particularly useful in countries where NOx emission data is not reported; even where it is reported, it is rarely available at daily resolution.
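As a rough sketch of the overlay step, assuming the TROPOMI NO2 product has already been regridded to a regular latitude/longitude grid in a NetCDF file (the path, variable, and coordinate names below are assumptions):

```python
# Hypothetical sketch: sample a regridded TROPOMI NO2 field at plant locations.
# The file path and variable/coordinate names are assumptions; the raw L2
# product would need regridding before a lookup like this, and a single daily
# 2-D field is assumed.
import xarray as xr

def no2_at_plants(netcdf_path, plants):
    """Return the NO2 column value nearest to each plant's coordinates."""
    ds = xr.open_dataset(netcdf_path)
    values = {}
    for plant in plants:
        cell = ds["tropospheric_NO2_column"].sel(
            latitude=plant["lat"], longitude=plant["lon"], method="nearest"
        )
        values[plant["name"]] = float(cell)
    return values
```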
Using Natural Language Processing to collect public data
We propose building an NLP tool that will ingest text-based documents and perform entity and action resolution to pair power plants with indicators that we are currently missing (priority on the companies that own them). There is a large amount of power plant ownership information available online in the form of public financial filings, press releases, news articles, and other text-based documents. The NLP tool will ingest documents and determine (a) which power plant is being described (entity resolution) and (b) the power plant ownership and possibly other characteristics (action resolution).
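A minimal sketch of the entity-resolution step, using spaCy named-entity recognition and a naive fuzzy match against known plant names; the model choice and pairing heuristic are illustrative assumptions, and real action resolution would need sentence-level context:

```python
# Hypothetical sketch: pull organization and plant-name candidates out of a
# document with spaCy NER, then pair them with database plants by fuzzy name
# match. Model choice and the pairing heuristic are illustrative only.
import spacy
from difflib import get_close_matches

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_ownership_candidates(text, known_plant_names):
    """Return (plant_name, organization) pairs suggested by a document."""
    doc = nlp(text)
    orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    mentions = [ent.text for ent in doc.ents if ent.label_ in ("FAC", "ORG", "GPE")]
    pairs = []
    for mention in mentions:
        hits = get_close_matches(mention, known_plant_names, n=1, cutoff=0.8)
        if hits:
            # Naive heuristic: associate the matched plant with every ORG in
            # the document; a real tool would resolve the owning company from
            # sentence-level context (action resolution).
            pairs.extend((hits[0], org) for org in orgs)
    return pairs
```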
Remote sensing to detect cooling type
Water use by power plants is a key impact and vulnerability. We have a model that estimates water use based on a plant's cooling type. We propose building a machine vision algorithm that uses plant locations and satellite imagery to automatically detect the cooling type of each plant. Training data is available and could be expanded upon.
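As one possible starting point, a sketch of fine-tuning a pretrained image classifier on labeled image chips centered on plant locations; the directory layout, class list, and hyperparameters are assumptions, not the project's setup:

```python
# Hypothetical sketch: fine-tune a pretrained ResNet to classify cooling type
# from satellite image chips centered on plant locations. Directory layout,
# class list, and hyperparameters are assumptions.
import torch
from torch import nn
from torchvision import datasets, models, transforms

COOLING_TYPES = ["once_through", "recirculating_tower", "dry", "pond"]  # illustrative

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("chips/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(COOLING_TYPES))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:          # one illustrative training pass
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```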
Remote sensing to detect wind farms or solar plants
There has been some research piloted by universities on using machine vision to automatically detect wind farms or solar plants. These models will need to be expanded upon to achieve higher accuracy, increase geographic coverage, extract technical characteristics, integrate the results into the power plant database, and estimate additional indicators.
Improve models to predict generation or CO2 emissions
We have built models to estimate annual generation and emissions by plant. These models represent a first version and can likely be significantly improved.
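For illustration, a minimal sketch of one such model: a gradient-boosted regressor trained on plants with reported generation. The feature names and input file are placeholders, not the project's actual model specification.

```python
# Hypothetical sketch: estimate annual generation from plant attributes with a
# gradient-boosted regression model. Feature names and the input file are
# illustrative; the project's actual models may use different features.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Assumed training frame: one row per plant-year with reported generation.
df = pd.read_csv("plants_with_reported_generation.csv")  # placeholder path
features = ["capacity_mw", "commissioning_year", "fuel_code", "country_code"]
X = pd.get_dummies(df[features], columns=["fuel_code", "country_code"])
y = df["generation_gwh"]

model = GradientBoostingRegressor(n_estimators=300, max_depth=3)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean())
```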