os-climate / data-platform-demo


Bridging our own ESG Data Gaps #30

Open MichaelTiemannOSC opened 2 years ago

MichaelTiemannOSC commented 2 years ago

Global Warming Potential (GWP) is an essential metric for any portfolio decarbonization decision. There are many bottom-up datasets that contain information about CO2e emissions, but it is difficult to judge their completeness. The ESSD dataset (https://essd.copernicus.org/articles/13/5213/2021/essd-13-5213-2021-assets.html) provides a peer-reviewed, global estimate of all GWP emissions from 1970 to the present day, with attribution to sub-region and sub-sector levels according to IPCC taxonomies. Three Excel spreadsheets comprising 37 MB of raw data (about 800K rows) give a good sense of the total scale of the GWP problem.
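As a first look, the raw ESSD spreadsheets can be pulled into pandas before any pipeline design. A minimal sketch, assuming a local copy of one workbook (the file and sheet names below are placeholders, not the actual ESSD distribution names):

import pandas as pd

# Hypothetical local copy of one ESSD workbook; sheet name is also a placeholder
essd_df = pd.read_excel(
    "essd_ghg_data.xlsx",
    sheet_name="data",
    engine="openpyxl",
)
print(essd_df.shape)  # on the order of hundreds of thousands of rows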

With that top-down starting point, we can onboard bottom-up data that fills in details about specific sources on a country or regional basis, and also on a company, industry, sector, or super-sector basis. For example, the PUDL data provides detailed emissions data for US utilities (including US territories and protectorates). Other datasets can be identified to fill gaps in other locations, industries, sectors, etc. Most importantly, with a top-down view we will have some idea how large a gap a given dataset can fill in terms of GWP equations and analyses.

And of course, we want to tie any corporate ownership information to Legal Entity Identifiers, so that corporate hierarchies and connections with financial instruments can be understood.

A minimal satisfaction of this issue will therefore be:

It would be wonderful to ultimately be able to produce maps like this from our data: https://openghgmap.net/ (see also here: https://openghgmap.net/data/)

Should we also add climate data to the grid so that we can correlate temperature, precipitation, wind speed, etc., with macro-level effects like population change, energy use, etc.? Sources referenced from this survey paper (https://www.nature.com/articles/s41597-020-00726-5) include WorldClim 2.1 (https://www.worldclim.org/data/worldclim21.html), which provides data from 1970-2000 and is the most cited source in the literature. WorldClim also offers monthly data that is more current (up through 2018), but only for a few variables (temp_min, temp_max, precipitation).

MichaelTiemannOSC commented 2 years ago

Created repository and committed changes for https://github.com/os-climate/essd-ingest-pipeline

MichaelTiemannOSC commented 2 years ago

@caldeirav @erikerlandson I still feel very unconfident about the metadata I'm collecting and storing. I would love to have your eyes and comments on it. Better documentation would probably lead to greater confidence, for many people.

erikerlandson commented 2 years ago

Regarding UUID as a column:

1) If we set up uuid as one of the partition columns, it will not be stored as an actual column in the underlying Parquet or ORC files.

2) Even if we do not use it as a partition column, the columnar compression formats should achieve very high compression rates, since the value is constant across many rows. This will particularly be true if we make sure it can leverage delta compression: https://github.com/apache/parquet-format/blob/master/Encodings.md
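A small illustration of point 1), assuming pyarrow's hive-style dataset writer (the output path and example data are made up): the partition column ends up encoded in the directory names rather than stored in the Parquet row data.

import uuid
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Toy dataframe where every row carries the same dataset-level uuid
df = pd.DataFrame({"uuid": str(uuid.uuid4()), "value": range(1000)})
table = pa.Table.from_pandas(df)

# Hive-style partitioning: files land under /tmp/demo_dataset/uuid=<...>/,
# and the uuid column is dropped from the row data inside each Parquet file
pq.write_to_dataset(table, root_path="/tmp/demo_dataset", partition_cols=["uuid"])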

MichaelTiemannOSC commented 2 years ago

So it looks to me like I'm not storing UUID as a column, but rather as part of the Parquet naming structure, hence it is a partition column. So I should probably change my comment from a question ("Do we really want to add UUID to every row, or is it better to put it into table-level metadata?") to a statement ("Use UUID as one of our partition columns") and make it regular, non-bold text.

MichaelTiemannOSC commented 2 years ago

Regarding the onboarding of PUDL data, we have a bit of a strategic decision to make. The PUDL environment provides a convenient way for data scientists to access utility info in a very pandas-friendly way (using their own SQLAlchemy connector to SQLite databases). It also creates an abstraction layer so users don't have to think about which data is in SQLite (all the plant/utility/generator/generation/fuel data) and which is in Parquet (all the emissions data). The pudl.sqlite database may be something we connect to Trino via our own SQLAlchemy connector. We could of course explode the pudl.sqlite database into its constituent tables and create our own rendition of the data in Trino tables. The Parquet files are of course very easy to integrate.
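For reference, the pandas-friendly access path amounts to something like this hedged sketch (the local path and table name are assumptions about the PUDL layout, not verified against their schema):

import pandas as pd
import sqlalchemy as sa

# Local PUDL SQLite database (path is a placeholder)
engine = sa.create_engine("sqlite:///pudl.sqlite")

# Pull a table straight into a dataframe; "utilities_eia" is a hypothetical name
utils_df = pd.read_sql("SELECT * FROM utilities_eia", engine)
print(utils_df.shape)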

But...what sort of centralization and consistency do we want to impose on data sources, in terms of running everything through Trino versus a concept of federation that could be as wide open as saying pip install pudl (they strongly encourage using conda) and filling one's dataframes from a data source that draws from the Data Commons but not from Trino per se?

erikerlandson commented 2 years ago

Connecting Trino directly to SQLite would be nice. I think if they ever get a JDBC connector working, this will become easy: https://github.com/trinodb/trino/pull/3105

Failing that, my concern with just exposing a deployment of pip install pudl is that it is higher friction to federate with anything on Trino, although one could do the federation in "pandas space".

When we meet with the PUDL people, I'd like to discuss supporting some other strategic DBs, and ideally a single unified SQL interface to their entire set (for example, MySQL, PostgreSQL, or BigQuery might be interesting, and all are already supported by Trino). In such a scenario, we could run PUDL as a microservice and configure a Trino connector to it.
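For example, if PUDL exposed a PostgreSQL endpoint, federating it would be a matter of adding a Trino catalog file. A hypothetical etc/catalog/pudl.properties (host, database, and credentials are made up):

# Hypothetical Trino catalog pointing at a PUDL-backed PostgreSQL service
connector.name=postgresql
connection-url=jdbc:postgresql://pudl-db:5432/pudl
connection-user=pudl_reader
connection-password=changeme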

erikerlandson commented 2 years ago

We could certainly write an ingest pipeline that consumes both their Parquet and SQLite data and injects both into Trino. This seems less elegant, but it would be effective.
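A rough sketch of that option, assuming a SQLAlchemy Trino dialect and placeholder host, catalog, and schema names (a real pipeline would also map dtypes and set table properties explicitly):

import pandas as pd
import sqlalchemy as sa

# Source: the PUDL SQLite file; destination: a Trino catalog (names are placeholders)
sqlite_engine = sa.create_engine("sqlite:///pudl.sqlite")
trino_engine = sa.create_engine("trino://user@trino-host:443/osc_datacommons_dev")

# Copy every SQLite table into a Trino schema via pandas
for name in sa.inspect(sqlite_engine).get_table_names():
    df = pd.read_sql_table(name, sqlite_engine)
    df.to_sql(name, trino_engine, schema="pudl", if_exists="replace",
              index=False, chunksize=10_000)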

MichaelTiemannOSC commented 2 years ago

usql seems to be able to bridge from mysql to trino. Wonder why Trino cannot do it directly.

https://golangrepo.com/repo/xo-usql-go-database-tools

caldeirav commented 2 years ago

On the usage of UUID: in my latest ingestion demo sample (leveraging Iceberg), UUID is used for partitioning data at the dataset level. Design-wise I feel it makes sense, because we can only delete based on partition elements. We could have used a timestamp as well, but timestamp is already captured by Iceberg in the snapshot table and may not be consistent across all INSERTs into the table.

MichaelTiemannOSC commented 2 years ago

To wit...

# Build the column/type pairs from the source dataframe
schema = create_table_schema_pairs(df)

# Define an Iceberg table stored as parquet and partitioned by uuid
tabledef = """
create table if not exists osc_datacommons_iceberg_dev.{sname}.{tname} (
{schema}
) with (
    format = 'parquet',
    partitioning = ARRAY['uuid']
)
""".format(schema=schema, sname=schemaname, tname=tablename)
print(tabledef)

Thanks!

erikerlandson commented 2 years ago

usql seems to be able to bridge from mysql to trino. Wonder why Trino cannot do it directly.

I do not understand why the trino dev community has not given a generalized JDBC connector higher priority, since it would allow trino to connect to practically any db that exposes an SQL interface, probably including other trino dbs.

MichaelTiemannOSC commented 2 years ago

What they explain about halfway through the PR you posted is that they don't believe general solutions can be adequately tested; instead, they put their faith in rigorous testing of specific connectors. Now...why there is no support for (super-popular) SQLite is a question they don't answer.

MichaelTiemannOSC commented 2 years ago

@HeatherAck -- your encouraging response led me to flesh this out a bit more, and I think it's turning into a project, but a very doable one, since all the data has been uploaded. It thus may truly qualify as a worthy sprint. As elaborated above, these datasets connect both with GLEIF and with each other, and by making these connections we are well on our way to having enriched data. For example, a gap in the current RMI data is "Enterprise Value + Cash", but these are numbers that can be pulled from the SEC DERA data for publicly traded companies. Indeed, we now have richly described public financial data to go along with granular environmental data, within a context of global emissions (and potentially climate) data.

If we can get this in shape by Dec 1st, it will give the Data Commons gigabytes of heft.

caldeirav commented 2 years ago

Couple of notes here:

caldeirav commented 2 years ago

Also I am starting a small POC now for metadata handling on visualisation / client side. Need to complete this before finalising metadata handling and structure.

MichaelTiemannOSC commented 2 years ago

Here's an annotated picture showing both how things fit together and a sense of how many records we are attempting to fit together. Note that the 3K owners referenced in PUDL barely scratch the surface of the millions of rows of emissions data also in the PUDL database, for example.

[Image: Data Commons November Sprint]

@ludans @DavWinkel @ChristianMeyndt @ttschmucker @toki8

erikerlandson commented 2 years ago

xref https://github.com/trinodb/trino/issues/10060 for SQLite connector support

MichaelTiemannOSC commented 2 years ago

Just checking in with the fact that my LEI matcher matches 3966 of the 5866 DERA companies, a 68% match rate. There are about 1890 that don't yet match, but the ones I hand-checked have no matches on the GLEIF website, so I suspect we are suffering from a lack of LEI coverage.

(I previously reported much larger numbers in all categories, but this is because I forgot to run sort prior to uniq when generating my input to the matcher.)

MichaelTiemannOSC commented 2 years ago

Alas, PUDL requires SQLAlchemy version 1.4 to do some of its things (it can do a few things fine without it), but our Trino environment requires SQLAlchemy version 1.3 in order to function at all. So PUDL and Trino are not good bunkmates right now.

MichaelTiemannOSC commented 2 years ago

I've checked in a demo notebook that begins to use the connections possible in the above diagram. I've also left a comment in the code that there is a LONG way to go before queries that seem simple are simple.

https://github.com/os-climate/data-platform-demo/blob/master/notebooks/EPA_GHGRP-demos.ipynb

MichaelTiemannOSC commented 2 years ago

PUDL data integration may depend on this: https://github.com/catalyst-cooperative/pudl/pull/1361

erikerlandson commented 2 years ago

I just ran some operations using the latest sqlalchemy and sqlalchemy-trino (pip install --upgrade ...), and it worked. This includes sqlalchemy==1.4.27 and sqlalchemy-trino==0.4.1
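A minimal sketch of the kind of check this enables with the upgraded packages; the host, port, and catalog below are placeholders.

import sqlalchemy as sa

# sqlalchemy==1.4.27 with sqlalchemy-trino==0.4.1 registers the trino:// dialect
engine = sa.create_engine("trino://user@trino-host:443/osc_datacommons_dev")
with engine.connect() as conn:
    print(conn.execute(sa.text("SELECT 1")).fetchall())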

MichaelTiemannOSC commented 2 years ago

That's cheating 8-)

We do still have the question about onboarding data more deeply into the Data Commons.

MichaelTiemannOSC commented 2 years ago

Added ISO3166 data by uploading PyCountry data to Trino in a new data pipeline: https://github.com/os-climate/iso3166-ingest
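A minimal sketch of flattening PyCountry's country records into a dataframe suitable for upload (the column selection here is illustrative, not necessarily what the iso3166-ingest pipeline uses):

import pandas as pd
import pycountry

# Flatten the ISO 3166 country records into one row per country
iso3166_df = pd.DataFrame(
    [
        {
            "alpha_2": c.alpha_2,
            "alpha_3": c.alpha_3,
            "numeric": c.numeric,
            "name": c.name,
        }
        for c in pycountry.countries
    ]
)
print(iso3166_df.head())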