MichaelTiemannOSC opened 3 years ago
Created repository and committed changes for https://github.com/os-climate/essd-ingest-pipeline
@caldeirav @erikerlandson I still don't feel very confident about the metadata I'm collecting and storing. Would love to have your eyes on it and your comments. Better documentation would lead to greater confidence, probably for many.
Regarding UUID as a column:
1) If we set up uuid as one of the partition columns, it will not be stored as an actual column in the underlying Parquet or ORC files.
2) Even if we do not use it as a partition column, the columnar compression formats should achieve very high compression rates, since it is constant across many rows. This will be particularly true if we make sure it can leverage delta encoding: https://github.com/apache/parquet-format/blob/master/Encodings.md
So it looks to me like I'm not storing UUID as a column, but rather as part of the Parquet naming structure, hence it's a partition column. So I should probably change my comment from a question ("Do we really want to add UUID to every row, or better to put into table-level metadata?") to a statement ("Use UUID as one of our partition columns") and make it regular, non-bold text.
Regarding the onboarding of PUDL data, we have a bit of a strategic decision to make. What the PUDL environment provides is a convenient way for data scientists to easily access utility info in a very pandas-friendly way (using their own SQLAlchemy connector to SQLite databases). It also creates an abstraction layer so users don't have to think about what data is in SQLite (all the plant/utility/generator/generation/fuel data) and what is in Parquet (all the emissions data). The pudl.sqlite database may be something we connect to Trino via our own SQLAlchemy connector. We could of course explode the pudl.sqlite database into its constituent tables and create our own rendition of the data in Trino tables. The Parquet files are of course very easy to integrate.
But...what sort of centralization and consistency do we want to impose on data sources in terms of running through Trino vs. a concept of federation that could be as wide-open as saying pip install pudl (they highly encourage using conda) and filling one's dataframes from a data source that draws from the Data Commons but not from Trino per se?
Connecting Trino directly to SQLite would be nice. I think if they ever get the JDBC connector working this will become easy: https://github.com/trinodb/trino/pull/3105
Failing that, my concern with just exposing a deployment of pip install pudl is that it is higher friction to federate with anything on Trino, although one could do the federation in "pandas space" as sketched below.
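Just to make the "pandas space" option concrete, here is a minimal sketch; the pudl.sqlite path, Trino connection string, table names, and join columns are all assumptions for illustration, not our actual deployment:

import pandas as pd
import sqlalchemy as sa

# PUDL side: read straight from a local pudl.sqlite (path is hypothetical)
pudl_engine = sa.create_engine("sqlite:///pudl.sqlite")
utilities = pd.read_sql("SELECT * FROM utilities_eia860", pudl_engine)

# Trino side: pull a table from the Data Commons
# (host, catalog, schema, table, and columns are hypothetical)
trino_engine = sa.create_engine("trino://user@trino-host:443/osc_datacommons_dev")
gleif = pd.read_sql("SELECT lei, legal_name FROM sandbox.gleif_entities", trino_engine)

# The "federation" then happens client-side, as an ordinary pandas merge
merged = utilities.merge(
    gleif, left_on="utility_name_eia", right_on="legal_name", how="left"
)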
When we meet with the PUDL people, I'd like to discuss supporting some other strategic DBs, and ideally a single unified SQL interface to their entire set (for example, MySQL, PostgreSQL, and BigQuery might be interesting, and are already supported by Trino). In such a scenario, we could run PUDL as a microservice and configure a Trino connector to it.
We could certainly write an ingest pipeline that consumes both their Parquet and SQLite data and injects both into Trino. This seems less elegant, but it would be effective.
usql seems to be able to bridge from MySQL to Trino. Wonder why Trino cannot do it directly.
On the usage of UUID: in my latest ingestion demo sample (leveraging Iceberg), UUID is used for partitioning data at the dataset level. Design-wise I feel it makes sense because we can only do deletes based on partition elements. Could have used a timestamp as well, but the timestamp is already captured by Iceberg in the snapshot table and may not be consistent for all INSERTs into the table.
To wit...

# Build the column-name/type pairs from the source dataframe
schema = create_table_schema_pairs(df)

# Create an Iceberg table partitioned by uuid, so each ingested dataset
# lands in its own partition
tabledef = """
create table if not exists osc_datacommons_iceberg_dev.{sname}.{tname} (
    {schema}
) with (
    format = 'parquet',
    partitioning = ARRAY['uuid']
)
""".format(schema=schema, sname=schemaname, tname=tablename)
print(tabledef)
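And the corresponding delete path would then drop a whole ingested dataset by its partition value, along these lines (dataset_uuid and the catalog/schema/table names here are placeholders, not taken from the actual demo):

# Remove an entire ingested dataset by its uuid partition value
# (dataset_uuid is a placeholder for the UUID assigned at ingest time)
dropdef = """
delete from osc_datacommons_iceberg_dev.{sname}.{tname}
where uuid = '{uuid}'
""".format(sname=schemaname, tname=tablename, uuid=dataset_uuid)
print(dropdef)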
Thanks!
usql seems to be able to bridge from mysql to trino. Wonder why Trino cannot do it directly.
I do not understand why the Trino dev community has not given a generalized JDBC connector higher priority, since it would allow Trino to connect to practically any database that exposes a SQL interface, probably including other Trino instances.
What they explain about halfway through the issue you posted is that they doubt that general solutions can be adequately tested. Instead, they put their faith in rigorous testing of specific connectors. Now...why there is no support for (super-popular) SQLite is a question they don't answer.
@HeatherAck -- your encouraging response led me to flesh this out a bit more, and I think it's turning into a project, but a very doable one, since all the data has been uploaded. It thus may truly qualify as a worthy sprint. As elaborated above, these datasets connect both with GLEIF and with each other, and by making these connections, we are well on our way to having enriched data. For example, one gap in the current RMI data is "Enterprise Value + Cash", but these are numbers that can be pulled from the SEC DERA data for publicly traded companies. Indeed, we now have richly described public financial data to go along with granular environmental data within a context of global emissions (and potentially climate) data.
If we could get this in shape by Dec 1st, it will give the Data Commons gigabytes of heft.
Couple of notes here:
Also I am starting a small POC now for metadata handling on visualisation / client side. Need to complete this before finalising metadata handling and structure.
Here's an annotated picture showing both how things fit together and also a sense of how many records we are attempting to fit together. Note that the 3K owners referenced in PUDL don't even scratch the surface of the millions of rows of emissions data also in the PUDL database, for example.
@ludans @DavWinkel @ChristianMeyndt @ttschmucker @toki8
xref https://github.com/trinodb/trino/issues/10060 for SQLite connector support
Just checking in with the fact that my LEI matcher matches 3966 of the 5866 DERA companies, a 68% match rate. There are about 1890 that don't yet match, but the ones I hand-checked have no matches on the GLEIF website, so I suspect we are suffering from a lack of LEI coverage.
(I previously reported much larger numbers in all categories, but this is because I forgot to run sort prior to uniq when generating my input to the matcher.)
Alas, PUDL requires SQLAlchemy version 1.4 to do some of its things (it can do a few things fine without it). But our Trino environment requires SQLAlchemy version 1.3 in order to function at all. So PUDL and Trino are not good bunkmates right now.
I've checked in a demo notebook that begins to use the connections possible in the above diagram. I've also left a comment in the code that there is a LONG way to go before queries that seem simple are simple.
https://github.com/os-climate/data-platform-demo/blob/master/notebooks/EPA_GHGRP-demos.ipynb
PUDL data integration may depend on this: https://github.com/catalyst-cooperative/pudl/pull/1361
I just ran some operations using the latest sqlalchemy and sqlalchemy-trino (pip install --upgrade ...), and it worked. This includes sqlalchemy==1.4.27 and sqlalchemy-trino==0.4.1
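As a quick sanity check (the host, catalog, and pudl.sqlite path below are placeholders), both engines can now be created in the same environment under SQLAlchemy 1.4:

import sqlalchemy as sa

print(sa.__version__)  # expect 1.4.x

# Trino via sqlalchemy-trino (placeholder host/catalog)
trino_engine = sa.create_engine("trino://user@trino-host:443/osc_datacommons_dev")

# PUDL's SQLite database (placeholder path)
pudl_engine = sa.create_engine("sqlite:///pudl.sqlite")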
That's cheating 8-)
We do still have the question about onboarding data more deeply into the Data Commons.
Added ISO3166 data by uploading PyCountry data to Trino in a new data pipeline: https://github.com/os-climate/iso3166-ingest
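As a rough illustration (the actual pipeline is in the iso3166-ingest repo linked above), the PyCountry data reduces to a small dataframe that can then be ingested with the same create-table pattern used elsewhere:

import pandas as pd
import pycountry

# Flatten the pycountry ISO 3166-1 records into a dataframe
iso3166_df = pd.DataFrame(
    [
        {
            "name": c.name,
            "alpha_2": c.alpha_2,
            "alpha_3": c.alpha_3,
            "numeric": c.numeric,
        }
        for c in pycountry.countries
    ]
)
print(iso3166_df.head())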
Global Warming Potential (GWP) is an essential metric for any portfolio decarbonization decision. There are many bottom-up datasets that contain information about CO2e emissions, but it is difficult to judge their completeness. The ESSD dataset (https://essd.copernicus.org/articles/13/5213/2021/essd-13-5213-2021-assets.html) provides a peer-reviewed, global estimate of all GWP emissions from 1970 to the present day, with attribution to sub-region and sub-sector levels, according to IPCC taxonomies. Three Excel spreadsheets comprising 37MB of raw data (about 800K rows) provide a good sense of the total scale of the GWP problem.
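A first pass at loading those spreadsheets is straightforward with pandas (the filename and sheet name below are assumptions; the real ones come with the ESSD download):

import pandas as pd

# Filename and sheet name are placeholders for the ESSD download
essd_df = pd.read_excel(
    "essd_ghg_data.xlsx",
    sheet_name="data",
    engine="openpyxl",  # needed for .xlsx files
)
print(len(essd_df), "rows")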
With that top-down starting point, we can onboard bottom-up data that fills in details as to specific sources on a country or regional basis, and also on a company, industry, sector, or super-sector basis. For example, the PUDL data provides detailed emissions data for US utilities (including US territories and protectorates). Other datasets can be identified to fill gaps in other locations, across other industries, sectors, etc. But most importantly, with a top-down view, we will have some idea how large a gap a given dataset can fill in terms of GWP equations and analyses.
And of course, we want to tie any corporate ownership information to Legal Entity Identifiers, so that corporate hierarchies and connections with financial instruments can be understood.
A minimal satisfaction of this issue will therefore be:
It would be wonderful to ultimately be able to produce maps like this from our data: https://openghgmap.net/ (see also here: https://openghgmap.net/data/)
Should we also add climate data to the grid so that we can correlate temperature/precipitation/wind speed, etc. with macro-level effects like population change, energy use, etc.? Sources referenced from this survey paper (https://www.nature.com/articles/s41597-020-00726-5) include WorldClim 2.1 (https://www.worldclim.org/data/worldclim21.html), which provides data from 1970-2000 and is the most cited source in the literature. WorldClim also offers monthly data that is more current (through 2018), but only for a few variables (temp_min, temp_max, precipitation).