thegreenwebfoundation / grid-intensity-go

A tool written in Go to help you factor carbon intensity into decisions about where and when to run computing jobs.
Apache License 2.0

Explore if 'on-device' ML model inference might be a valid alternative to relying on external APIs #70

Open mrchrisadams opened 1 year ago

mrchrisadams commented 1 year ago

We have a number of providers in the grid-intensity-go project, and I recently learned that open data exists for hourly carbon intensity figures for electricity consumption, for every balancing authority in the USA. It's also quite recent - the latest readings are only ever a few days old.

With access to this, I wonder if it might be possible to make a very simple ML model to give an idea of the likely carbon intensity figures for a given hour on a given day of the year, using something like TensorFlow or TensorFlow Lite.

See the link below for more about TensorFlow Lite: https://www.tensorflow.org/lite/guide

As I currently understand it, when you train a model with data (like carbon intensity data in our case), the output artefact is a .tflite TensorFlow Lite model file that can be consumed by a number of libraries, written in JavaScript, Python, Go and so on.

This would make it possible to do on-device, private inference to get an idea of the likely average carbon intensity for a given hour on a given day of the year, without needing to hit an external network service. We could use this .tflite model in this project, but it could just as easily be consumed in CO2.js or any other project where this would be useful.

With this approach, you'd ideally be able to see different figures for the likely carbon intensity of electricity for, say… 4pm in the middle of November vs 4pm in June - something I don't know how to do without making external API requests, or running a full-blown program locally.

This would help with basic forecasting or estimation to inform scheduling or reporting in future, while still leaving room for fetching more accurate and precise numbers if you need them.

Anyway, here's the Go library we might mess around with to see if this is possible / sensible: https://github.com/mattn/go-tflite
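
To make that concrete, here's a rough sketch of what the on-device inference side could look like with go-tflite. The model file name, the input features (hour-of-day, day-of-week, week-of-year), and the output unit are all assumptions for illustration - we'd only know the real shapes once a model had actually been trained:

```go
package main

import (
	"fmt"
	"log"

	"github.com/mattn/go-tflite"
)

func main() {
	// Hypothetical model: trained elsewhere on the hourly EIA data, taking
	// (hour-of-day, day-of-week, week-of-year) as float32 inputs and
	// emitting a single carbon intensity estimate.
	model := tflite.NewModelFromFile("carbon-intensity.tflite")
	if model == nil {
		log.Fatal("failed to load model")
	}
	defer model.Delete()

	options := tflite.NewInterpreterOptions()
	defer options.Delete()

	interpreter := tflite.NewInterpreter(model, options)
	defer interpreter.Delete()

	if interpreter.AllocateTensors() != tflite.OK {
		log.Fatal("failed to allocate tensors")
	}

	// 4pm on a Wednesday in mid November (ISO week 46, zero-indexed as 45).
	copy(interpreter.GetInputTensor(0).Float32s(), []float32{16, 3, 45})

	if interpreter.Invoke() != tflite.OK {
		log.Fatal("inference failed")
	}

	estimate := interpreter.GetOutputTensor(0).Float32s()[0]
	fmt.Printf("estimated carbon intensity: %.0f gCO2e/kWh\n", estimate)
}
```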

This issue exists to collect notes and links, to see if there is anything to this idea - my current thinking is that it might be cool, but I'm not planning to invest more time into exploring this until I've heard from people who know more than me about it, as I'm at the limits of my knowledge.

mrchrisadams commented 1 year ago

Sidenote: I think this would also help our own IP-to-CO2 API - in fact, it would probably make the IP-to-CO2 API much more useful.

If we could check a given IP's coordinates against balancing authority shapefiles, we'd have the info to make the wacky carbon-aware networking stuff we looked into before quite a bit more plausible.

The usual caveats apply for GeoIP stuff about accuracy but still, it could be interesting.

JackKelly commented 1 year ago

Hi Chris! Sounds interesting!

Please may I ask a few quick questions:

  1. How much history is available for US carbon intensity?
  2. How resource-constrained is your embedded device? I assume that if you're thinking about running TFLite on it, then it must have a fair amount of performance!

On the topic of not making any API calls: as @peterdudfield mentioned in Slack, you'll get much better forecasts for carbon intensity if you have access to valid weather forecasts (especially for wind speed, irradiance, and temperature) - which would require API calls.

I'd also echo @peterdudfield's suggestion of keeping things as simple as possible: instead of using TFLite, use something super-simple like linear regression, or a boosted regression tree (available in sklearn; or use XGBoost).

If you really don't want to call an API, then you definitely can learn a function which maps from just time-of-day, day-of-year, and type-of-day (weekend / holiday / weekday) to carbon intensity (and maybe also include the most recent actual carbon intensity numbers?). E.g. just feed the hour-of-day (encoded as an int in the range [0, 24)), day-of-week [0, 7), and week-of-year [0, 52) into a small boosted regression tree, and have it predict the carbon intensity.
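
As a quick sketch of what that feature encoding might look like in Go (just the encoding - the regression model that would consume these features is left out):

```go
package main

import (
	"fmt"
	"time"
)

// featurize maps a timestamp to the calendar features suggested above:
// hour-of-day in [0, 24), day-of-week in [0, 7), week-of-year in [0, 52).
func featurize(t time.Time) [3]float64 {
	_, week := t.ISOWeek() // ISO weeks run 1..53, so clamp into [0, 52)
	if week > 52 {
		week = 52
	}
	return [3]float64{
		float64(t.Hour()),
		float64(t.Weekday()), // Sunday = 0 ... Saturday = 6
		float64(week - 1),
	}
}

func main() {
	// 4pm on Wednesday 15 November 2023.
	t := time.Date(2023, time.November, 15, 16, 0, 0, 0, time.UTC)
	fmt.Println(featurize(t)) // [16 3 45]
}
```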

mrchrisadams commented 1 year ago

Hi @JackKelly, thanks for chiming in :)

How much history is available for US carbon intensity?

We have access to hourly carbon intensity of consumption information for every balancing authority in the USA right now, with a delay of a few days from realtime.

I'll be tidying up this repo, which explains it in more detail, in the coming weeks:

https://github.com/mrchrisadams/hourly-carbon-intensity-usa

How resource-constrained is your embedded device? I assume that if you're thinking about running TFLite on it, then it must have a fair amount of performance!

Honestly, I don't have a really good idea.

TFLite was the first thing I found that seemed smallish and easy to run on-device - the intention here was something from Raspberry Pi Zero size upwards, I guess, a bit like the size of server used by Scott in the link below:

https://scott.ee/project/solar-hosting-raspberry-pi/

If it works on something small like that, I figure it would also work in bigger setups, and be easy to deploy on a wide range of VMs.

About connectivity - the lack of network requests was largely based around the idea of removing barriers to participation. If there's a way to make network requests that would help, it might make sense, but to begin with, my thinking was that just pulling down a small file at install time and then doing on-device inference would be neat.

I hadn't really considered how you'd handle updates yet, as I wasn't sure what artefacts you have available to download when working with models like the ones suggested.

JackKelly commented 1 year ago

Sounds good!

On the question of how much historical data is available: you'd want at least one year of training data, ideally multiple years. Do you know how many years of training data you can get your hands on?

mrchrisadams commented 1 year ago

Yeah, there's hourly data for every balancing authority in the US at the link below:

https://www.eia.gov/electricity/gridmonitor/dashboard/electric_overview/US48/US48

(Screenshot: EIA Grid Monitor dashboard, 2023-06-09)

There are hourly readings of generation by type (gas, solar, etc.) for each balancing authority, as well as imports / exports, going back to 2015 - all open data. From 2018 or 2019 (it varies across the balancing authorities) there is a new column riiiiight at the end of the spreadsheet, with a calculated figure for the carbon intensity of consumed power.

https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAL.xlsx

To give an idea of how up to date this is, the latest hourly reading with CO2 figures for consumption is midnight June 8th.

My guess is that every BA from 2019 onwards might have hourly data, as well as broken down figures for generation and import / export.

As an example, I've uploaded a sample zipfile containing a parquet file of the last 4-5 years of hourly carbon intensity readings for the Western Area Power Administration - Upper Great Plains West balancing authority, which serves Iowa, Minnesota, Montana, Nebraska, North Dakota, and South Dakota.

These parquet files are typically around 500 KB in size, versus around 40 MB for the spreadsheet you download for each balancing authority.

I have no idea how big the files are that contain whatever serialisation of data ML programs need to consume. Pointers or links to read would be very welcome.

Anyway, hope that's interesting.

hourly-co2-usa.ztd.parquet.zip

rossf7 commented 1 year ago

Hi Chris & Jack, just to chime in here too. I like the idea a lot!

It lowers the barrier to entry, and it's similar to the Ember dataset we already embed in the binary, but a much more granular dataset. We also have some caching support we could use for this, which is likely better if the files are large.

I would agree with @JackKelly and @peterdudfield that including weather forecasts would be beneficial - maybe there is a US National Weather Service API or similar we could use?
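
For what it's worth, the National Weather Service does have a free JSON API at api.weather.gov. Here's a rough, untested sketch of fetching an hourly forecast for a lat/lng - it's a two-step lookup (resolve a point to its gridpoint, then fetch that gridpoint's hourly forecast), though the exact response fields are worth double-checking. The coordinates and contact address below are placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// Minimal subsets of the api.weather.gov responses we care about.
type pointsResp struct {
	Properties struct {
		ForecastHourly string `json:"forecastHourly"`
	} `json:"properties"`
}

type forecastResp struct {
	Properties struct {
		Periods []struct {
			StartTime   string `json:"startTime"`
			Temperature int    `json:"temperature"`
			WindSpeed   string `json:"windSpeed"`
		} `json:"periods"`
	} `json:"properties"`
}

func getJSON(url string, v interface{}) error {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return err
	}
	// api.weather.gov asks clients to identify themselves via User-Agent.
	req.Header.Set("User-Agent", "grid-intensity-go experiment (hello@example.com)")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(v)
}

func main() {
	// Step 1: resolve a lat/lng (Minneapolis here) to its gridpoint metadata.
	var points pointsResp
	if err := getJSON("https://api.weather.gov/points/44.9778,-93.2650", &points); err != nil {
		log.Fatal(err)
	}

	// Step 2: fetch the hourly forecast for that gridpoint.
	var forecast forecastResp
	if err := getJSON(points.Properties.ForecastHourly, &forecast); err != nil {
		log.Fatal(err)
	}

	for _, p := range forecast.Properties.Periods[:3] {
		fmt.Printf("%s: %d°F, wind %s\n", p.StartTime, p.Temperature, p.WindSpeed)
	}
}
```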

However, an offline version would be useful for some devices, and simpler to develop while we evaluate the idea further.

ssuffian commented 1 year ago

Hi all! I just spoke to @mrchrisadams and learned about some of the great work that you are doing! Here is an open source library that we use for pulling weather data from the US (and soon to be international): eeweather. It's a wrapper on the FTP site that NOAA hosts that contains hourly temperature data for all weather stations in the US, as well as code to help match a lat/lng to a weather station.

Also, the eemeter library itself might be helpful. It's a time-of-week-and-temperature regression model typically used to forecast electricity consumption (for the purposes of calculating savings from energy-efficiency projects) but could be used for any sort of weather and time-based forecasting. This maybe won't exactly fit the bill for predicting carbon intensity, but might be worth a shot.

These libraries are part of the LF Energy's OpenEEMeter project.

peterdudfield commented 1 year ago


https://herbie.readthedocs.io/en/stable/ is another good open source project for pulling various weather data.