thegreenwebfoundation / grid-intensity-exporter

A grid intensity exporter for use with Prometheus. Designed to be used for understanding the carbon footprint of compute.
Apache License 2.0

Make an exporter #1

Closed mrchrisadams closed 2 years ago

mrchrisadams commented 3 years ago

What this issue says, basically.

https://github.com/thegreenwebfoundation/grid-intensity-go/issues/4

rossf7 commented 3 years ago

@mrchrisadams I've added a Helm chart and integration test for Kubernetes support in https://github.com/thegreenwebfoundation/grid-intensity-exporter/pull/4

After this I'd like to add a Nomad task and integration test but I need to figure out how to test that so I did K8s first.

@ofpiyush I haven't added you as a reviewer as there isn't actually any Go code. But if you would like to review it just let me know. 🙏

mrchrisadams commented 3 years ago

@rossf7 I'm thinking through how you might replicate the approach you took with k8s, and apply it to nomad.

As I understand it, in the k8s case the CI workflow is something along the lines of:

right?

I think the equivalent, minimal setup for nomad would be:

Is that what you had in mind?

Based on the nomad docs here, I think it would be a case of:

# run nomad in dev mode
nomad agent -dev

# submit the job defined in the 'grid-intensity.nomad' file, where we can be specific about the port to expose
nomad job run grid-intensity.nomad # submits the job to the running cluster

# hit the endpoint and port exposed, as defined in 'grid-intensity.nomad'
curl http://localhost:8000/metrics
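
For reference, I'd guess a minimal 'grid-intensity.nomad' for this would look something like the sketch below - the image name/tag and the static port mapping are my assumptions rather than anything we've settled on:

job "grid-intensity" {
  datacenters = ["dc1"]
  type        = "service"

  group "exporter" {
    network {
      # fixed host port, so we can hit the metrics endpoint at a known address
      port "metrics" {
        static = 8000
      }
    }

    task "grid-intensity-exporter" {
      driver = "docker"

      config {
        # placeholder image name - point this at whatever image the CI build publishes
        image = "thegreenwebfoundation/grid-intensity-exporter:latest"
        ports = ["metrics"]
      }
    }
  }
}
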
rossf7 commented 3 years ago

@mrchrisadams Thanks, yes that's what I was thinking and running the nomad agent in dev mode sounds ideal.

For the job, the metrics need to be available on port 8000, as the integration test connects to http://localhost:8000/metrics. That looks doable though, so I think this will work great.

mrchrisadams commented 3 years ago

Sup @rossf7 - there's some more notes here that might help:

https://discuss.hashicorp.com/t/local-development-workflow-with-nomad-consul-and-docker/3641/5

Ah… that points to a Vagrantfile in the nomad repo too, showing how they set it up. That nomad setup is pretty extensive, and we might not need it all. In our case, if we follow the example of the integration tests you put together for Kubernetes, I suspect the key thing we'd need would be somewhere to fetch the generated grid-intensity image from to run.

https://github.com/hashicorp/nomad/blob/master/Vagrantfile

mrchrisadams commented 3 years ago

hey @rossf7 I think with #5 and #4 merged in, the main outstanding bit before we can close this might be some docs, and maybe a sketch of how it works.

For a sketch, I could put together something with PlantUML, to demonstrate the configuration for the three setups it might run on (i.e. docker, nomad, k8s).

Anything else?

rossf7 commented 3 years ago

Hi @mrchrisadams, thanks, a sketch would help a lot. 👍 For the docs I'm happy to help as a reviewer or author.

Then I agree I think we can close this.

mrchrisadams commented 2 years ago

Hi @rossf7 !

(this is a bit of a brain dump, and probably ought to be a separate issue, or even a separate project. apologies in advance for it going all over the place)

As I mentioned in thegreenwebfoundation/grid-intensity-go/issues/4, I think there's a way to consume these metrics so that the Nomad scheduler can take them into account when making scheduling decisions.

As you mentioned before, the Nomad autoscaler can consume this exporter data. You can see it referred to in the check stanza of a policy like this, via the APM plugin. In the example below, the check is run regularly to decide whether to change the size of the pool of nodes available to allocate jobs to:

# check for this dynamic value and use it as a criterion when making an auto scaling decision
check {
  source = "prometheus"
  query  = "avg((haproxy_server_current_sessions{backend=\"http_back\"}))"
}

However, I'm not sure the APM plugin would be the ideal place for us to experiment, as that would be used for continuously updating jobs to auto scale to a set target.

You might use this at a nomad server level, to run a query every N minutes to see if a value is within a threshold, and then decide whether to trigger an auto scaling event.

So a policy applied to a job might look like this:

job "important-but-not-urgent-job" {
  # we want to run it to completion then stop
  # we might be okay with it being preempted and delayed
  type        = "batch"

  # we need a full list of datacentres as candidates for placement, and these
  # could be in different regions with different grid intensities
  datacenters = ["dc1", "dc2", "dc3", "dc4"]

  group "machine_learning" {

    # nomad tasks need a label; the name here is arbitrary
    task "compute" {
      driver = "docker"

      config {
        image = "greenweb/computationally-expensive"
      }
    }

    scaling {
      min     = 0
      max     = 10
      enabled = true

      # low carbon compute policy - actively look for client nodes with a carbon intensity close
      # to this level
      policy {
        # check every 30 mins 
        evaluation_interval = "30m"

        # after a reshuffle, don't reshuffle again for at least an hour
        cooldown            = "1h"

        check "target_carbon_intensity" {
          source = "prometheus"
          query  = "scalar(local_carbon_intensity)"

          # when carbon intensity of compute goes above 85 on the index (I think)
          # trigger an autoscale event and reschedule.
          strategy "threshold" {
            upper_bound = 90
            lower_bound = 0
            delta = 5
          }
        }
      }
    }
  }
}

However, I think the APM stuff is designed to decide when to trigger a re-evaluation, but it still wouldn't know how to choose the right nodes to bin pack onto, because any carbon intensity metrics would need to be visible to the scheduler, and I don't think this would result in switching nodes off.

For that, I think we'd need a way to influence the ranking phase of the scheduling, and be able to actively filter nodes out of the ranked list. This monster function seems to be the part that ranks nodes when choosing which nodes to run jobs on:

https://github.com/hashicorp/nomad/blob/main/scheduler/rank.go#L193-L527

You'd probably need a way for that function to query a node's stats during the ranking phase - i.e. look up the local carbon intensity for the node, and use that as a criterion.

This looks like a sample test you might use to see whether ranking candidate nodes returns them in the order you'd expect.

I think we might be able to write a test demonstrating querying for a node property there, and ask the folks in the nomad community forum how you might satisfy that test for carbon awareness. https://github.com/hashicorp/nomad/blob/main/scheduler/rank_test.go#L136-L251

Other related links

This post by Bill Johnson largely explains how they tried something related with k8s, to run jobs in geographically distinct places. I hadn't realised before that they use the same WattTime API that I had been checking out this week.

https://devblogs.microsoft.com/sustainable-software/carbon-aware-kubernetes/

That uses the paid API. If you are only thinking about moving jobs through time, and not across geographic regions, I think the WattTime marginal intensity index API would be sufficient for building a prototype. See this notebook for more: https://nextjournal.com/greenweb/experiments-with-the-free-marginal-carbon-intensity-from-wattime

See also this new paper - they reckon you can get 20% carbon savings through thoughtful scheduling of work that doesn't need to be done right away. https://arxiv.org/abs/2110.13234

rossf7 commented 2 years ago

Hey @mrchrisadams I think we would need to use both APM and Target plugins for this.

However, I think the APM stuff is designed to decide when to trigger a re-evaluation, but it still wouldn't know how to choose the right nodes to bin pack onto,

The APM plugin is used to store the metrics. Using Prometheus for that makes sense to me. We already have the carbon intensity exporter and adding new metrics or running more exporters is easy to do.
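
On the autoscaler side I think that just means pointing the prometheus APM plugin at our Prometheus instance in the autoscaler agent config - roughly something like this, with the address being a placeholder:

apm "prometheus" {
  driver = "prometheus"

  config = {
    address = "http://prometheus.service.consul:9090"
  }
}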

When carbon intensity is high we will want to scale down or to zero. The target plugin can be used for horizontal cluster autoscaling, which is the term used for Nomad adding or removing nodes. For cloud this is straightforward, e.g. on AWS you can use the auto-scaling-group target plugin.
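
As a rough sketch, a cluster scaling policy for that could look something like the below - the metric name, threshold and ASG details are all made up:

scaling "low_carbon_cluster" {
  enabled = true
  min     = 1
  max     = 10

  policy {
    cooldown            = "1h"
    evaluation_interval = "30m"

    check "carbon_intensity_high" {
      source = "prometheus"
      # hypothetical metric name from our exporter
      query  = "scalar(grid_intensity_carbon_average)"

      # when intensity is at or above the lower bound, remove a client node
      strategy "threshold" {
        lower_bound = 400
        delta       = -1
      }
    }

    target "aws-asg" {
      aws_asg_name        = "nomad-clients"   # placeholder ASG name
      node_class          = "batch-workers"   # placeholder node class
      node_drain_deadline = "15m"
    }
  }
}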

For on-prem the nomad target plugin might be useful, as it would allow scaling the number of containers. If there is an API for the physical nodes then a target plugin could be written and used to decide which nodes to shut down.

It's also cloud, but for example this is a target plugin for Digital Ocean. The downside is that a target plugin is needed per infra provider, but that seems to be the architecture they are going for.

This is pretty much identical in Kubernetes. The cluster-autoscaler is used for scaling nodes and there are plugins for multiple providers. I just saw that this includes Hetzner: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider

We can see, but I don't think they would want this logic in the nomad scheduler. It's up to the cluster operator to decide how many nodes the cluster should have. The scheduler then does the bin packing of containers onto nodes.

mrchrisadams commented 2 years ago

Ah, so if I understand correctly: the key difference is that the scheduler would never make any decisions about how big or small the pool of compute might be - it would just take care of distributing the load onto the available nodes with the lowest carbon intensity, right?

That would leave any autoscaler or scaling plugin to be responsible for changing the size of some combination of the pool of nodes (be these cloud VMs or physical machines), and the resources allocated to each job (i.e. the size of the tasks inside a job, as controlled by the task driver - be they containers, regular isolated fork/exec processes, firecracker microVMs and so on).

If that's the case, I can see how this might have a measurable impact even if you were just looking at the scheduler in isolation:

  1. I think the dynamic range of various processors (i.e. how much the power draw changes between states of low and high usage) is wide enough for this to be measurable now. Given the way this is measured in Cloud Carbon Footprint, you could even have the numbers change fast enough to show up in a dashboard, as I think their carbon calculations project carbon figures forward.
  2. if the bin packing is smart enough to free up a node entirely, and it's consistently shown to be surplus to requirements, then it's easier to make the argument that it ought to be removed from the pool

I think this would also leave room for the operator to make decisions or adopt strategies that either:

  1. let some work queue up in anticipation of a period of low carbon compute in the near future (see the paper above) or
  2. accept the higher emissions and increase the resources allocated to the pool, based on other criteria.

In both cases, I think this would support using something like an internal carbon price, or an internal carbon budget over a set period - you'd track cumulative CO2 emissions for a given service against it, and you could have useful discussions about the strategies you might employ, like the two above, to get the work done whilst staying inside the budget (cumulative CO2 as part of an SLO, for example).

Thinking through how you'd split this work.

Based on what you just shared, I now think there are two possible parts to this idea.

  1. having the scheduler able to use carbon intensity on the host/node as a criterion when deciding where (or when) to run jobs
  2. extending an autoscaler to have an awareness of the carbon intensity of different regions, so when a check is run, the pool of resources is expanded in the region with the lowest carbon intensity, assuming other criteria about latency, hardware requirements, cost, data protection are equal

My guess would be that, of these two, the first is the more interesting one to do technologically, and would require less domain knowledge around carbon emissions - we figure out how to add another scalar value to be used in the scheduler, and then we rely on manual decisions to add or remove resources from the pool used for bin packing.
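
For illustration, a very crude approximation of the first option, without patching the scheduler, might be to publish a coarse carbon band as client metadata and prefer it with an affinity. The meta key and values below are made up, and node meta isn't refreshed dynamically, so this would only capture slow-changing differences between sites:

# hypothetical client config fragment - tag each client with a coarse carbon band
client {
  meta {
    carbon_band = "low"
  }
}

# in the job spec - prefer, but don't require, low carbon clients
affinity {
  attribute = "${meta.carbon_band}"
  value     = "low"
  weight    = 100
}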

The second one would involve adding or removing from a pool of resources, where there is freedom to choose between different regions within a single provider. I figure we'd do this by looking at the carbon intensity of each region and picking the lowest one, assuming it still fits the other criteria for the job.

As a starting point, you could use data like this for Google Cloud Platform:

https://github.com/GoogleCloudPlatform/region-carbon-info/blob/main/data/yearly/2020.csv

Or this for Amazon: https://github.com/cloud-carbon-footprint/cloud-carbon-footprint/blob/trunk/packages/aws/src/domain/AwsFootprintEstimationConstants.ts#L144-L171

Which one would you be more interested in?

mrchrisadams commented 2 years ago

Hey @rossf7 I'm gonna close this as things have moved on a bit now :)

Let's look at the first thing:

having the scheduler able to use carbon intensity on the host/node as a criterion when deciding where (or when) to run jobs

Nomad now has a carbon aware scheduler in an experimental branch:

https://github.com/hashicorp/nomad/blob/h-carbon-meta/CARBON.md

It currently consumes data from a couple of providers, but I can't remember if it's using any of the libraries we've worked on.

extending an autoscaler to have an awareness of the carbon intensity of different regions

I think the idea of extending an autoscaler to grow and shrink the pool largely relies on having access to data that changes frequently enough to make autoscaling decisions worthwhile.

While there is access on an individual basis to specific countries, I'm not aware of a handy feed you could use for this. So to make these calls, you'd likely need to use the Electricity Map provider (I think we have API keys for experimenting here), or for us to implement a WattTime provider (we have keys for this too, for experimenting).

mrchrisadams commented 2 years ago

Actually @rossf7 - would you mind closing this issue once you've had a chance to re-read this thread, and once you've created any new issues you see the need for in the respective libraries?

We covered quite a lot in this issue, and I didn't want to close it until we've both had a chance to revisit some of the ideas here in 2022...

rossf7 commented 2 years ago

Hi @mrchrisadams, back to this. I've done another pass through the issue.

I think the idea of extending an autoscaler to grow and shrink the pool largely relies on having access to data that changes frequently enough to make autoscaling decisions worthwhile.

Yes, I agree that without frequent, fresh data an autoscaler can't make effective decisions. The issue you created for the WattTime marginal intensity API, https://github.com/thegreenwebfoundation/grid-intensity-go/issues/25, looks like the best option for this right now.

For autoscaling, a first step would be having Prometheus metrics that include the more frequent data. We could then look at things like KEDA scalers or cluster-autoscaler support if needed, but without frequent data they are not that useful.

The carbon branch for Nomad isn't using grid-intensity-go directly, but the data sources are the same. It's using ElectricityMap and gridintensity.org.uk, plus climateiq.io.

https://github.com/hashicorp/nomad/blob/h-carbon-meta/CARBON.md

Once the WattTime API is integrated we could add it there too, or even create a fork of that branch to experiment with. But again, https://github.com/thegreenwebfoundation/grid-intensity-go/issues/25 is needed for that.

So I'm going to close this, but I'll post in the WattTime issue about possible next steps so we don't lose this.