thegreenwebfoundation / grid-intensity-exporter

A grid intensity exporter for use with Prometheus. Designed to be used for understanding the carbon footprint of compute.
Apache License 2.0

Add integration test for running this in a nomad cluster, for #1 #5

Closed mrchrisadams closed 3 years ago

mrchrisadams commented 3 years ago

This PR should demonstrate the use of the grid-intensity-exporter for Prometheus, running in a development Nomad cluster.

Well… when it works that is.

I'm sharing this PR mainly to get some input as I learn how to use nomad for this.

Once you have a nomad agent running in dev mode with the following command...

nomad agent -dev

You should be able to run this job with the following invocation:

nomad run grid-intensity-exporter.nomad

I'm running this on a 2015 MacBook Pro, running macOS 10.15.x.

What I need help with

However, I say "should" because my Docker-fu is terrible and I'm still learning Nomad.

What I think is going wrong is that the jobfile isn't specifying where to look for the docker image to run, and is falling back to the public docker hub - hence the 'denied' bit in the logs:

    2020-12-14T11:29:10.311+0100 [ERROR] client.driver_mgr.docker: failed pulling container: driver=docker image_ref=grid-intensity-exporter:latest error="API error (404): pull access denied for grid-intensity-exporter, repository does not exist or may require 'docker login': denied: requested access to the resource is denied"
    2020-12-14T11:29:10.311+0100 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=69551864-a7bd-aa64-7384-6dad0a34edf7 task=grid-intensity-exporter error="Failed to pull `grid-intensity-exporter`: API error (404): pull access denied for grid-intensity-exporter, repository does not exist or may require 'docker login': denied: requested access to the resource is denied"
    2020-12-14T11:29:10.311+0100 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=69551864-a7bd-aa64-7384-6dad0a34edf7 task=grid-intensity-exporter reason="Exceeded allowed attempts 2 in interval 30m0s and mode is "fail""
    2020-12-14T11:29:10.312+0100 [INFO]  client.gc: marking allocation for GC: alloc_id=69551864-a7bd-aa64-7384-6dad0a34edf7

I think we either need to specify in the jobfile which registry to look in for the Docker image, or maybe map the entire directory into the job so the Dockerfiles and so on are visible.

I'm not sure though, and I won't be able to check for a few days. Hopefully posting this here will resolve it, or at least shed some light on the problem until I can revisit it.

benmarsden commented 3 years ago

Hi Chris,

What I think is going wrong is that the jobfile isn't specifying where to look for the docker image to run, and is falling back to the public docker hub

I believe this is spot on. Where are you pushing the grid-intensity-exporter:integration-test image? If it's a public Docker Hub image, the easy solution is to update the image reference to (for example) <org>/<image-name>:<tag>.

However, it gets interesting if it's a private registry and credentials get involved. The easy answer if this is the case is to add the auth stanza to your job file like so:


    task "grid-intensity-exporter" {
      driver = "docker"

      config {
        image = "grid-intensity-exporter:integration-test"

        auth {
          username = "<username>"
          password = "<password>"
        }

        #...
      }
    }

However, this wonderful declarative and GitOps-oriented world we live in means checking your registry password into Git if you do this, which is obviously no bueno! There are approaches to solving this, though in my opinion none are quite as elegant as what Kubernetes offers, and they vary depending on your config management approach. Just some examples that stand out in my mind:

I cannot think of other solutions at the moment, but let me know if we're going to need to tackle this and I can try to think about it further 👍

mrchrisadams commented 3 years ago

Thanks @benmarsden , that really helps.

In this case, I was pulling it from what I think would be the 'local' registry on my machine, but my goal would be to have it on a public repository, so that anyone can contribute by pulling it down and starting to work with it. I think simply pushing the image to a public repository would cover that.

I figure there should be a simple enough way for a local Nomad agent running on my laptop (or on a VM on my laptop) to know about the local Docker registry on my laptop, as pushing an image all the way to an internet-accessible repo only to pull it back down to run locally sounds crazy.
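
Something like this is the local-registry flow I have in mind (the registry port and image names here are just illustrative, not what's actually in the repo):

    # run a throwaway local registry on port 5000
    docker run -d --name registry -p 5000:5000 registry:2

    # tag the locally built image and push it to that registry
    docker tag grid-intensity-exporter:integration-test localhost:5000/grid-intensity-exporter:integration-test
    docker push localhost:5000/grid-intensity-exporter:integration-test

    # the nomad jobfile would then reference the image as
    # image = "localhost:5000/grid-intensity-exporter:integration-test"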

I think there's enough in this issue now for me to ask a question in the hashicorp forums, and point back to this.

Thanks again for the pointers 👍

mrchrisadams commented 3 years ago

OK, I think I have this working, and in manual testing I see the output I expected to see.

I've cleaned up the sample Nomad job file and added a skeleton GitHub Action; with some luck, we can either find existing actions corresponding to the steps I've written, or get some guidance on how to carry them out.

I've used some obviously-not-real GitHub Actions in the workflow file, which we'd replace with actual code, but once we have this we ought to have at least a minimal example to demonstrate its intended use.

mrchrisadams commented 3 years ago

As an aside, the thing that tripped me up turned out to be a bug in Nomad itself:

https://github.com/hashicorp/nomad/issues/8934

When this is resolved, I think we could update the example job too.

@benmarsden - do you know of any useful GitHub Actions for Nomad, or API events to listen to in order to check that a submitted job has started running?

This stuff is still new to me, but I figure it's got to exist…

mrchrisadams commented 3 years ago

OK, I think I've figured out a way to wait until a job reports as running successfully.

The bash script I've added shows how you can call nomad, then run a loop polling the API until we get the result we expect, before proceeding with the integration test. It's not pretty, but it works locally and relies only on bash.
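
Roughly, the shape of the script is something like this (the real script may differ in the details; it assumes jq is installed and the Nomad API is on the default localhost:4646):

    #!/usr/bin/env bash
    set -euo pipefail

    # submit the job
    nomad run grid-intensity-exporter.nomad

    # poll the Nomad HTTP API until the job reports as running, or give up
    for attempt in $(seq 1 30); do
      status=$(curl -s http://localhost:4646/v1/job/grid-intensity-exporter | jq -r .Status)
      if [ "$status" = "running" ]; then
        echo "job is running after $attempt attempt(s)"
        exit 0
      fi
      echo "attempt $attempt: job status is '$status', waiting..."
      sleep 2
    done

    echo "job did not reach the running state in time" >&2
    exit 1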

I think we now need a reliable way to download the nomad binary, chmod it so it's executable, and then run it.
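
Fetching it directly would look something like this (the version number is only an example; we'd pin it to whatever the current stable release is):

    # example version only - pin this to the current stable release
    NOMAD_VERSION=1.0.1
    curl -sSLo nomad.zip "https://releases.hashicorp.com/nomad/${NOMAD_VERSION}/nomad_${NOMAD_VERSION}_linux_amd64.zip"
    unzip nomad.zip
    chmod +x nomad
    sudo mv nomad /usr/local/bin/
    nomad version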

We could fetch this each time, but this feels like a thing we'd want to be able to cache, as we only want to change it when the latest stable release of nomad comes out.

This is where my knowledge of GitHub Actions runs out, so I'll need some help from a kind soul to finally get this last bit passing…

benmarsden commented 3 years ago

@mrchrisadams I have used https://github.com/actions/cache before as an action for caching go modules. I wonder if that would be appropriate for binaries too?

rossf7 commented 3 years ago

@mrchrisadams @benmarsden I had a look at this last night and created https://github.com/thegreenwebfoundation/grid-intensity-exporter/pull/6 based on this. I have a couple of ideas but I'm having a weird problem with the wait script.

For installing Nomad, since the action is running on an Ubuntu box, I think we could use apt-get. HashiCorp have their own apt repository that they push new releases to: https://www.nomadproject.io/docs/install WDYT?
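
From memory the documented steps are roughly the following, though they may have changed since, so worth double-checking against that page:

    curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
    sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
    sudo apt-get update && sudo apt-get install -y nomad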

For the image, it's being built on the Ubuntu box, so I think the Nomad agent can access it, but I couldn't confirm that. We could also try using a local registry, like in the Docker test?

But I think a public Docker image would be useful, to make the exporter easier to use. I think we could use the Docker GitHub Action to push a new tag whenever we create a release, and to update the latest tag.

For the wait script, it works great for me locally, but in the action I added nomad job status grid-intensity-exporter and the job is still pending, so the integration test fails.

rossf7 commented 3 years ago

Hey @mrchrisadams I had another go at #6 and got the test working. It needed 2 changes.

There is a typo in the Docker image in the Nomad task: it's missing the thegreenwebfoundation prefix. This is why the Nomad agent couldn't start the job.

Rather than using the wait script I added an exponential backoff to the go integration test. This makes the test a bit more complicated but I think it also makes it more reliable.

Hope that helps. Let me know if you want to apply the changes here or go with #6.

mrchrisadams commented 3 years ago

Hi @rossf7, would you mind adding the changes to this PR and squashing them in like before?

My reasoning is that being able to refer to the commentary in #5 would be helpful for our future selves when trying to understand why we took the steps we did. There's not as much context in #6 for us to look up.

Also - I like your exponential backoff approach for the Go test, and I'm happy to remove the wait-for bash script from this repo.

For what it's worth, consider this commentary an approval from my end of the approach in #6 :)