jenkins-infra / helpdesk

Open your Infrastructure related issues here for the Jenkins project
https://github.com/jenkins-infra/helpdesk/issues/new/choose

[ci.jenkins.io] collect datadog metrics for ephemeral VMs #2980

Closed dduportal closed 2 years ago

dduportal commented 2 years ago

Service(s)

ci.jenkins.io

Summary

Part of high level topic https://github.com/jenkins-infra/helpdesk/issues/2769.

This issue is scoped to only ephemeral VM agents.

It requires the following tasks to be accomplished:

dduportal commented 2 years ago

Installation of the datadog agent on templates: https://github.com/jenkins-infra/packer-images/pull/318

timja commented 2 years ago

Need a datadog API key, stored as a system-level credential (not available for pipelines), injected into the agent startup process (cloud-init)?

Might be easier to inject at build time with packer? (needs to be in something like opt as home directories get recreated with new users and tmp isn't safe from tmp sweepers)

dduportal commented 2 years ago

> Need a datadog API key, stored as a system-level credential (not available for pipelines), injected into the agent startup process (cloud-init)?

> Might be easier to inject at build time with packer? (needs to be in something like opt as home directories get recreated with new users and tmp isn't safe from tmp sweepers)

We thought about this, but with the upcoming "packerization" of the Docker images, that means the packer templates would be publicly available with the API key stored within (today, the VM templates are private in both AWS and Azure, so it is less of an issue).

Hence the "add the key as late as possible" in the process.

timja commented 2 years ago

I don't think there's a way with the Azure vm agents plugin to add the key secretly currently.

dduportal commented 2 years ago

> I don't think there's a way with the Azure vm agents plugin to add the key secretly currently.

I was thinking about using the "VM First Startup Configuration"'s Init script. But I assume the challenge is to allow passing the credential as a variable in this script?

timja commented 2 years ago

Yes, there’s an open issue on the plugin to allow that. Apart from that, you can put it in plain text in the script or somewhere the agent can retrieve it from.

dduportal commented 2 years ago

> Yes, there’s an open issue on the plugin to allow that. Apart from that, you can put it in plain text in the script or somewhere the agent can retrieve it from.

For both EC2 and Azure VMs, it looks OK to put the API key in the script:

Wdyt?

timja commented 2 years ago

Should be fine for datadog, although not ideal
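
A minimal sketch of the "add the key as late as possible" idea discussed above, assuming the init script can run Python on the agent at first boot. The helper name `set_api_key` is hypothetical and not taken from any of the linked pull requests; it only rewrites the text of a Datadog agent configuration so that its top-level `api_key` entry holds the given value.

```python
import re


def set_api_key(config_text: str, api_key: str) -> str:
    """Return config_text with its top-level `api_key` entry set to api_key.

    Hypothetical helper for illustration only: the first existing
    `api_key:` line is replaced; if there is none, a new line is appended.
    """
    line = f"api_key: {api_key}"
    pattern = re.compile(r"^api_key:.*$", flags=re.MULTILINE)
    if pattern.search(config_text):
        # A function replacement avoids backslash escapes being interpreted
        # if the key ever contains unusual characters.
        return pattern.sub(lambda _match: line, config_text, count=1)
    # Make sure the appended entry starts on its own line.
    separator = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return f"{config_text}{separator}{line}\n"
```

In an init script, the result would typically be written to the agent configuration file (usually `/etc/datadog-agent/datadog.yaml` on Linux) before `datadog-agent.service` is restarted. As noted in the thread, the key still appears in plain text in the script itself, so this does not make it secret.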

smerle33 commented 2 years ago

WIP on azure for now : https://github.com/jenkins-infra/jenkins-infra/pull/2347

smerle33 commented 2 years ago

EC2 VMs fail for now with:

Cloud-init v. 22.2-0ubuntu1~20.04.3 running 'modules:config' at Mon, 05 Sep 2022 09:21:26 +0000. Up 40.47 seconds.
Job for datadog-agent.service failed because the control process exited with error code.
See "systemctl status datadog-agent.service" and "journalctl -xe" for details.
Cloud-init v. 22.2-0ubuntu1~20.04.3 running 'modules:final' at Mon, 05 Sep 2022 09:21:28 +0000. Up 42.66 seconds.
Cloud-init v. 22.2-0ubuntu1~20.04.3 finished at Mon, 05 Sep 2022 09:21:29 +0000. Datasource DataSourceEc2Local.  Up 43.16 seconds

investigating ...

smerle33 commented 2 years ago

Manually starting the Datadog agent on an ephemeral EC2 VM triggered a disk space alert. We will have to specify new limits for those agents so that we do not get spammed with false alerts.

dduportal commented 2 years ago

No action required for the disk alert: it makes sense to alert if the hard drive is full. The default agents have ~100 GB free, so all is good :)

dduportal commented 2 years ago

Status:

ToDo:

dduportal commented 2 years ago

WIP on https://github.com/jenkins-infra/helpdesk/issues/3149

smerle33 commented 2 years ago

The Datadog dashboard may not be the solution: a public dashboard cannot provide selectors on top (template variables); they are grayed out in the public view (thanks Michael Pailloncy). The straightforward solution to "just" get the Jenkins agents would be to tag them within the datadog.yaml file, which is not really ideal. I will get in touch with @jtnord to see whether he could deal with just a call for info from us, or whether he can help in any way to solve this.
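
For reference, the tagging option mentioned in this comment could look roughly like the sketch below. It renders a top-level `tags:` list in the form the Datadog agent reads from `datadog.yaml`. The function name `render_tags_block` and the tag values in the usage note are hypothetical, and the sketch assumes tags contain no double quotes.

```python
def render_tags_block(tags: list[str]) -> str:
    """Render a top-level `tags:` YAML list for a datadog.yaml file.

    Hypothetical helper for illustration only; assumes no tag contains
    a double quote character.
    """
    if not tags:
        return "tags: []\n"
    lines = ["tags:"]
    for tag in tags:
        # Quote each tag because "key:value" tags contain a colon.
        lines.append(f'  - "{tag}"')
    return "\n".join(lines) + "\n"
```

For example, `render_tags_block(["role:jenkins-agent"])` produces a two-line block that could be appended to the agent configuration, so that a dashboard query can filter on that tag instead of relying on template variables.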

dduportal commented 2 years ago

Closing as we now have metrics from all ephemeral agents of ci.jenkins.io flowing in datadog.