openSUSE / vagrant-ceph

Builds a cluster of servers using libvirt. Supports multiple configurations.

prometheus node exporter failing #11

Closed: hemna closed this issue 6 years ago

hemna commented 6 years ago

After running stages 0 and 1, I get failures with the Prometheus node exporter.
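(For reference, I ran those stages with the same invocation as below; a sketch, assuming the standard DeepSea stage commands:)

```
deepsea salt-run state.orch ceph.stage.0
deepsea salt-run state.orch ceph.stage.1
```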


admin:/srv/pillar/ceph/proposals # deepsea salt-run state.orch ceph.stage.2
Starting stage: ceph.stage.2
Parsing ceph.stage.2 steps... ✓

[init]    validate.discovery(cluster=ceph)........................... ✓ (2s)

Stage initialization output:
deepsea_minions          : valid
yaml_syntax              : valid

[1/14]    push.proposal.............................................. ✓ (0.0s)

[2/14]    ceph.refresh on
          data1.ceph................................................. ✓ (0.4s)
          mon1.ceph.................................................. ✓ (0.5s)
          admin.ceph................................................. ✓ (0.3s)

[3/14]    advise.networks............................................ ✓ (0.5s)

[4/14]    ceph.admin.key on
          admin.ceph................................................. ✓ (0.3s)
            |_ file.managed(/srv/salt/ceph/admin/cache/ceph.client.admin.keyring) ✓

[5/14]    ceph.mon.key on
          admin.ceph................................................. ✓ (0.4s)
            |_ file.managed(/srv/salt/ceph/mon/cache/mon.keyring).... ✓

[6/14]    ceph.mgr.key on
            |_ select.minions(cluster=ceph, host=True, roles=mgr).... ✓ (0.5s)
          admin.ceph................................................. ✓ (3s)

[7/14]    ceph.osd.key on
          admin.ceph................................................. ✓ (0.4s)
            |_ file.managed(/srv/salt/ceph/osd/cache/bootstrap.keyring) ✓
            |_ file.managed(/srv/salt/ceph/osd/cache/ceph.client.storage.keyring) ✓

[8/14]    ceph.igw.key on
            |_ select.minions(cluster=ceph, host=True, roles=igw).... ✓ (0.1s)
          admin.ceph................................................. ✓ (3s)

[9/14]    ceph.mds.key on
            |_ select.minions(cluster=ceph, host=True, roles=mds).... ✓ (0.1s)
          admin.ceph................................................. ✓ (3s)

[10/14]   ceph.rgw.key on
            |_ select.minions(cluster=ceph, host=True, roles=rgw).... ✓ (0.1s)
          admin.ceph................................................. ✓ (3s)

[11/14]   ceph.ganesha.key on
            |_ select.minions........................................ ✓ (0.1s)
               (cluster=ceph, host=True, roles=ganesha)
          admin.ceph................................................. ✓ (3s)

[12/14]   ceph.openattic.key on
          admin.ceph................................................. ✓ (0.3s)
            |_ file.managed(/srv/salt/ceph/openattic/cache/ceph.client.openattic.keyring) ✓

[13/14]   ceph.monitoring on
            |_ select.minions(cluster=ceph, host=False).............. ✓ (0.3s)
            |_ select.minions(cluster=ceph, host=False).............. ✓ (0.3s)
            |_ select.minions(cluster=ceph, roles=rgw)............... ✓ (0.1s)
          admin.ceph................................................. ❌ (134s)
            |_ pkg.installed(golang-github-prometheus-prometheus).... ✓
            |_ file.managed(/etc/prometheus/prometheus.yml).......... ✓
            |_ pkg.installed(golang-github-prometheus-alertmanager).. ✓
            |_ pkg.installed(grafana)................................ ✓

[14/14]   ceph.monitoring.prometheus.exporters.node_exporter on
          data1.ceph................................................. ❌ (30s)
            |_ pkg.installed(golang-github-prometheus-node_exporter). ✓
            |_ pkg.installed(cron, smartmontools).................... ✓
          mon1.ceph.................................................. ❌ (29s)
            |_ pkg.installed(golang-github-prometheus-node_exporter). ✓
            |_ pkg.installed(cron, smartmontools).................... ✓
          admin.ceph................................................. ❌ (30s)
            |_ pkg.installed(golang-github-prometheus-node_exporter). ✓
            |_ pkg.installed(cron, smartmontools).................... ✓

Ended stage: ceph.stage.2 succeeded=12/14 failed=2/14 time=190.4s

Failures summary:

ceph.monitoring (/srv/salt/ceph/monitoring):
  admin.ceph:
    start prometheus-alertmanager: Service prometheus-alertmanager has been enabled, and is dead
ceph.monitoring.prometheus.exporters.node_exporter:
  mon1.ceph:
    start node exporter: Service prometheus-node_exporter has been enabled, and is dead
  admin.ceph:
    start node exporter: Service prometheus-node_exporter has been enabled, and is dead
  data1.ceph:
    start node exporter: Service prometheus-node_exporter has been enabled, and is dead
denisok commented 6 years ago

Right, what config are you using?

BOX = 'opensuse/openSUSE-42.3-x86_64'

INSTALLATION = 'salt'

CONFIGURATION = 'tiny'

?

hemna commented 6 years ago

yep.

denisok commented 6 years ago

Are you using a fresh master?

It was fixed with https://github.com/openSUSE/vagrant-ceph/commit/88adae6745be66cba4ebe9243d56420beb0fd2c9

hemna commented 6 years ago
#BOX = 'opensuse/openSUSE-42.2-x86_64'
#BOX = 'SLE12-SP2-migration'
#BOX = 'SLE12-SP3-qa'
#BOX = 'SUSE/SLE-12-SP3'
#BOX = 'opensuse/openSUSE-Tumbleweed-x86_64'
BOX = 'opensuse/openSUSE-42.3-x86_64'

# Set INSTALLATION to one of 'ceph-deploy', 'salt'
INSTALLATION = 'salt'

# Set CONFIGURATION to one of 'default', 'small', 'iscsi' or 'economical'
#CONFIGURATION = 'default'
CONFIGURATION = 'tiny'
#CONFIGURATION = 'dataonmon'

I also had to hack lib/settings.rb to work around the other issue I filed.

diff --git a/lib/settings.rb b/lib/settings.rb
index dc5fcb0..107607b 100644
--- a/lib/settings.rb
+++ b/lib/settings.rb
@@ -15,13 +15,13 @@ def common_settings(node, config, name)
 end

 def libvirt_settings(provider, config, name)
-        provider.host = 'localhost'
-        provider.username = 'root'
+        #provider.host = 'localhost'
+        #provider.username = 'root'

         # Use DSA key if available, otherwise, defaults to RSA
-        provider.id_ssh_key_file = 'id_dsa' if File.exists?("#{ENV['HOME']}/.ssh/id_dsa")
-        provider.connect_via_ssh = true
+#        provider.id_ssh_key_file = 'id_dsa' if File.exists?("#{ENV['HOME']}/.ssh/id_dsa")
+#        provider.connect_via_ssh = true

         # Libvirt pool and prefix value
         provider.storage_pool_name = 'default'
hemna commented 6 years ago

My local clone is up to date with GitHub's master.

denisok commented 6 years ago

OK, checking...

Could you please meanwhile run `salt-run state.orch ceph.stage.2`,

and maybe `salt-run state.orch ceph.stage.2 --log-level=debug`, to get more info on the error?

jschmid1 commented 6 years ago

What does the prometheus-node_exporter log say? Is there maybe a pointer to the missing piece?
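(For example, on one of the failed nodes, plain journalctl should show it:)

```
journalctl -u prometheus-node_exporter.service --no-pager
```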

hemna commented 6 years ago

`no such option: --log-level`

jschmid1 commented 6 years ago

Try appending a simple `-l debug`.
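That is, assuming the same stage invocation as above:

```
salt-run state.orch ceph.stage.2 -l debug
```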

denisok commented 6 years ago

OK, I can reproduce it. Looks like something changed in the packages.

hemna commented 6 years ago

Listing the journalctl logs:

Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Service hold-off time over, scheduling restart.
Jan 29 18:23:55 admin systemd[1]: Stopped Prometheus exporter for machine metrics.
Jan 29 18:23:55 admin systemd[1]: Started Prometheus exporter for machine metrics.
Jan 29 18:23:55 admin node_exporter[14757]: node_exporter: error: unknown short flag '-c', try --help
Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Main process exited, code=exited, status=1/FAILURE
Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Unit entered failed state.
Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Failed with result 'exit-code'.
Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Service hold-off time over, scheduling restart.
Jan 29 18:23:55 admin systemd[1]: Stopped Prometheus exporter for machine metrics.
Jan 29 18:23:55 admin systemd[1]: Started Prometheus exporter for machine metrics.
Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Main process exited, code=exited, status=1/FAILURE
Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Unit entered failed state.
Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Failed with result 'exit-code'.
Jan 29 18:23:56 admin systemd[1]: prometheus-node_exporter.service: Service hold-off time over, scheduling restart.
Jan 29 18:23:56 admin systemd[1]: Stopped Prometheus exporter for machine metrics.
Jan 29 18:23:56 admin systemd[1]: prometheus-node_exporter.service: Start request repeated too quickly.
Jan 29 18:23:56 admin systemd[1]: Failed to start Prometheus exporter for machine metrics.
Jan 29 18:23:56 admin systemd[1]: prometheus-node_exporter.service: Unit entered failed state.
Jan 29 18:23:56 admin systemd[1]: prometheus-node_exporter.service: Failed with result 'start-limit'.
denisok commented 6 years ago

OK, somehow node_exporter now wants its parameters in /etc/sysconfig/prometheus-node_exporter with long "--" flags instead of the short "-" ones that DeepSea writes: https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/exporters/node_exporter.sls#L12
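A minimal before/after sketch of what that means for one generated sysconfig line (the old form is reconstructed from the journalctl error above, not copied from the package):

```
# old single-dash flag style, now rejected by the updated node_exporter:
ARGS="-collector.textfile.directory=/var/lib/prometheus/node-exporter"
# new double-dash style it expects:
ARGS="--collector.textfile.directory=/var/lib/prometheus/node-exporter"
```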

denisok commented 6 years ago

https://build.opensuse.org/package/show/filesystems:openATTIC:3.x/golang-github-prometheus-node_exporter

Looks like that is due to the latest update to prometheus-node_exporter.
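(To confirm which build actually landed on a node, standard queries using the package name from the Salt output:)

```
rpm -q golang-github-prometheus-node_exporter
node_exporter --version
```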

denisok commented 6 years ago

I can work around the node_exporter failure by manually changing /srv/salt/ceph/monitoring/prometheus/exporters/node_exporter.sls after installation:

    ARGS="--collector.diskstats.ignored-devices=^(ram|loop|fd)\d+$ \
          --collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/) \
          --collector.textfile.directory=/var/lib/prometheus/node-exporter"
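(After editing the SLS, re-running the stage, or restarting the unit directly on each node, should pick up the new flags; a sketch:)

```
salt-run state.orch ceph.stage.2
# or, per node:
systemctl restart prometheus-node_exporter.service
```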

But https://build.opensuse.org/package/show/server:monitoring/golang-github-prometheus-alertmanager was changed as well, so DeepSea now fails to update the alertmanager.yml config somehow, and I don't know how to work around that yet.

denisok commented 6 years ago

@jan--f please take a look. Either there is something wrong with alertmanager.yml that the process can't parse:

admin:/home/vagrant # prometheus-alertmanager
level=info ts=2018-01-29T18:50:05.89021018Z caller=main.go:141 msg="Starting Alertmanager" version="(version=, branch=, revision=)"
level=info ts=2018-01-29T18:50:05.89025143Z caller=main.go:142 build_context="(go=go1.9.2, user=, date=)"
level=info ts=2018-01-29T18:50:05.890622859Z caller=main.go:279 msg="Loading configuration file" file=/etc/prometheus/alertmanager.yml
level=error ts=2018-01-29T18:50:05.890967371Z caller=main.go:282 msg="Loading configuration file failed" file=/etc/prometheus/alertmanager.yml err="unknown fields in global: hipchat_url"

or maybe DeepSea should generate it, but that didn't happen.
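(If the only offending key really is that stale `hipchat_url`, one untested manual workaround, just my guess and not a confirmed fix, would be to drop the line and restart:)

```
sed -i '/hipchat_url/d' /etc/prometheus/alertmanager.yml
systemctl restart prometheus-alertmanager.service
```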

jschmid1 commented 6 years ago

Yeah, there seems to be a problem... I'll create a reference issue for DeepSea.

jan--f commented 6 years ago

@denisok Is there an issue with the alertmanager config? At a glance I don't see any change.

denisok commented 6 years ago

@jan--f Looks like it... at least prometheus-alertmanager fails to start.

Easy to reproduce with `BOX = 'opensuse/openSUSE-42.3-x86_64'`.
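(That is, roughly, assuming the repo's usual workflow:)

```
vagrant up
# then, on the admin node:
salt-run state.orch ceph.stage.2
```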

denisok commented 6 years ago

The upstream issue was closed.