hemna closed this 6 years ago
right, what config are you using?
BOX = 'opensuse/openSUSE-42.3-x86_64'
INSTALLATION = 'salt'
CONFIGURATION = 'tiny'
?
yep.
are you using a fresh master?
it was fixed with https://github.com/openSUSE/vagrant-ceph/commit/88adae6745be66cba4ebe9243d56420beb0fd2c9
#BOX = 'opensuse/openSUSE-42.2-x86_64'
#BOX = 'SLE12-SP2-migration'
#BOX = 'SLE12-SP3-qa'
#BOX = 'SUSE/SLE-12-SP3'
#BOX = 'opensuse/openSUSE-Tumbleweed-x86_64'
BOX = 'opensuse/openSUSE-42.3-x86_64'
# Set INSTALLATION to one of 'ceph-deploy', 'salt'
INSTALLATION = 'salt'
# Set CONFIGURATION to one of 'default', 'small', 'iscsi' or 'economical'
#CONFIGURATION = 'default'
CONFIGURATION = 'tiny'
#CONFIGURATION = 'dataonmon'
I also had to hack lib/settings.rb to work around the other issue I filed.
diff --git a/lib/settings.rb b/lib/settings.rb
index dc5fcb0..107607b 100644
--- a/lib/settings.rb
+++ b/lib/settings.rb
@@ -15,13 +15,13 @@ def common_settings(node, config, name)
 end
 def libvirt_settings(provider, config, name)
- provider.host = 'localhost'
- provider.username = 'root'
+ #provider.host = 'localhost'
+ #provider.username = 'root'
 # Use DSA key if available, otherwise, defaults to RSA
- provider.id_ssh_key_file = 'id_dsa' if File.exists?("#{ENV['HOME']}/.ssh/id_dsa")
- provider.connect_via_ssh = true
+# provider.id_ssh_key_file = 'id_dsa' if File.exists?("#{ENV['HOME']}/.ssh/id_dsa")
+# provider.connect_via_ssh = true
 # Libvirt pool and prefix value
 provider.storage_pool_name = 'default'
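Instead of commenting the lines out, the remote-libvirt settings could be made opt-in. The sketch below is hypothetical: the `VAGRANT_LIBVIRT_REMOTE` environment variable is an assumed name, not an existing vagrant-ceph option, and only the lines visible in the diff are reproduced (the rest of the original method is omitted). It uses `File.exist?`, since `File.exists?` was removed in Ruby 3.2.

```ruby
# Hypothetical variant of libvirt_settings: apply the "libvirt over SSH as
# root" settings only when VAGRANT_LIBVIRT_REMOTE (assumed name) is set.
# `env` defaults to ENV so existing three-argument callers keep working.
def libvirt_settings(provider, config, name, env = ENV)
  if env['VAGRANT_LIBVIRT_REMOTE']
    provider.host = 'localhost'
    provider.username = 'root'
    # Use DSA key if available, otherwise, defaults to RSA
    provider.id_ssh_key_file = 'id_dsa' if File.exist?("#{env['HOME']}/.ssh/id_dsa")
    provider.connect_via_ssh = true
  end
  # Libvirt pool and prefix value (always applied, as in the original)
  provider.storage_pool_name = 'default'
end
```

With the variable unset, this behaves like the hacked version; exporting it restores the original remote behaviour without editing the file.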
my local clone is up to date with GitHub's master
ok, checking...
meanwhile, could you please run salt-run state.orch ceph.stage.2
and maybe salt-run state.orch ceph.stage.2 --log-level=debug
to get more info on the error?
what does the prometheus-node_exporter log say? Does it point to the missing piece?
no such option: --log-level
try appending a simple '-l debug' instead
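If the stage has to be rerun several times while debugging, a small wrapper can keep the debug output on disk. This is a minimal sketch, not part of DeepSea: it only shells out to the same `salt-run state.orch ceph.stage.N` command quoted in this thread, using `-l debug` since `--log-level` was rejected here, and the log file name is an arbitrary choice.

```ruby
require 'open3'

# Build the salt-run argv for a DeepSea stage, optionally with '-l debug'.
def stage_command(stage, debug: true)
  cmd = ['salt-run', 'state.orch', "ceph.stage.#{stage}"]
  cmd += ['-l', 'debug'] if debug
  cmd
end

# Run a stage on the admin node and save combined stdout/stderr to a file
# (file name chosen for this sketch). Returns true if salt-run exited 0.
def run_stage(stage, log_path: "stage#{stage}-debug.log")
  output, status = Open3.capture2e(*stage_command(stage))
  File.write(log_path, output)
  status.success?
end
```

For example, `run_stage(2)` reruns stage 2 and leaves its output in `stage2-debug.log`.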
ok, I can reproduce it. looks like something changed in the packages.
listing the journalctl logs...
Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Service hold-off time over, scheduling restart.
Jan 29 18:23:55 admin systemd[1]: Stopped Prometheus exporter for machine metrics.
Jan 29 18:23:55 admin systemd[1]: Started Prometheus exporter for machine metrics.
Jan 29 18:23:55 admin node_exporter[14757]: node_exporter: error: unknown short flag '-c', try --help
Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Main process exited, code=exited, status=1/FAILURE
Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Unit entered failed state.
Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Failed with result 'exit-code'.
Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Service hold-off time over, scheduling restart.
Jan 29 18:23:55 admin systemd[1]: Stopped Prometheus exporter for machine metrics.
Jan 29 18:23:55 admin systemd[1]: Started Prometheus exporter for machine metrics.
Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Main process exited, code=exited, status=1/FAILURE
Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Unit entered failed state.
Jan 29 18:23:55 admin systemd[1]: prometheus-node_exporter.service: Failed with result 'exit-code'.
Jan 29 18:23:56 admin systemd[1]: prometheus-node_exporter.service: Service hold-off time over, scheduling restart.
Jan 29 18:23:56 admin systemd[1]: Stopped Prometheus exporter for machine metrics.
Jan 29 18:23:56 admin systemd[1]: prometheus-node_exporter.service: Start request repeated too quickly.
Jan 29 18:23:56 admin systemd[1]: Failed to start Prometheus exporter for machine metrics.
Jan 29 18:23:56 admin systemd[1]: prometheus-node_exporter.service: Unit entered failed state.
Jan 29 18:23:56 admin systemd[1]: prometheus-node_exporter.service: Failed with result 'start-limit'.
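Most of the journal above is systemd's restart loop; the only message from the exporter itself is the "unknown short flag '-c'" line. A minimal sketch for separating the two, assuming journalctl's default short output format (`<date> <host> <ident>[<pid>]: <message>`):

```ruby
# Matches journalctl's default "short" format; captures host, ident, message.
JOURNAL_LINE = /\A\w{3} +\d+ [\d:]+ (\S+) ([^\[:]+)(?:\[\d+\])?: (.*)\z/

# Split a unit's journal into the service's own errors and systemd noise.
def summarize_unit_log(lines)
  errors = []
  restarts = 0
  final = nil
  lines.each do |line|
    m = JOURNAL_LINE.match(line.chomp) or next
    ident, msg = m[2], m[3]
    if ident == 'systemd'
      # Count restart attempts and remember the last failure result.
      restarts += 1 if msg.include?('scheduling restart')
      final = msg[/Failed with result '([^']+)'/, 1] || final
    else
      errors << msg # anything not logged by systemd comes from the service
    end
  end
  { errors: errors.uniq, restarts: restarts, final_result: final }
end
```

Applied to the excerpt above, this reports a single distinct error (`node_exporter: error: unknown short flag '-c', try --help`), three scheduled restarts, and a final result of `start-limit`, i.e. systemd gave up after the exporter kept rejecting its arguments.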
OK, for some reason node_exporter now expects the parameters in /etc/sysconfig/prometheus-node_exporter to use the long "--" prefix instead of the short "-": https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/exporters/node_exporter.sls#L12
Looks like that is due to the latest update to prometheus-node_exporter.
I can work around the node_exporter failure by manually editing /srv/salt/ceph/monitoring/prometheus/exporters/node_exporter.sls after installation so that the ARGS line reads:
ARGS="--collector.diskstats.ignored-devices=^(ram|loop|fd)\d+$ \
--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/) \
--collector.textfile.directory=/var/lib/prometheus/node-exporter"
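The edit above amounts to turning each single-dash long option into its double-dash form; with a single dash, the new parser reads `-collector...` as the short flag `-c`, hence the error in the journal. A minimal sketch of that rewrite, assuming every option in ARGS is a long option and that option values contain no whitespace (true for the values shown):

```ruby
# Prefix an extra '-' to single-dash long options ("-collector.x=y"),
# leaving "--" options, short flags like "-c", and plain values alone.
def to_double_dash(args)
  # Splitting with a capture group keeps the original whitespace tokens.
  args.split(/(\s+)/).map { |tok|
    tok.match?(/\A-[[:alpha:]][\w.-]+/) ? "-#{tok}" : tok
  }.join
end
```

The rewrite is idempotent, so running it over an already-converted ARGS line leaves it unchanged.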
But https://build.opensuse.org/package/show/server:monitoring/golang-github-prometheus-alertmanager has also changed, and now DeepSea somehow fails to update the alertmanager.yml config. I don't know how to work around that yet.
@jan--f please take a look. Either something is wrong with alertmanager.yml, so the process can't parse it:
admin:/home/vagrant # prometheus-alertmanager
level=info ts=2018-01-29T18:50:05.89021018Z caller=main.go:141 msg="Starting Alertmanager" version="(version=, branch=, revision=)"
level=info ts=2018-01-29T18:50:05.89025143Z caller=main.go:142 build_context="(go=go1.9.2, user=, date=)"
level=info ts=2018-01-29T18:50:05.890622859Z caller=main.go:279 msg="Loading configuration file" file=/etc/prometheus/alertmanager.yml
level=error ts=2018-01-29T18:50:05.890967371Z caller=main.go:282 msg="Loading configuration file failed" file=/etc/prometheus/alertmanager.yml err="unknown fields in global: hipchat_url"
or maybe DeepSea should generate it, but that didn't happen.
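The new alertmanager refuses the whole file because of one key it no longer recognizes under `global` (`hipchat_url`). One possible stopgap, not something proposed in this thread, is to strip that key before starting the service. A minimal sketch, assuming the HipChat integration is unused so dropping the key loses nothing:

```ruby
require 'yaml'

# Remove the given keys from the top-level `global` section of an
# alertmanager.yml document and return the re-serialized YAML.
# Assumption: the dropped keys are not needed by any receiver.
def drop_global_keys(yaml_text, keys)
  cfg = YAML.safe_load(yaml_text) || {}
  global = cfg['global']
  keys.each { |k| global.delete(k) } if global.is_a?(Hash)
  cfg.to_yaml
end
```

Usage would be along the lines of `File.write(path, drop_global_keys(File.read(path), ['hipchat_url']))` for `/etc/prometheus/alertmanager.yml`; note that comments in the original file are not preserved by a YAML round trip.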
yeah, there seems to be a problem... I'll create a ref issue for DeepSea.
@denisok Is there an issue with the alertmanager config? At a glance I don't see any change.
@jan--f looks like... at least prometheus-alertmanager fails to start.
Easy to reproduce with BOX = 'opensuse/openSUSE-42.3-x86_64'.
The upstream issue was closed.
after running stages 0 and 1, I get failures with the Prometheus node exporter.