StephenJackStephen opened this issue 6 years ago
Hi. You need to modify the manifest by adding the nginx.ssl_* properties: https://github.com/bosh-prometheus/prometheus-boshrelease/blob/master/jobs/nginx/spec#L49-L55. My plan for the new version of this pipeline is to deploy everything with HTTPS enabled (and likely with HTTP disabled), but I haven't done that yet.
Thanks for the response, mkuratczyk! I added those properties to my local.yml.
Hi. I've just updated the node version and the GitHub orgs (things were moved from cloudfoundry-community to bosh-prometheus). You can give it a shot now.
If you are using PCF 1.12 I'd recommend switching to https://github.com/pivotal-cf/pcf-prometheus-pipeline. It's been relatively well tested now. It still deploys nginx without SSL so that's still something you'd need to change (I want to do HTTPS-only everywhere but haven't done that yet).
I generated the (self-signed) ssl cert & private key and added those to the local.yml as follows, but I'm getting an error indicating "Failed updating instance nginx/0".
```yaml
nginx:
  vm_type: ${vm_type_micro}
  vm_password: ${vm_password}
  static_ips: ${nginx_ip}
  local_properties:
    ssl_only: false
    ssl_cert: -----BEGIN CERTIFICATE-----
```
I'm a total newbie, so my apologies if I'm doing something blatantly incorrect. Your guidance would be greatly appreciated!
Update to my previous post...I decided to reset my baseline by using your pipeline to deploy a fresh/default Prometheus deployment. The nginx/0 instance is deploying successfully, but I am running into an error with the prometheus/0 instance update.
```
Failed updating instance prometheus > prometheus/6174144c-6916-4c02-b20e-90fa79e26a7b (0) (canary): 'prometheus/0 (6174144c-6916-4c02-b20e-90fa79e26a7b)' is not running after update. Review logs for failed jobs: prometheus, blackbox_exporter, bosh_exporter, cf_exporter, firehose_exporter, collectd_exporter, consul_exporter, github_exporter, graphite_exporter, haproxy_exporter, nats_exporter, pushgateway, rabbitmq_exporter, redis_exporter, statsd_exporter, node_exporter (00:01:29)

Error 400007: 'prometheus/0 (6174144c-6916-4c02-b20e-90fa79e26a7b)' is not running after update. Review logs for failed jobs: prometheus, blackbox_exporter, bosh_exporter, cf_exporter, firehose_exporter, collectd_exporter, consul_exporter, github_exporter, graphite_exporter, haproxy_exporter, nats_exporter, pushgateway, rabbitmq_exporter, redis_exporter, statsd_exporter, node_exporter

Task 2063200 error

For a more detailed error report, run: bosh task 2063200 --debug
```
With a BOSH message like this (`... is not running after update`) you need to check the logs of the jobs running on that VM (the ones listed in the message). Run `bosh instances -p` to see which process is failing, then `bosh ssh prometheus` and look in `/var/vcap/sys/log/<failing_job>/`. Once we know what's wrong we can try to fix it.
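The suggested steps, as a shell sketch (`<failing_job>` is a placeholder for whichever process `bosh instances -p` reports as failing):

```shell
bosh instances -p                  # per-process state; shows which job is failing
bosh ssh prometheus                # land on the VM
sudo -i                            # monit and the logs need root
monit summary                      # confirm which process is down
cd /var/vcap/sys/log/<failing_job>
tail -n 100 *.log                  # look for the actual error
```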
Agreed, and I tried that yesterday and this morning, but I think something else is fundamentally wrong here. I can't `bosh ssh prometheus` because I can't target the deployment with `bosh deployment prometheus.yml`: when I run `bosh download manifest prometheus ./prometheus.yml`, the resulting ./prometheus.yml file comes back blank/empty. I'm going to try to bypass this by SSHing directly to the prometheus [AWS EC2] instance. I'll post my findings shortly.
- `prometheus.stderr.log`: the error log indicates a problem with blackbox_exporter, but...
- `blackbox_exporter.stderr.log`: no errors in any of the log files for the blackbox_exporter job.
- `bosh_exporter.stderr.log`: no errors in the log files for the bosh_exporter job.
- `cf_exporter.stderr.log`: no errors in the log files for the cf_exporter job.
- `firehose_export.stderr.log`: no errors in the log files for the firehose_exporter job.
- `collectd_export.stderr.log`: no errors in the log files for the collectd_exporter job.
- `consul_exporter.stderr.log`: a "Can't query consul" error in the log files for the consul_exporter job.
- `github_exporter.log`: no log entries at all for the github_exporter job.
- `graphite_exporter.stderr.log`: no errors in the log files for the graphite_exporter job.
- `haproxy_exporter_stderr.log`: a "can't scrape HAProxy" error in the log files for the haproxy_exporter job.
- `nats_exporter.stderr.log`: a "could not retrieve NATS metrics" error in the log files for the nats_exporter job.
- `pushgateway.stderr.log`: no errors in the log files for the pushgateway job.
- `rabbitmq_exporter.stderr.log`: an "Error while retrieving data from rabbitHost" error in the log files for the rabbitmq_exporter job.
- `redis_exporter.stderr.log`: a "redis err: dial redis: unknown network redis" error in the log files for the redis_exporter job.
- `statsd_exporter.stderr.log`: an "address already in use" fatal error in the log files for the statsd_exporter job.
- `node_exporter.stderr.log`: no errors in the log files for the node_exporter job.
And here's the debug output from the bosh task: bosh_task2063200--debug.txt
1. I got BOSH CLI v2 working.
2. Using BOSH CLI v2, I was able to target my prometheus deployment and `bosh ssh` to the prometheus instance.
3. Below is the output of `monit summary`:
```
prometheus/6174144c-6916-4c02-b20e-90fa79e26a7b:~# monit summary
The Monit daemon 5.2.5 uptime: 23h 27m

Process 'prometheus'           running
Process 'blackbox_exporter'    running
Process 'bosh_exporter'        running
Process 'cf_exporter'          running
Process 'firehose_exporter'    running
Process 'collectd_exporter'    running
Process 'consul_exporter'      running
Process 'github_exporter'      running
Process 'graphite_exporter'    running
Process 'haproxy_exporter'     running
Process 'nats_exporter'        running
Process 'pushgateway'          running
Process 'rabbitmq_exporter'    running
Process 'redis_exporter'       running
Process 'statsd_exporter'      not monitored
Process 'node_exporter'        running
System 'system_localhost'      running
```
I checked the statsd_exporter logs again. This message repeats about every 45 seconds:

```
time="2017-12-07T15:38:26Z" level=fatal msg="listen tcp 0.0.0.0:9125: bind: address already in use" source="main.go:181"
```
I used `netstat | grep 9125` and confirmed that something is using that port:

```
prometheus/6174144c-6916-4c02-b20e-90fa79e26a7b:/var/vcap/sys/log/statsd_exporter# netstat | grep 9125
tcp        0      0 localhost:9125          localhost:52194         TIME_WAIT
```

I used `lsof -i :9125` to determine what was listening on that port:

```
prometheus/6174144c-6916-4c02-b20e-90fa79e26a7b:/var/vcap/sys/log/statsd_exporter# lsof -i :9125
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
rabbitmq 8847 root    3u  IPv4  22823      0t0  TCP *:9125 (LISTEN)
```

Double-confirmed with `netstat -peanut | grep 9125`:

```
prometheus/6174144c-6916-4c02-b20e-90fa79e26a7b:/var/vcap/sys/log/statsd_exporter# netstat -peanut | grep 9125
tcp        0      0 0.0.0.0:9125     0.0.0.0:*          LISTEN     0  22823  8847/rabbitmq_expor
tcp        0      0 127.0.0.1:9125   127.0.0.1:58138    TIME_WAIT  0  0      -
```
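So rabbitmq_exporter is already bound to statsd_exporter's port (9125), and statsd_exporter crash-loops when its own bind fails. A minimal sketch of that failure mode in Python (the sockets and port are illustrative stand-ins for the two exporters, not the actual processes):

```python
import errno
import socket

# First socket plays rabbitmq_exporter's role: it grabs the port
# and starts listening.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))        # pick any free port
port = first.getsockname()[1]
first.listen(1)

# Second socket plays statsd_exporter's role: binding the same
# address/port without SO_REUSEADDR fails with EADDRINUSE, which is
# exactly the "bind: address already in use" fatal error in the log.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))
    result = "bound"
except OSError as exc:
    result = "EADDRINUSE" if exc.errno == errno.EADDRINUSE else f"errno {exc.errno}"
finally:
    second.close()

print("second bind:", result)
first.close()
```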
I'm pretty sure you don't need statsd_exporter so you can simply remove it from your manifest.
I commented out the statsd_exporter job from the prometheus release within my prometheus.yml manifest. The pipeline reran successfully... I am super pumped at this point! Thanks for helping me get this far! I think I'm ready to put those nginx.ssl_* properties into place now. I'll give that a go and provide an update shortly.
I'm getting so close! OK, I added the `ssl_cert` and `ssl_key` properties to local.yml and redeployed. It's failing with this error:

```
Error 400007: 'nginx/0 (75eb11f6-3498-4c7f-b66a-31bbd0025c2b)' is not running after update. Review logs for failed jobs: nginx, node_exporter
```
I 'bosh ssh' to nginx/0 and find the following errors in nginx.stderr.log:
```
2017/12/07 21:59:37 [emerg] 23299#0: PEM_read_bio_X509_AUX("/var/vcap/jobs/nginx/config/ssl_cert.pem") failed (SSL: error:0906D06C:PEM routines:PEM_read_bio:no start line:Expecting: TRUSTED CERTIFICATE)
```
Regarding this error, I've confirmed that my crt and key are valid and match, but they are self-signed. I also double-checked that the crt and the key values are mated to their respective properties (not switched by accident, as has been pointed out on some posts/blogs). I also checked the nginx.conf file, and all looks correct, but I've attached for your review. Thanks! nginx.conf.txt
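For reference, one way to verify a PEM pair the way nginx will read it is with openssl. This is a sketch using a throwaway self-signed pair in /tmp (the paths and CN are hypothetical; on the real VM you would point the checks at /var/vcap/jobs/nginx/config/ssl_cert.pem and ssl_key.pem):

```shell
# Generate a throwaway self-signed cert/key pair, as the poster did:
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=nginx.example.test" \
  -keyout /tmp/ssl_key.pem -out /tmp/ssl_cert.pem 2>/dev/null

# Each file must parse as PEM. The "no start line" error above means
# nginx could not find a "-----BEGIN ...-----" header in the file.
openssl x509 -noout -subject -in /tmp/ssl_cert.pem
openssl rsa  -noout -check   -in /tmp/ssl_key.pem

# The cert and key moduli must match, or nginx rejects the pair:
cert_mod=$(openssl x509 -noout -modulus -in /tmp/ssl_cert.pem | openssl md5)
key_mod=$(openssl rsa  -noout -modulus -in /tmp/ssl_key.pem  | openssl md5)
[ "$cert_mod" = "$key_mod" ] && echo "cert and key match"
```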
I guess it's a manifest (YAML) formatting issue. The properties should look like this:

```yaml
ssl_cert: |
  -----BEGIN CERTIFICATE-----
  <ascii_chars>
  -----END CERTIFICATE-----
ssl_key: |
  -----BEGIN RSA PRIVATE KEY-----
  <ascii_chars>
  -----END RSA PRIVATE KEY-----
```
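In context, under the nginx block from the earlier local.yml snippet, that would look roughly like this (a sketch; placeholders as above). The `|` starts a YAML literal block scalar: every following line indented past the key is kept verbatim, newlines included, which is exactly what a multi-line PEM needs:

```yaml
nginx:
  local_properties:
    ssl_only: false
    ssl_cert: |
      -----BEGIN CERTIFICATE-----
      <ascii_chars>
      -----END CERTIFICATE-----
    ssl_key: |
      -----BEGIN RSA PRIVATE KEY-----
      <ascii_chars>
      -----END RSA PRIVATE KEY-----
```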
That did the trick! Thanks so much for your help on this...can't thank you enough!
After last night's success, I took the next step by attempting to turn the cert & key property values into variables, as follows:
```yaml
ssl_cert: |
  ${nginx_ssl_cert}
ssl_key: |
  ${nginx_ssl_key}
```
and also tried this syntax:
```yaml
ssl_cert: ${nginx_ssl_cert}
ssl_key: ${nginx_ssl_key}
```
Both resulted in the following error during deploy:

```
Interpolating...
invalid argument for flag `-l, --vars-file' (expected []template.VarsFileArg): Deserializing variables file 'local.yml': yaml: line 13: could not find expected ':'
Exit code 1
```
The reason I'm trying to use variables is that I want to use the same pipeline code for multiple environments, and I eventually want to put the keys into Vault. I've been searching and trying to figure out the problem before posting here again, but I can't seem to find anything that steers me in the right direction, so, again, your guidance is greatly appreciated! If this should be a new issue thread, please let me know. Thanks!
Variables should be in double parentheses, as in ((my_variable)).
I should have mentioned that I tried that as well:
```yaml
ssl_cert: ((nginx_ssl_cert))
ssl_key: ((nginx_ssl_key))
```
This resulted in the /var/vcap/jobs/nginx/config/*.pem files being populated with the literal "((variable))" text:

```
nginx/75eb11f6-3498-4c7f-b66a-31bbd0025c2b:/var/vcap/jobs/nginx/config$ cat *cert*
((nginx_ssl_cert))
nginx/75eb11f6-3498-4c7f-b66a-31bbd0025c2b:/var/vcap/jobs/nginx/config$ cat *key*
((nginx_ssl_key))
```
I observed that the local.yml file had other variables with the ${variable} syntax, as in:

```yaml
nginx:
  vm_type: ${vm_type_micro}
  vm_password: ${vm_password}
  static_ips: ${nginx_ip}
  local_properties:
    ssl_only: false
    ssl_cert: ${nginx_ssl_cert}
    ssl_key: ${nginx_ssl_key}
grafana:
  https_port: 443
  http_port: 80
Well, it seems like you did everything right, but you said this value should come from a variable and then didn't specify the value for that variable. Why do you want to make it a variable in local.yml? local.yml is meant to be exactly the file where you define the values of your variables, so if you use a variable in local.yml, you need `bosh -v` or `bosh -l` to specify the value. I'd rather just put the value in local.yml, as you did before.
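If the goal is to keep the PEMs out of local.yml entirely, one option is the BOSH CLI's `--var-file` flag, which reads a single variable's value from a file and copes with multi-line values like certificates. A sketch (the deployment name and file paths are hypothetical):

```shell
# Keep ((nginx_ssl_cert)) / ((nginx_ssl_key)) as variables in the
# manifest, then supply the PEM contents from files at deploy time:
bosh -d prometheus deploy prometheus.yml \
  -l local.yml \
  --var-file nginx_ssl_cert=certs/nginx.crt \
  --var-file nginx_ssl_key=certs/nginx.key
```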
Basically, I'm mimicking what's being done to pass the IP of the nginx instance. In local.yml, there is a property `static_ips:` with the variable `${nginx_ip}` as its value. The value of `${nginx_ip}` is passed as the parameter `nginx_ip` from the deploy job of the pipeline, the value for which is in turn a variable `deploy_nginx_ip` defined in the --var-file being passed to the pipeline. I want to do the same thing with the cert and key: define them in the --var-file that I pass to the pipeline, have the values passed along the same way to local.yml, where they get interpolated into prometheus.yml and output as manifest.yml, then pushed into the nginx config.
The two primary reasons I'm trying to do it this way: reusing the same pipeline code across multiple environments, and eventually moving the keys into Vault.
I hope what I described makes sense...my apologies if I'm conflating various terminologies.
If you could share what you have, I could probably spot the problem; without that it's pretty hard. I can only suggest searching for nginx_ip and adding everything you need in all the places where nginx_ip appears (as you said, you want to mimic that). It appears in prometheus.yml and local.yml, but also in pipeline.yml and tasks/deploy.{yml,sh}. Make sure you didn't forget it somewhere.
Hello all, I successfully performed the pipeline deployment of prometheus-on-PCF, but NGINX doesn't appear to be listening on 443, so it isn't responding over HTTPS, though http://nginx_ip successfully redirects to http://nginx_ip/login. Can anybody point me at how to get NGINX listening on and responding to HTTPS?
Thanks!
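As a first check (a sketch; substitute your actual nginx IP), confirm whether nginx is bound to 443 on the VM at all, and whether it answers TLS from outside:

```shell
# On the nginx VM (bosh ssh nginx):
ss -tlnp | grep ':443'       # is any process listening on 443?

# From a workstation; -k because the cert is self-signed:
curl -vk https://<nginx_ip>/
```

If nothing is listening, the nginx.ssl_* properties likely never reached the rendered manifest; re-check the interpolation chain discussed earlier in this thread.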