vmware-archive / prometheus-on-PCF

This is a how-to for deploying https://github.com/cloudfoundry-community/prometheus-boshrelease to monitor Pivotal Cloud Foundry.
Apache License 2.0

nginx not listening on https #17

Open StephenJackStephen opened 6 years ago

StephenJackStephen commented 6 years ago

Hello all, I successfully performed the pipeline deployment of prometheus-on-PCF, but NGINX doesn't appear to be listening on 443, so it isn't responding on https, though http://nginxip successfully redirects to http://nginxip/login. Can anybody direct me on how to get NGINX listening/responding on HTTPS?

Thanks!
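For reference, a quick way to check what nginx is actually listening on (a minimal sketch; <nginx_ip> is a placeholder for the nginx VM's address):

# on the nginx VM
sudo ss -tlnp | grep -E ':80|:443'

# or from another host with network access
curl -v http://<nginx_ip>/
curl -vk https://<nginx_ip>/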

mkuratczyk commented 6 years ago

Hi. You need to modify the manifest by adding nginx.ssl_* properties: https://github.com/bosh-prometheus/prometheus-boshrelease/blob/master/jobs/nginx/spec#L49-L55. My plan for the new version of this pipeline is to deploy everything with HTTPS enabled (and likely with HTTP disabled) but haven't done that yet.
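For illustration, a rough sketch of those properties in a manifest (the 'properties:'/'nginx:' nesting is an assumption about an old-style manifest; in this pipeline the equivalent values end up in local.yml, as discussed further down the thread):

properties:
  nginx:
    ssl_cert: |
      -----BEGIN CERTIFICATE-----
      <ascii_chars>
      -----END CERTIFICATE-----
    ssl_key: |
      -----BEGIN RSA PRIVATE KEY-----
      <ascii_chars>
      -----END RSA PRIVATE KEY-----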

StephenJackStephen commented 6 years ago

Thanks for the response, mkuratczyk! I added those properties to my /pipeline/tasks/etc/local.yml and redeployed. The pipeline deployment failed on the "deploy" job with an error indicating "Release version 'node-exporter/1.1.0' doesn't exist". I figured out that this is because runtime.yml specifies version 1.1.0 while the "upload-release" job pulls the latest release from the "node-exporter-boshrelease" GitHub repo. So my question now is: should I force a download of node-exporter release 1.1.0, or should release 3.0.0 be compatible?

mkuratczyk commented 6 years ago

Hi. I've just updated the node-exporter version and the GitHub org references (things were moved from cloudfoundry-community to bosh-prometheus). You can give it a shot now.

If you are using PCF 1.12 I'd recommend switching to https://github.com/pivotal-cf/pcf-prometheus-pipeline. It's been relatively well tested now. It still deploys nginx without SSL so that's still something you'd need to change (I want to do HTTPS-only everywhere but haven't done that yet).

StephenJackStephen commented 6 years ago

I generated the (self-signed) SSL cert & private key and added them to local.yml as follows, but I'm getting an error indicating "Failed updating instance nginx/0".

nginx:
  vm_type: ${vm_type_micro}
  vm_password: ${vm_password}
  static_ips: ${nginx_ip}
  local_properties:
    ssl_only: false
    ssl_cert: -----BEGIN CERTIFICATE----------END CERTIFICATE-----
    ssl_key: -----BEGIN RSA PRIVATE KEY----------END RSA PRIVATE KEY-----
    grafana:
      https_port: 443
      http_port: 80

I'm a total newbie, so my apologies if I'm doing something blatantly incorrect. Your guidance would be greatly appreciated!

StephenJackStephen commented 6 years ago

Update to my previous post...I decided to reset my baseline by using your pipeline to deploy a fresh/default Prometheus deployment. The nginx/0 instance is deploying successfully, but I am running into an error with the prometheus/0 instance update.

Failed updating instance prometheus > prometheus/6174144c-6916-4c02-b20e-90fa79e26a7b (0) (canary): 'prometheus/0 (6174144c-6916-4c02-b20e-90fa79e26a7b)' is not running after update. Review logs for failed jobs: prometheus, blackbox_exporter, bosh_exporter, cf_exporter, firehose_exporter, collectd_exporter, consul_exporter, github_exporter, graphite_exporter, haproxy_exporter, nats_exporter, pushgateway, rabbitmq_exporter, redis_exporter, statsd_exporter, node_exporter (00:01:29)

Error 400007: 'prometheus/0 (6174144c-6916-4c02-b20e-90fa79e26a7b)' is not running after update. Review logs for failed jobs: prometheus, blackbox_exporter, bosh_exporter, cf_exporter, firehose_exporter, collectd_exporter, consul_exporter, github_exporter, graphite_exporter, haproxy_exporter, nats_exporter, pushgateway, rabbitmq_exporter, redis_exporter, statsd_exporter, node_exporter

Task 2063200 error

For a more detailed error report, run: bosh task 2063200 --debug

mkuratczyk commented 6 years ago

With a BOSH message like this (... not running after update) you need to check the logs of the jobs running on that VM (the ones listed in the message). You can do 'bosh instances -p' to see which process is failing, then 'bosh ssh prometheus' and go to /var/vcap/sys/log/<failing_job>/. Once we know what's wrong we can try to fix it.
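Roughly, that workflow looks like this (a sketch assuming BOSH CLI v2 and a deployment named 'prometheus'; the v1 commands quoted above are the equivalents):

bosh -d prometheus instances --ps          # per-process state; the broken process shows as 'failing'
bosh -d prometheus ssh prometheus/0        # SSH onto the prometheus VM
cd /var/vcap/sys/log/<failing_job>         # each job writes logs to its own directory
tail -n 100 *.stderr.log *.stdout.log      # inspect that job's stderr/stdout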

StephenJackStephen commented 6 years ago

Agreed. I tried that yesterday and this morning, but I think something else is fundamentally wrong here. I'm unable to 'bosh ssh prometheus' because I can't target the deployment with 'bosh deployment prometheus.yml': when I do a 'bosh download manifest prometheus ./prometheus.yml', the resulting ./prometheus.yml file ends up blank/empty. I'm going to try to bypass this by doing SSH directly to the prometheus [AWS EC2] instance. I will post my findings shortly.

StephenJackStephen commented 6 years ago

prometheus.stderr.log The stderr log indicates a problem with blackbox_exporter, but...

blackbox_exporter.stderr.log The blackbox_exporter job has no errors in any of the log files.

bosh_exporter.stderr.log No errors in the log files for the bosh_exporter job.

cf_exporter.stderr.log No errors in the log files for the cf_exporter job.

firehose_export.stderr.log No errors in the log files for the firehose_exporter job.

collectd_export.stderr.log No errors in the log files for the collectd_exporter job.

consul_exporter.stderr.log There is a "Can't query consul" error in the log files for the consul_exporter job.

github_exporter.log No log entries at all for the github_exporter job.

graphite_exporter.stderr.log No errors in the log files for the graphite_exporter job.

haproxy_exporter_stderr.log There is a "can't scrape HAProxy" error in the log files for the haproxy_exporter job.

nats_exporter.stderr.log There is a "could not retrieve NATS metrics" error in the log files for the nats_exporter job.

pushgateway.stderr.log No errors in the log files for the pushgateway job.

rabbitmq_exporter.stderr.log There is an "Error while retrieving data from rabbitHost" error in the log files for the rabbitmq_exporter job.

redis_exporter.stderr.log There is a "redis err: dial redis: unknown network redis" error in the log files for the redis_exporter job.

statsd_exporter.stderr.log There is an "address already in use" fatal error in the log files for the statsd_exporter job.

node_exporter.stderr.log No errors in the log files for the node_exporter job.

StephenJackStephen commented 6 years ago

And here's the debug output from the bosh task. bosh_task2063200--debug.txt

mkuratczyk commented 6 years ago
  1. Please switch to BOSH CLI v2 :)
  2. 'bosh download manifest' returns nothing if the deployment has never completed successfully
  3. While on the vm, can you do '/var/vcap/bosh/bin/monit summary' ?
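For item 3, a minimal sketch (assuming BOSH CLI v2 and the deployment/instance names used above):

bosh -d prometheus ssh prometheus/0
sudo /var/vcap/bosh/bin/monit summary   # monit typically requires root on BOSH VMs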
StephenJackStephen commented 6 years ago

1) I got BOSH CLI v2 working. 2) Using BOSH CLI v2, I was able to target my prometheus deployment and do a 'bosh ssh' to the prometheus instance. 3) Below is the output of monit summary:

prometheus/6174144c-6916-4c02-b20e-90fa79e26a7b:~# monit summary
The Monit daemon 5.2.5 uptime: 23h 27m

Process 'prometheus'                 running
Process 'blackbox_exporter'          running
Process 'bosh_exporter'              running
Process 'cf_exporter'                running
Process 'firehose_exporter'          running
Process 'collectd_exporter'          running
Process 'consul_exporter'            running
Process 'github_exporter'            running
Process 'graphite_exporter'          running
Process 'haproxy_exporter'           running
Process 'nats_exporter'              running
Process 'pushgateway'                running
Process 'rabbitmq_exporter'          running
Process 'redis_exporter'             running
Process 'statsd_exporter'            not monitored
Process 'node_exporter'              running
System 'system_localhost'            running
prometheus/6174144c-6916-4c02-b20e-90fa79e26a7b:~#

I checked the statsd_exporter logs again. This message repeats about every 45 seconds:

time="2017-12-07T15:38:26Z" level=fatal msg="listen tcp 0.0.0.0:9125: bind: address already in use" source="main.go:181"

I used 'netstat | grep 9125' and confirmed that something is listening on that port:

prometheus/6174144c-6916-4c02-b20e-90fa79e26a7b:/var/vcap/sys/log/statsd_exporter# netstat | grep 9125
tcp        0      0 localhost:9125     localhost:52194     TIME_WAIT

I used 'lsof -i :9125' to determine what was listening on that port:

prometheus/6174144c-6916-4c02-b20e-90fa79e26a7b:/var/vcap/sys/log/statsd_exporter# lsof -i :9125
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
rabbitmq 8847 root    3u  IPv4  22823      0t0  TCP *:9125 (LISTEN)

Double-confirmed with 'netstat -peanut | grep 9125':

prometheus/6174144c-6916-4c02-b20e-90fa79e26a7b:/var/vcap/sys/log/statsd_exporter# netstat -peanut | grep 9125
tcp        0      0 0.0.0.0:9125      0.0.0.0:*          LISTEN      0   22823   8847/rabbitmq_expor
tcp        0      0 127.0.0.1:9125    127.0.0.1:58138    TIME_WAIT   0   0       -

mkuratczyk commented 6 years ago

I'm pretty sure you don't need statsd_exporter so you can simply remove it from your manifest.
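For illustration, a hedged sketch of that edit in the instance group's job list (job and release names other than statsd_exporter come from earlier in the thread; the exact manifest layout may differ):

jobs:                               # or 'templates:' in an older-style manifest
  - name: prometheus
    release: prometheus
  # - name: statsd_exporter        # removed: it clashes with rabbitmq_exporter on port 9125
  #   release: prometheus
  - name: rabbitmq_exporter
    release: prometheus
  - name: node_exporter
    release: node-exporter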

StephenJackStephen commented 6 years ago

I commented out the statsd_exporter job from the prometheus release within my prometheus.yml manifest. The pipeline reran successfully... I am super pumped at this point! Thanks for helping me get this far! I think I'm ready to put those nginx.ssl_* properties into place now. I will give that a go and provide an update shortly.

StephenJackStephen commented 6 years ago

I'm getting so close! OK, I added the 'ssl_cert' and 'ssl_key' properties to local.yml and re-pushed the deployment. It's failing with this error:

Error 400007: 'nginx/0 (75eb11f6-3498-4c7f-b66a-31bbd0025c2b)' is not running after update. Review logs for failed jobs: nginx, node_exporter

I did 'bosh ssh' to nginx/0 and found the following error in nginx.stderr.log:

2017/12/07 21:59:37 [emerg] 23299#0: PEM_read_bio_X509_AUX("/var/vcap/jobs/nginx/config/ssl_cert.pem") failed (SSL: error:0906D06C:PEM routines:PEM_read_bio:no start line:Expecting: TRUSTED CERTIFICATE)

Regarding this error, I've confirmed that my crt and key are valid and match, but they are self-signed. I also double-checked that the crt and the key values are mated to their respective properties (not switched by accident, as has been pointed out on some posts/blogs). I also checked the nginx.conf file, and all looks correct, but I've attached for your review. Thanks! nginx.conf.txt
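As an aside, a common way to double-check that a cert and key are a matching pair (a generic openssl sketch; the file names are placeholders):

openssl x509 -noout -modulus -in ssl_cert.pem | openssl md5
openssl rsa  -noout -modulus -in ssl_key.pem  | openssl md5
# the two digests should be identical if the cert and key belong together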

mkuratczyk commented 6 years ago

I guess it's a manifest (yaml) formatting issue. The properties should look like this:

ssl_cert: |
  -----BEGIN CERTIFICATE-----
  <ascii_chars>
  -----END CERTIFICATE-----

ssl_key: |
  -----BEGIN RSA PRIVATE KEY-----
  <ascii_chars>
  -----END RSA PRIVATE KEY-----
StephenJackStephen commented 6 years ago

That did the trick! Thanks so much for your help on this...can't thank you enough!

StephenJackStephen commented 6 years ago

After last night's success, I took the next step by attempting to turn the cert & key property values into variables, as follows:

ssl_cert: |
   ${nginx_ssl_cert}
ssl_key: |
   ${nginx_ssl_key}

and also tried this syntax:

ssl_cert: ${nginx_ssl_cert}
ssl_key: ${nginx_ssl_key}

Both resulted in the following error during deploy.

Interpolating...
invalid argument for flag `-l, --vars-file' (expected []template.VarsFileArg): Deserializing variables file 'local.yml': yaml: line 13: could not find expected ':'
Exit code 1

The reason I'm trying to use variables is that I want to use the same pipeline code for multiple environments, and I eventually want to put the keys into Vault. I've been searching and trying to figure out the problem before posting here again, but I can't seem to find anything that steers me in the right direction, so, again, your guidance is greatly appreciated! If this should be in a new issue thread, please let me know. Thanks!

mkuratczyk commented 6 years ago

Variables should be in double parentheses, as in ((my_variable))
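For illustration, a hedged sketch of how the ((var)) syntax pairs with a vars file (the file and variable names here are just examples):

# in the manifest / local.yml that gets interpolated
ssl_cert: ((nginx_ssl_cert))
ssl_key: ((nginx_ssl_key))

# in a vars file, e.g. vars.yml
nginx_ssl_cert: |
  -----BEGIN CERTIFICATE-----
  <ascii_chars>
  -----END CERTIFICATE-----
nginx_ssl_key: |
  -----BEGIN RSA PRIVATE KEY-----
  <ascii_chars>
  -----END RSA PRIVATE KEY-----

# supplied at interpolation/deploy time
bosh interpolate manifest.yml -l vars.yml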

StephenJackStephen commented 6 years ago

I should have mentioned that I tried that as well:

ssl_cert: ((nginx_ssl_cert))
ssl_key: ((nginx_ssl_key))

This resulted in the /var/vcap/jobs/nginx/config/*.pem files being populated with the literal "((variable))" strings:

nginx/75eb11f6-3498-4c7f-b66a-31bbd0025c2b:/var/vcap/jobs/nginx/config$ cat *cert*
((nginx_ssl_cert))
nginx/75eb11f6-3498-4c7f-b66a-31bbd0025c2b:/var/vcap/jobs/nginx/config$ cat *key*
((nginx_ssl_key))

I observed that the local.yml file had other variables with the ${variable} syntax, such as:

nginx:
  vm_type: ${vm_type_micro}
  vm_password: ${vm_password}
  static_ips: ${nginx_ip}
  local_properties:
    ssl_only: false
    ssl_cert: ${nginx_ssl_cert}
    ssl_key: ${nginx_ssl_key}
    grafana:
      https_port: 443
      http_port: 80
mkuratczyk commented 6 years ago

Well, it seems like you did everything fine, but you said this value should come from a variable and then never specified a value for that variable. Why did you want to make it a variable in local.yml? local.yml is meant to be exactly the file where you define the values for your variables, so if you use a variable in local.yml then you need to use 'bosh -v' or 'bosh -l' to specify its value. I'd rather just put the value in local.yml as you did before.

StephenJackStephen commented 6 years ago

Basically, I'm mimicking what's being done to pass the IP of the nginx instance. In the local.yml file, there is a static_ips: property with the variable ${nginx_ip} as its value. The value of ${nginx_ip} is passed as the parameter nginx_ip from the deploy job of the pipeline, whose value is in turn the variable deploy_nginx_ip, which is defined in the --var-file being passed to the pipeline. I want to do the same thing with the cert & key: define them in the --var-file that I pass to the pipeline, and have those values passed along the same way to local.yml, where they get interpolated into prometheus.yml and output as manifest.yml, then pushed into the nginx config.

Two primary reasons I'm trying to do it this way are

  1. We have multiple PCF foundations, so the way we handle variations in config parameters is by passing --var-file that correspond with the environment to which the pipeline is pushing.
  2. Ultimately, I need to put the cert & key into Vault to meet corporate security policy.

I hope what I described makes sense...my apologies if I'm conflating various terminologies.

mkuratczyk commented 6 years ago

If you could share what you have, I could probably spot the problem; without that it's pretty hard. I can only suggest searching for nginx_ip and adding everything you need to all the places where you find nginx_ip (as you said, you want to mimic that). It appears in prometheus.yml and local.yml but also in pipeline.yml and tasks/deploy.{yml,sh}. Make sure you didn't forget it somewhere.
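For example, a quick way to enumerate every place that needs touching (a simple shell sketch, run from the pipeline repo root):

grep -rn 'nginx_ip' .
# add nginx_ssl_cert / nginx_ssl_key alongside each occurrence, mirroring how nginx_ip is threaded through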