orange-cloudfoundry / cassandra-boshrelease


v0.6: nodes randomly fail to start with password mismatch #62

Open gberche-orange opened 6 years ago

gberche-orange commented 6 years ago

We randomly observe the following symptom with the v0.6 release. We're not yet sure whether the root cause lies in the environment (bosh director or infrastructure) or in the cassandra 0.6 release itself.

Task 128129 | 07:27:33 | Updating instance cassandra-seeds: cassandra-seeds/41da6c3c-0d85-4049-b5b6-7ee8b34f6cfa (0) (canary) (00:01:48)
                      L Error: Action Failed get_task: Task 7b430e3e-cf68-4a3f-43c4-d2e16c9fbfc8 result: 1 of 2 post-start scripts failed. Failed Jobs: cassandra. Successful Jobs: bosh-dns.
Task 128129 | 07:29:34 | Updating instance cassandra-servers: cassandra-servers/e2699534-66fa-4372-b18c-82d963c2ff4f (0) (canary) (00:02:05)
                      L Error: Action Failed get_task: Task f2c8e89c-dc44-4a26-5ae0-8a8fb18a70af result: 1 of 2 post-start scripts failed. Failed Jobs: cassandra. Successful Jobs: bosh-dns.

However, after a few minutes, the deployment status shows the cassandra nodes as running (only the broker is still failing):

$ bosh instances
Using environment '192.168.99.155' as client 'xx'

Task 128310. Done

Deployment 'c_072dd24d-c2aa-486c-88c6-6c362ae4f609'

Instance                                                Process State  AZ  IPs  
cassandra-broker/2f5ccb89-83d1-4d59-bb87-8254daeb694a   failing        z1  192.168.211.34  
cassandra-seeds/41da6c3c-0d85-4049-b5b6-7ee8b34f6cfa    running        z1  192.168.211.25  
cassandra-seeds/7e405a54-1acb-4fdd-a015-8650651b1e1a    running        z1  192.168.211.31  
cassandra-seeds/f3fc209a-7117-49e7-9924-5c0d84f3b5fa    running        z1  192.168.211.32  
cassandra-servers/e2699534-66fa-4372-b18c-82d963c2ff4f  running        z1  192.168.211.33  

5 instances
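For anyone who wants to check this state repeatedly while the deployment settles, here is a minimal sketch that filters the table printed by `bosh instances` for rows that are not `running`. It only assumes the column layout shown above (instance name first, process state second); the function name `non_running_instances` is made up for illustration, and multi-word process states would be truncated to their first word.

```python
def non_running_instances(table_text):
    """Return (instance, state) pairs whose process state is not 'running'."""
    results = []
    for line in table_text.splitlines():
        fields = line.split()
        # Instance rows look like 'group/uuid  state  az  ip'.
        # Header and footer lines have no '/' in their first field.
        if len(fields) >= 2 and "/" in fields[0]:
            instance, state = fields[0], fields[1]
            if state != "running":
                results.append((instance, state))
    return results


sample = """\
Instance                                                Process State  AZ  IPs
cassandra-broker/2f5ccb89-83d1-4d59-bb87-8254daeb694a   failing        z1  192.168.211.34
cassandra-seeds/41da6c3c-0d85-4049-b5b6-7ee8b34f6cfa    running        z1  192.168.211.25
cassandra-servers/e2699534-66fa-4372-b18c-82d963c2ff4f  running        z1  192.168.211.33

5 instances
"""
print(non_running_instances(sample))
```

With the output above, only the cassandra-broker row is reported.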

Looking at /var/vcap/sys/log/cassandra/post-start.stderr.log on cassandra-seeds/41da6c3c-0d85-4049-b5b6-7ee8b34f6cfa, we repeatedly see:

2018-07-03_07:39:53: DEBUG: setting first password, exit status: '1'
2018-07-03_07:39:53: INFO: verifying that the current password is the desired password
Connection error: ('Unable to connect to any servers', {'192.168.211.25': AuthenticationFailed('Failed to authenticate to 192.168.211.25: Error from server: code=0100 [Bad credentials] message="Provided username cassandra and/or password are incorrect"',)})
2018-07-03_07:39:53: DEBUG: verifying current password, exit status: '1'
2018-07-03_07:39:53: ERROR: the password for user 'cassandra' is inconsistent. Aborting.
2018-07-03_07:44:21: INFO: reached Cassandra on '192.168.211.25:9042' after '14' attemps. Waiting 30 more seconds for the service to be available.

2018-07-03_07:44:51: INFO: setting first password
Connection error: ('Unable to connect to any servers', {'192.168.211.25': AuthenticationFailed('Failed to authenticate to 192.168.211.25: Error from server: code=0100 [Bad credentials] message="Provided username cassandra and/or password are incorrect"',)})
2018-07-03_07:44:52: DEBUG: setting first password, exit status: '1'
2018-07-03_07:44:52: INFO: verifying that the current password is the desired password
Connection error: ('Unable to connect to any servers', {'192.168.211.25': AuthenticationFailed('Failed to authenticate to 192.168.211.25: Error from server: code=0100 [Bad credentials] message="Provided username cassandra and/or password are incorrect"',)})
2018-07-03_07:44:52: DEBUG: verifying current password, exit status: '1'
2018-07-03_07:44:52: ERROR: the password for user 'cassandra' is inconsistent. Aborting.
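To see how often the script aborts, the log excerpt above can be summarized with a small parser. This is only a sketch: the line format (`timestamp: LEVEL: message`) is inferred from the excerpt, and the `summarize` helper is a hypothetical name, not something shipped with the release.

```python
import re
from collections import Counter

# Matches lines such as "2018-07-03_07:44:52: ERROR: the password ... Aborting."
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}): ([A-Z]+): (.*)$")


def summarize(log_text):
    """Count messages per level and collect the timestamps of aborts."""
    levels = Counter()
    aborts = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if not match:
            continue  # skips the un-timestamped "Connection error: ..." lines
        timestamp, level, message = match.groups()
        levels[level] += 1
        if level == "ERROR" and "Aborting" in message:
            aborts.append(timestamp)
    return levels, aborts


excerpt = """\
2018-07-03_07:44:52: INFO: verifying that the current password is the desired password
Connection error: ('Unable to connect to any servers', {'192.168.211.25': AuthenticationFailed('Failed to authenticate to 192.168.211.25: Error from server: code=0100 [Bad credentials] message="Provided username cassandra and/or password are incorrect"',)})
2018-07-03_07:44:52: DEBUG: verifying current password, exit status: '1'
2018-07-03_07:44:52: ERROR: the password for user 'cassandra' is inconsistent. Aborting.
"""
levels, aborts = summarize(excerpt)
print(dict(levels), aborts)
```

Running it over the whole post-start.stderr.log would show whether the aborts line up with the failed update tasks.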

The associated bosh deployment manifest follows:

---
instance_groups:
- azs:
  - z1
  env:
    bosh:
      remove_dev_tools: true
      swap_size: 0
  instances: 3
  jobs:
  - consumes:
      seeds:
        from: deployment-seeds
    name: cassandra
    properties:
      cass_KSP: "((!cassandra_key_store_pass))"
      cass_pwd: "((!cassandra_admin_password))"
      cassandra_ssl_YN: false
      client_encryption_options:
        enabled: false
        optional: true
        require_client_auth: false
      cluster_name: cluster
      heap_newsize: 1G
      max_heap_size: 6G
      num_tokens: 256
      server_encryptions:
        internode_encryption: none
      topology:
      - 10.8.32.60=DC1:RAC1
      - 10.8.32.61=DC1:RAC1
      - 10.8.32.62=DC1:RAC1
      - 10.8.32.63=DC1:RAC1
      validate_ssl_TF: false
    provides:
      seeds:
        as: deployment-seeds
    release: cassandra
  name: cassandra-seeds
  networks:
  - name: tf-net-coab-depls-instance
  persistent_disk_type: xlarge
  stemcell: trusty
  vm_type: large
- azs:
  - z1
  env:
    bosh:
      remove_dev_tools: true
      swap_size: 100
  instances: 1
  jobs:
  - consumes:
      seeds:
        from: deployment-seeds
    name: cassandra
    properties:
      cass_KSP: "((!cassandra_key_store_pass))"
      cass_pwd: "((!cassandra_admin_password))"
      cassandra_ssl_YN: false
      client_encryption_options:
        enabled: false
        optional: true
        require_client_auth: false
      cluster_name: cluster
      heap_newsize: 1G
      max_heap_size: 6G
      num_tokens: 256
      server_encryptions:
        internode_encryption: none
      topology:
      - 10.8.32.60=DC1:RAC1
      - 10.8.32.61=DC1:RAC1
      - 10.8.32.62=DC1:RAC1
      - 10.8.32.63=DC1:RAC1
      validate_ssl_TF: false
    release: cassandra
  name: cassandra-servers
  networks:
  - name: tf-net-coab-depls-instance
  persistent_disk_type: xlarge
  stemcell: trusty
  vm_type: large
- azs:
  - z1
  instances: 1
  jobs:
  - name: broker-smoke-tests
    properties:
      cf:
        admin:
          password: "((/secrets/cloudfoundry_admin_password))"
          username: admin
        api:
          url: https://api.((/secrets/cloudfoundry_system_domain))
        cassandra:
          appdomain: "((/secrets/cloudfoundry_apps_domain))"
          serviceinstancename: cassandra-instance
          servicename: cassandra
          serviceplan: default
        org: service-sandbox
        skip:
          ssl:
            validation: true
        space: cassandra-smoke-tests
    release: cassandra
  - consumes:
      seeds:
        from: deployment-seeds
    name: broker
    properties:
      broker:
        password: "((/secrets/cloudfoundry_service_brokers_cassandra_password))"
        user: cassandra-broker
      cassandra_seed:
        admin_password: "((!cassandra_admin_password))"
    release: cassandra
  - name: route-registrar
    properties:
      route_registrar:
        external_host: cassandra-broker-c_ee617363-8821-43da-8034-efb2d9343654.((!/secrets/cloudfoundry_system_domain))
        health_checker:
          interval: 10
          name: healthchk
        message_bus_servers:
        - host: "((/secrets/cloudfoundry_nats_host)):4222"
          password: "((/secrets/cloudfoundry_nats_password))"
          user: nats
        port: 8080
    release: route-registrar
  name: cassandra-broker
  networks:
  - name: tf-net-coab-depls-instance
  persistent_disk_type: xlarge
  stemcell: trusty
  vm_type: large
name: c_ee617363-8821-43da-8034-efb2d9343654
releases:
- name: cassandra
  version: '6'
- name: route-registrar
  version: '3'
stemcells:
- alias: trusty
  os: ubuntu-trusty
  version: '3468.25'
update:
  canaries: 1
  canary_watch_time: 30000-240000
  max_in_flight: 1
  serial: false
  update_watch_time: 30000-240000
variables:
- name: cassandra_admin_password
  type: password
- name: cassandra_key_store_pass
  type: password

Additional information, the releases uploaded to the director:

$ bosh releases
Using environment '192.168.99.155' as client 'xx'

Name              Version  Commit Hash  
bosh-dns          0.2.0*   304d6ca  
cassandra         6*       33952d4  
mongodb-services  3*       688f3ec  
node-exporter     1.1.0    d2706592+  
os-conf           19*      22510c5  
prometheus        21.1.0*  75e3e4b  
route-registrar   3*       f7132692+  
syslog            11*      0e06601  
weave-scope       0.0.17*  f0cc5de2+  

(*) Currently deployed
(+) Uncommitted changes

9 releases

/CC @JCL38-ORANGE @poblin-orange

bgandon commented 6 years ago

This error means the password for the cassandra admin user is neither the default one ('cassandra') nor the one that the deployment specifies.
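As a rough sketch of that condition (not the actual post-start script, whose code isn't shown in this thread), the three possible states can be written as below. `can_login` is a hypothetical callable standing in for a real login attempt against the node, and the state names are invented for illustration.

```python
def classify_password_state(can_login, default_password, desired_password):
    """Classify the admin password state.

    can_login is a hypothetical callable: can_login(password) -> bool.
    """
    if can_login(desired_password):
        return "desired"       # already set as the deployment specifies
    if can_login(default_password):
        return "default"       # fresh node, the first password can still be set
    return "inconsistent"      # neither works: the case reported in the log


# Simulate a node whose stored password matches neither candidate.
stored = "some-older-password"
state = classify_password_state(lambda pw: pw == stored, "cassandra", "new-password")
print(state)
```

The "inconsistent" branch corresponds to the "password for user 'cassandra' is inconsistent. Aborting." line in the log above.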

This typically happens when you try to set a new value for the cassandra_password property. In v6, the value of this property can never be changed once it has been set.

You must upgrade to v8 in order to benefit from the password update feature (#48) for the cassandra admin user.

Plus, there is now a dedicated cassandra-deployment Git repo for v8, which makes writing a COA deployment much easier. Everything is easier with v8. You should not use v6 anymore.

If you don't mind, we'll close this issue.