scylladb / scylla-ansible-roles

Ansible roles for deploying and managing Scylla, Scylla-Manager and Scylla-Monitoring

[ansible-scylla-node]: parametrise a wait time for a node to boot up #149

Closed vladzcloudius closed 1 year ago

vladzcloudius commented 2 years ago

HEAD: a561f45dfdc0c618577e21361d7e645bc65dcc5e

Description: the code in question is this:

    - name: Wait for CQL port on seeders
      wait_for:
        port: 9042
        host: "{{ listen_address }}"
      when: broadcast_address in scylla_seeds or inventory_hostname in scylla_seeds

    - name: Start scylla non-seeds nodes serially
      run_once: true
      include_tasks: start_one_node.yml
      loop: "{{ groups['scylla'] }}"
      when:
        - item not in scylla_seeds
        - hostvars[item]['broadcast_address'] not in scylla_seeds

And then start_one_node.yml has:

    ---
    - name: start scylla on {{ item }}
      service:
        name: scylla-server
        state: started
      become: true
      delegate_to: "{{ item }}"

    # Wait for at most 2 hours for a node to start - bootstrapping and the corresponding streaming can take quite long
    - name: Wait for CQL port on {{ hostvars[item]['listen_address'] }}
      wait_for:
        port: 9042
        host: "{{ hostvars[item]['listen_address'] }}"
        timeout: 7200
      delegate_to: "{{ item }}"

This effectively means that we wait up to 300 seconds (the wait_for default timeout) for a seeder and up to 7200 seconds (2 hours) for a non-seeder.

Sometimes 5 minutes for a seeder and 2 hours for a non-seeder is not enough, especially when a seeder is added to an existing cluster and its seeder list includes itself and some other node. This was once a non-recommended configuration because it would have skipped streaming, but these days it works as expected.

And even 2h may not be enough because there may be a lot of data to stream.

We should parametrize these two values, or probably replace them with a single variable.
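
A minimal sketch of the single-variable option, assuming a role default named scylla_bootstrap_wait_time_sec (a hypothetical name used here for illustration only, not taken from the role) that both wait_for tasks would read:

    # defaults/main.yml
    # Hypothetical variable: maximum time in seconds to wait for a node's CQL port
    scylla_bootstrap_wait_time_sec: 7200

    # Seeder wait task, now with an explicit timeout instead of the 300 s default
    - name: Wait for CQL port on seeders
      wait_for:
        port: 9042
        host: "{{ listen_address }}"
        timeout: "{{ scylla_bootstrap_wait_time_sec | int }}"
      when: broadcast_address in scylla_seeds or inventory_hostname in scylla_seeds

    # start_one_node.yml, with the hard-coded 7200 replaced by the same variable
    - name: Wait for CQL port on {{ hostvars[item]['listen_address'] }}
      wait_for:
        port: 9042
        host: "{{ hostvars[item]['listen_address'] }}"
        timeout: "{{ scylla_bootstrap_wait_time_sec | int }}"
      delegate_to: "{{ item }}"

With something like this, a cluster with a lot of data to stream could raise the value in its inventory or group_vars without touching the role.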

vladzcloudius commented 2 years ago

@tarzanek FYI

vladzcloudius commented 1 year ago

@igorribeiroduarte could you check whether this has already been fixed? I think it has.

igorribeiroduarte commented 1 year ago

@vladzcloudius The default timeout was increased but not parametrized. I opened a PR that parametrizes it: https://github.com/scylladb/scylla-ansible-roles/pull/197