scylladb / scylla-ansible-roles

Ansible roles for deploying and managing Scylla, Scylla-Manager and Scylla-Monitoring
44 stars 38 forks source link

scylla-node: read io.conf before io_properties.yaml #348

Closed RAttab closed 8 months ago

RAttab commented 8 months ago

Problem

Ran into this problem while trying to deploy a new cluster.

TASK [../../../roles/scylla-ansible-roles/ansible-scylla-node : Set io.conf] ******************************************************************************************************
fatal: [prod-scylla-nyj-1.example.com]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'io_conf'. 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'io_conf'\n\nThe error appears to be in '/Users/remi/code/infra/ansible/roles/scylla-ansible-roles/ansible-scylla-node/tasks/common.yml': line 143, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Set io.conf\n  ^ here\n"}

As per the Ansible documentation for blocks:

The directive does not affect the block itself, it is only inherited by the tasks enclosed by a block. For example, a when statement is applied to the tasks within a block, not to the block itself.

Since the io_conf fact is set after the step to set the io_properties fact, the when clause for the block that checks whether io_properties is set will prevent the io_conf fact from ever being set. This in turns causes the Set io.conf task to fail as io_conf was never set.

Solution

Doing the io.conf steps before the io.properties steps in the block solves the issue.

vladzcloudius commented 8 months ago

@RAttab could you, please, specify what was the HEAD hash with which you hit the issue above?

vladzcloudius commented 8 months ago

The PR makes little sense to TBH because the error you reported above was not supposed to happen if you are using the unmodified code.

Look at the when clause of task which you report as failing: https://github.com/scylladb/scylla-ansible-roles/blob/master/ansible-scylla-node/tasks/common.yml#L136:

- name: Set io.conf
  lineinfile:
    path: /etc/scylla.d/io.conf
    regexp: '^SEASTAR_IO='
    line: "{% if scylla_io_probe|bool and scylla_io_probe_dc_aware|bool %}{{ hostvars[dc_to_node_list[dc][0]]['io_conf'] }}{% else %}{{ io_conf }}{% endif %}"
    create: yes
  become: true
  when: io_conf is defined and (_io_conf_file.stat.exists|bool == false or _io_conf_out.stdout|length == 0 or always_replace_io_conf|bool)

Note the first condition: io_conf is defined so in case io_conf is undefined this whole task was supposed to be skipped. However in your report it is executed (and fails).

Could it be that you changed the code before executing?

Or maybe there is something else that makes it to fail - but not the fact that io_conf is undefined.

RAttab commented 8 months ago

I ran into the problem using the af92c08 HEAD with no additional modification.

Note that the error is not raised for the io_conf variable declared for the current node. The error is raised when accessing the io_conf variable of another node through hostvars[dc_to_node_list[dc][0]]['io_conf'] which is not checked by the when clause of that task.

Memory is a bit fuzzy but if I recall correctly the error didn't happen when I first executed the block. It started to happen on re-running the block after failures happened later in the role due to configuration mistakes. I'll see if I can reproduce the issue to give you a full step by step break down but that might take me a few days as I'm a bit time constraint at the moment.

I do know that my change fixes the issue and aligns a bit better with the when statement of the block. That being said, I also think the proper fix may be to adjust the when statement to include a check for the io_conf variable. I only had so much time to dedicate to this particular problem so I did the simplest and most obvious fix that would allow me to keep going.

vladzcloudius commented 8 months ago

I ran into the problem using the af92c08 HEAD with no additional modification.

Note that the error is not raised for the io_conf variable declared for the current node. The error is raised when accessing the io_conf variable of another node through hostvars[dc_to_node_list[dc][0]]['io_conf'] which is not checked by the when clause of that task.

Memory is a bit fuzzy but if I recall correctly the error didn't happen when I first executed the block. It started to happen on re-running the block after failures happened later in the role due to configuration mistakes. I'll see if I can reproduce the issue to give you a full step by step break down but that might take me a few days as I'm a bit time constraint at the moment.

I do know that my change fixes the issue and aligns a bit better with the when statement of the block. That being said, I also think the proper fix may be to adjust the when statement to include a check for the io_conf variable. I only had so much time to dedicate to this particular problem so I did the simplest and most obvious fix that would allow me to keep going.

This still doesn't make sense to me unless you used some rather unorthodox parametrization, e.g. defining per-host io_properties dictionary while omitting entries for some hosts.

The intention of the code piece in question is that values io_properties, io_conf, scylla_io_probe and scylla_io_probe_dc_aware are either not set to all nodes or set to all nodes to the same value.

If the above holds I don't understand how you can have a defined io_conf for some host and still have it undefined in the first host in the inventory from the same DC: hostvars[dc_to_node_list[dc][0]].

I would need more information about the exact Role parameters that were used when you saw the error above.

And I still don't understand what problem your patch actually resolves: the block in question simply reads the content of corresponding files.

As per the Ansible documentation for blocks:

The directive does not affect the block itself, it is only inherited by the tasks enclosed by a block. For example, a when statement is applied to the tasks within a block, not to the block itself.

For that reason exactly changing the order of execution of independent tasks in the same block is not going to change the final outcome: if the block's condition holds all tasks are going to be executed, if it doesn't - none will be executed.

A reproducer would be nice indeed.

RAttab commented 8 months ago

At the moment I don't have the time to spend on digging further into this so if you do not believe this is a problem I will simply close the PR and move on.