akrzos commented 6 years ago

During the last deploy, one Ceph node ended up not being cleaned. We need to make sure cleaning always runs against Ceph nodes, and against other nodes with multiple disks as well. We can start this task by examining how cleaning is currently initiated against nodes and investigating which method would be the least failure-prone.

akrzos commented 6 years ago

Grepping the introspection log shows that cleaning of the failed node was retried by the script and eventually aborted, in contrast to a node that was cleaned successfully. Although cleaning was aborted, the node still introspected successfully.

https://gist.githubusercontent.com/akrzos/2b84e3a5e26a86f6e63f84b2af2ca7ba/raw/6e79e8535138bbedcc612ed5fe55f62bece91f01/gistfile1.txt

akrzos commented 6 years ago

One suspect I have is exhaustion of IP addresses in the introspection range. It is unclear to me whether node cleaning uses DHCP addresses from the introspection pool or from the Overcloud DHCP pool. I doubt the pool was exhausted during node cleaning, since we appear to start cleaning on each node serially, with a 10-second sleep between each kick-off, and only 2 nodes failed cleaning (out of 83, which is almost 2x the size of the introspection address pool).

https://github.com/redhat-performance/tripleo-quickstart-scalelab/blob/master/roles/undercloud-prepare-host/templates/undercloud.11.conf.j2#L118

Either way, we should expand this to a pool large enough to cover a qnq-1 configuration with full lab node counts (>1000 IP addresses).
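As a rough sanity check on pool sizing, the sketch below counts the addresses in an inclusive start/end range and compares that to a node count. It assumes the worst case of one address held per node at the same time, which the thread does not confirm. The function names and the example ranges are hypothetical and are not taken from undercloud.conf.

```python
import ipaddress


def pool_size(start: str, end: str) -> int:
    """Number of addresses in the inclusive range start..end."""
    first = ipaddress.ip_address(start)
    last = ipaddress.ip_address(end)
    if last < first:
        raise ValueError("range end precedes range start")
    return int(last) - int(first) + 1


def covers(start: str, end: str, node_count: int) -> bool:
    """True if the range holds at least one address per node.

    Assumption: every node may hold an address simultaneously.
    """
    return pool_size(start, end) >= node_count


if __name__ == "__main__":
    # Hypothetical ranges, for illustration only.
    small = ("192.0.2.100", "192.0.2.140")
    large = ("10.0.0.10", "10.0.3.255")
    for name, (start, end) in (("small", small), ("large", large)):
        print(name, pool_size(start, end), covers(start, end, 1000))
```

With a check like this, a range that is smaller than the node count is easy to spot before a deploy, although a smaller pool may still be enough if leases are released between serial cleaning runs.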

akrzos commented 6 years ago

#53 expanded the ctlplane IP address range, but it is unclear whether the actual failure here was due to IP address exhaustion.

redhat-performance / scale-ci-tripleo

Investigate and "Robustify" node cleaning #48
