Open akrzos opened 6 years ago
Grepping the introspection log, we can see that the node that failed to clean was retried according to the script, and cleaning was eventually aborted, in contrast to a node that cleaned successfully. Although cleaning was aborted, the node still introspected successfully.
One suspect I have is exhaustion of IP addresses in the introspection range. It is unclear to me whether node cleaning uses DHCP addresses from the introspection pool or from the Overcloud DHCP pool. I doubt the pool was exhausted during node cleaning, since we appear to start cleaning on each node serially, with a 10-second sleep between each kick-off, and only 2 nodes failed cleaning (out of 83, which is almost 2x the size of the introspection address pool).
Either way, we should expand this to a pool large enough to cover a qnq-1 configuration with full lab node counts (>1000 IP addresses).
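As a quick sanity check on sizing, something like the sketch below could confirm whether a given DHCP range is large enough for a target node count. The function names and the address ranges are hypothetical and only illustrate the arithmetic; they are not taken from our actual undercloud configuration.

```python
import ipaddress


def pool_size(start: str, end: str) -> int:
    """Return the number of addresses in an inclusive DHCP range."""
    first = ipaddress.ip_address(start)
    last = ipaddress.ip_address(end)
    if last < first:
        raise ValueError("range end precedes range start")
    return int(last) - int(first) + 1


def pool_covers(start: str, end: str, node_count: int) -> bool:
    """True if the range has at least one address per node."""
    return pool_size(start, end) >= node_count


# Hypothetical ranges, for illustration only.
small = ("172.16.0.100", "172.16.0.141")  # 42 addresses
large = ("172.16.0.10", "172.16.4.10")    # 1025 addresses

print(pool_size(*small), pool_covers(*small, 83))
print(pool_size(*large), pool_covers(*large, 1000))
```

This only checks raw address counts. It does not account for lease times or for addresses held by nodes that have not yet released them, so the real requirement may be higher.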
In this past deploy, one Ceph node ended up not cleaned. We need to make sure cleaning always runs against Ceph nodes, and against other nodes with multiple disks as well. We can start this task by examining how cleaning is currently initiated against nodes and investigating which method would be the least failure-prone.