Closed: sumitjadhav1 closed this issue 4 years ago
The cluster needs to be provisioned before creating any machines. I am not sure whether this fixes your problem, but the workflow is:
I am not sure whether you are using CAPM3 at all and, if you do, which version. But this is the way it is done in v1alpha3 when using CAPM3.
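In case it helps, here is a minimal sketch of that ordering as it looks in metal3-dev-env. The script names under scripts/v1alphaX are assumptions on my side and may differ between versions, so check your checkout:

```sh
# Hypothetical sequence; the exact script names under scripts/v1alphaX may
# differ between metal3-dev-env versions, so check your checkout.
cd ~/metal3-dev-env/scripts/v1alphaX

# 1. Make sure introspection is finished and all BareMetalHosts are in the
#    "ready" provisioning state before touching CAPI/CAPM3.
kubectl get baremetalhosts -n metal3

# 2. Provision the cluster object first (Cluster + Metal3Cluster).
./provision_cluster.sh

# 3. Only then create the machines: control plane first, then workers.
./provision_controlplane.sh
./provision_worker.sh
```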
@sumitjadhav1 Regarding your first issue, could you clarify your operating configuration, the configuration you're supplying, and the command you're reporting that you have to execute? Ideally you shouldn't have to do this, but the reality seems to be that some BMCs, depending on the driver, protocol, and ultimately BMC firmware, subtly require different approaches, and without fully understanding all aspects of the context it is difficult for us to help. In other words, more information would help us understand.
[2] There are two notions mixed here. The baremetal network is the network for your target cluster, i.e. what your nodes will be using for Kubernetes setup. The fact that vbmc instances were on this network is only a design decision. The constraint about vbmc nodes (and BMCs in general) is that they must be reachable from Ironic, but they do not have to be on the baremetal network. And you don't need to configure that BMC network anywhere. It will work as long as your traffic from Ironic to BMC is routed properly.
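For example, a quick way to confirm whether routing is the problem is something like the following rough sketch. The BMC address here is a made-up placeholder for an iDRAC on a separate management network:

```sh
# Run this from the host where the Ironic containers run.
# BMC_IP is a hypothetical iDRAC address on a separate management network.
BMC_IP=192.168.222.10

# Does the provisioning host have a route towards the BMC network?
ip route get "$BMC_IP"

# Is the Redfish endpoint reachable? (iDRAC serves Redfish over HTTPS.)
curl -k -s -o /dev/null -w '%{http_code}\n' "https://$BMC_IP/redfish/v1/"
```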
Now the error you are seeing is unrelated to the BMCs and Ironic. "failed to create remote cluster client: failed to create client for workload cluster metal3/test1: Get https://192.168.111.249:6443/api?timeout=30s: dial tcp 192.168.111.249:6443: connect: no route to host" means that CAPI or CAPM3 is trying to talk to your target cluster after provisioning. 192.168.111.249:6443 is the load-balancer port to reach the API server. If your provisioning was not successful, or your deployment failed, then that load balancer won't be reachable and you will get this error. But it is unrelated to BMCs.
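If you want to see where it is stuck, something along these lines can help. This is only a sketch, assuming the metal3 namespace and the endpoint from the error message:

```sh
# Is the control-plane machine actually provisioned? Check the CAPI/CAPM3
# objects and the BareMetalHosts in the management cluster.
kubectl get clusters,machines -n metal3
kubectl get baremetalhosts -n metal3

# Can the management host reach the workload API endpoint at all?
curl -k -m 5 https://192.168.111.249:6443/healthz || echo "API endpoint not reachable yet"
```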
Hi @juliakreger,
Please let me know if you need any additional info.
1. You should not mix the provision_host.sh script and the CAPM3-based scripts. Both are doing more or less the same thing under the hood (provisioning a node). If you want to use CAPI and CAPM3, use the scripts under ./scripts/v1alphaX only once your environment is ready (BMHs Ready); see the sketch below.
2. This is the right approach. Did you try to apply your workaround again? Maybe that is something you need to do every time.
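As a rough sketch, a small guard like this could be put in front of the v1alphaX scripts; it assumes the standard BareMetalHost status fields and the metal3 namespace:

```sh
# Wait until every BareMetalHost reports the "ready" provisioning state
# before invoking the v1alphaX provisioning scripts.
while kubectl get baremetalhosts -n metal3 \
        -o jsonpath='{range .items[*]}{.status.provisioning.state}{"\n"}{end}' \
      | grep -q -v -w 'ready'; do
  echo "waiting for all BareMetalHosts to reach the ready state..."
  sleep 10
done
echo "all BareMetalHosts are ready; safe to run the cluster scripts"
```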
Thanks for the clarification. Yes, we're using the correct approach (2) now (BMHs Ready, then use the cluster scripts).
Unfortunately we can't apply the workaround at runtime, because the node in Ironic is locked against any updates once provisioning/deployment has already started.
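For reference, this is roughly how the lock shows up on the Ironic side (a sketch; <node> is a placeholder for the Ironic node name or UUID):

```sh
# While a deploy is in progress the node is reserved by a conductor and
# updates are rejected, which is why the workaround has to be applied
# before provisioning starts. <node> is the Ironic node name or UUID.
openstack baremetal node show <node> -f value -c provision_state -c reservation
```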
Thanks for this clarification as well. Okay, so these errors are expected as long as the provisioning/deployment fails, which is the blocker for us now. We would appreciate help on this front.
- Due to an iDRAC firmware bug when using Redfish, the user must set "force_persistent_boot_device=Never" (openstack baremetal node set <node> --driver-info force_persistent_boot_device=Never) before starting node deployment. We had applied this workaround and were able to deploy the node successfully (node in Active state in Ironic / Provisioned state in Metal3).
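Since the workaround has to be in place on every node before provisioning starts, a small loop along these lines may save some typing (a sketch; it assumes the OpenStack CLI is already configured to talk to the Ironic instance used by metal3-dev-env):

```sh
# Apply the iDRAC/Redfish workaround to every registered Ironic node
# before any provisioning is started.
for node in $(openstack baremetal node list -f value -c UUID); do
  openstack baremetal node set "$node" \
    --driver-info force_persistent_boot_device=Never
done
```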
Does that flag change Ironic's behavior, or does Ironic pass it to the BMC to change the host's behavior?
@dhellmann It changes Ironic's behavior so that it does not assert persistent boot flags. The bug @sumitjadhav1 is speaking of is that when the BMC receives the flag, it unexpectedly returns an error. That being said, I've heard from my Dell contacts that the bug is expected to be fixed in the very next iDRAC firmware release, since it previously worked just fine.
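If it helps when debugging, the flag lives in the node's driver_info, so it is easy to confirm it is set before deployment (a sketch; <node> is a placeholder):

```sh
# Show the node's driver_info; force_persistent_boot_device=Never should
# appear there once the workaround has been applied. <node> is a placeholder.
openstack baremetal node show <node> -f json -c driver_info
```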
Thanks for the details, @juliakreger.
I'm not sure how much work we should do in metal3 to work around bugs in firmware. I wouldn't, for example, want to expose an API to let the user control the force_persistent_boot_device flag explicitly. I could see us always setting the flag to false, but I don't know what side effects we might end up with from that, so I wouldn't want to take that step lightly.
Update: 10/04/2020
This activity was on hold for a couple of weeks (due to internal testing of the hardware-classification-controller); we will try to update once we have results. We are currently reading the design in https://github.com/metal3-io/metal3-docs/pull/78 for more details (as suggested in the last community meeting).
We are now able to provision Dell PowerEdge server nodes using the cluster scripts provided by Metal3. We also used the IPAM feature. Testing was done for nodes in both UEFI and BIOS boot modes. This verification was done after the following fixes:
However, we are currently observing issues in the node de-provisioning step using the deprovision_worker script (the node goes into the clean_failed state). We will perform a couple of rounds of provisioning and de-provisioning; if the problem is consistently reproducible, we will create a new issue.
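For the record, a clean_failed node usually carries the failure reason in its last_error field; a rough way to capture it when reproducing (a sketch; <node> and <bmh-name> are placeholders, and the metal3 namespace is assumed):

```sh
# The Ironic node records why cleaning failed.
openstack baremetal node show <node> -f value -c provision_state -c last_error

# The corresponding BareMetalHost also surfaces an error message in its status.
kubectl -n metal3 get baremetalhost <bmh-name> -o jsonpath='{.status.errorMessage}{"\n"}'
```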
Hence closing this issue now.
[1] We are currently testing Metal3/BMO and cluster provisioning on Dell PowerEdge servers. We have successfully completed the steps up to node introspection (BM nodes in Ready state). The next step is BM node provisioning and cluster creation. We are using two approaches, as follows:
First approach: node provisioning using the provision_host.sh script, then cluster creation using the cluster scripts:
-> Here the BM node state changes from Ready to Provisioned successfully (workaround: we need to configure one boot device parameter in Ironic using the CLI before starting node provisioning).
-> After this we start with the cluster scripts available under ~/metal3-dev-env/scripts/v1alphaX.
-> The issue we are facing here is that none of the available BM nodes is picked for the cluster initiation process (to be made the master node).
-> Our doubt: is this the right approach for cluster creation and provisioning?
Second approach: BM nodes in Ready state, then use the cluster scripts:
-> Here the BM nodes are in Ready state after successful introspection. We also apply the workaround stated in approach 1 in Ironic for all available nodes.
-> Now we use the cluster scripts for cluster creation, and they pick one of the available nodes for node provisioning (to change from Ready to Provisioned state).
-> This operation gets stuck in Ironic with the same error for which we had already applied the workaround at the beginning.
-> Our doubt: node provisioning works in approach 1, whereas here it fails with an error in Ironic even though we have the workaround in place.
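To illustrate how the selection can be inspected on the BareMetalHost objects (a rough sketch, assuming the default metal3 namespace; <bmh-name> is a placeholder):

```sh
# Which BareMetalHost, if any, has been claimed by a Metal3Machine, and in
# what provisioning state is it?
kubectl -n metal3 get baremetalhosts \
  -o custom-columns=NAME:.metadata.name,STATE:.status.provisioning.state,CONSUMER:.spec.consumerRef.name

# Full status, including any error message propagated from Ironic.
kubectl -n metal3 describe baremetalhost <bmh-name>
```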
[2] Another doubt regarding the baremetal network (192.168.111.0/24), which is used for the vBMC nodes created in the Metal3 setup:
-> Do we need to change this IP range (the cluster API endpoint is defined in this range) in the lib/common.sh file in order to use the iDRAC network available for the Dell hardware? We observed the error below in our logs:
"failed to create remote cluster client: failed to create client for workload cluster metal3/test1: Get https://192.168.111.249:6443/api?timeout=30s: dial tcp 192.168.111.249:6443: connect: no route to host"
-> Need help on this front.