rancher / terraform-provider-rancher2

Terraform Rancher2 provider
https://www.terraform.io/docs/providers/rancher2/
Mozilla Public License 2.0

[BUG] #1314

Open matttrach opened 4 months ago

matttrach commented 4 months ago

Rancher Server Setup

Information about the Cluster

User Information

Provider Information

Describe the bug

When generating an air-gapped downstream vSphere cluster, provisioning gets stuck downloading an image from the internet. The image is specifically boot2docker from rancheros-vmware. The image URL is provided in the boot2docker_url attribute of the node template (rancher2_node_template). It seems the creation_type attribute must be set to "template" for this use case. The default value for this attribute is "legacy", so is creation_type now mandatory?

creation_type - (Optional) Creation type when creating a new virtual machine. Supported values: vm, template, library, legacy. Default legacy. From Rancher v2.3.3 (string)
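For illustration, a minimal sketch of a node template that sets creation_type explicitly instead of relying on the provider default. The resource name, variable, and clone_from path are hypothetical placeholders, and whether this setting alone avoids the image download in an air-gapped environment is not confirmed here.

```hcl
# Assumed to be supplied by the caller; hypothetical variable name.
variable "cloud_credential_id" {
  type = string
}

# Hypothetical resource and template names; adjust to your environment.
resource "rancher2_node_template" "vsphere_airgap" {
  name                = "vsphere-airgap"
  cloud_credential_id = var.cloud_credential_id

  vsphere_config {
    # Set explicitly rather than relying on the provider default ("legacy").
    creation_type = "template"

    # Placeholder path of an existing VM template to clone.
    clone_from = "/Datacenter/vm/templates/node-template"
  }
}
```

Setting the value explicitly also keeps the configuration stable if the provider default changes in a later release.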

To Reproduce

Generate vSphere cluster using rancher2_node_template.

Actual Result

error downloading the image, failed to provision

Expected Result

no error, provisioning works with default options

matttrach commented 4 months ago

information from the Rancher docs: https://ranchermanager.docs.rancher.com/reference-guides/cluster-configuration/downstream-cluster-configuration/node-template-configuration/vsphere#about-vm-creation-methods

matttrach commented 4 months ago

It has a couple of warnings there:

Ensure that the OS ISO URL field contains the URL of a VMware ISO release for RancherOS (rancheros-vmware.iso). Note that this URL must be accessible from the nodes running your Rancher server installation

matttrach commented 4 months ago

@abhishekhpatil10, would you mind confirming this information from the user?

abhishekhpatil10 commented 4 months ago

Thanks Matt. Yes, the code has worked for the user in the past, and it worked now after a second run. They saw the error only during the first run. Their Rancher server cannot access that URL; it is not accessible from any node.

matttrach commented 4 months ago

Sounds like this is covered in the docs then: the Rancher server installation needs to be able to access the URL. Glad to hear everything is working for them now. Please let me know if anything else is necessary.

abhishekhpatil10 commented 4 months ago

The only question the user has is: is it mandatory to set creation_type = "template" in the code?

matttrach commented 3 months ago

I am so sorry, I misinterpreted. Here is the code defining the structure for that argument: https://github.com/rancher/terraform-provider-rancher2/blob/v4.1.0/rancher2/schema_node_template_vsphere.go#L101-L107 You can see it is set as optional with a default value.

Optional:     true,
Default:      vmwarevsphereConfigCreationTypeDefault,

matttrach commented 3 months ago

@abhishekhpatil10 please let me know if there is anything else I can help with. If not, I will close this issue one week from now on 3/20/2024.

Martin-Weiss commented 3 months ago

Any idea why we had the failure only during the first run, but it has been working with every further run and the change for the template? Regarding 100% air-gapped: nothing changed.

And it seems the documentation might not be correct, as it has been working since the second run and this change.

matttrach commented 3 months ago

Well, it could be a cache miss, a dropped connection, filesystem I/O errors, a random bug in vSphere, a timeout in the vSphere API; it could be many things... I know companies often want in-depth post-mortems and RCAs, but that will need to be conducted by someone on their side with full access to the information involved.

I wish I could give more, sorry!

matttrach commented 3 months ago

This could be the change that they are experiencing: https://github.com/rancher/terraform-provider-rancher2/commit/ed867957635a56042cad7658f0ef1c8220a7ec46 or it could be this one: https://github.com/rancher/terraform-provider-rancher2/commit/6f97f5e3499cd9b7b73340796a7cd3e4b8d0ee9b

Both of these are non-breaking additions to that template which enable changes which occurred in the Rancher API.

Martin-Weiss commented 3 months ago

In our case the problem was that the provisioning job created by Rancher got stuck trying to download the ISO. This should not even have been attempted, as we are 100% air-gapped, and I am not sure whether the change in Terraform caused the problem to go away or whether something in Rancher is different on a second run. I really believe we need QA to test 100% air-gapped scenarios properly.

matttrach commented 2 months ago

Terraform providers should not alter the experience of the Rancher API. This tool enables programmatic control of the API around the context of an "object" (because an object is typically the context for multiple REST endpoints). The programmatic access specifically focuses on allowing a workflow which developers find most comfortable with version control and CI/CD (thus Config As Code).

It sounds to me like the user is unhappy with how Rancher behaved given the inputs that they gave and the environment that they are in. I agree that Rancher should not attempt to look for an image and have to wait for a timeout when in airgapped situations and the image is not available. This behavior could be occurring from how Rancher is configured or from a missing feature in Rancher, or from a bug in the Rancher code (if the behavior shouldn't be happening).

I see a few solutions to this:

Neither of these solutions are changes in how this repo is written, tested, or deployed. As a maintainer of this project and other Terraform projects I can look into the module if this is something the user would like, but as I said before it would take time and effort before something like this is available.

matttrach commented 2 months ago

This was discussed further in an internal issue, and the resolution we came to is to change the default value in the provider to match the default set in Rancher. Since this is a breaking change for users who rely on the current default, we are going to defer this change to the next major version of the provider, which would be version 5.x.
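For users who rely on the current default, one possible way to avoid picking up the change unintentionally is to constrain the provider to the 4.x line until the configuration sets creation_type explicitly. This is only a sketch; the exact constraint is an assumption based on the v4.1.0 reference above.

```hcl
terraform {
  required_providers {
    rancher2 = {
      source = "rancher/rancher2"
      # Assumed constraint: allows 4.1 and later 4.x releases, excludes 5.x.
      version = "~> 4.1"
    }
  }
}
```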

matttrach commented 2 months ago

This issue will track the progress of the change.