terraform workspace list/select doesn't always find existing workspace

ghost commented 2 years ago

Terraform Version

v1.0.1 on windows_amd64

AzureRM Provider Version

3.1.0

Affected Resource(s)/Data Source(s)

workspace

Terraform Configuration Files

We have 43 workspaces: default (which we don't use) plus 42 named test1a, test1b ... test1f through to test7f.
State files are saved in an Azure container along with state files for other resources.
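
For reference, the naming scheme above can be enumerated with a short Python sketch. It assumes the pattern is exactly test, then a digit from 1 to 7, then a letter from a to f; the helper name `workspace_names` is made up for illustration.

```python
from string import ascii_lowercase


def workspace_names(groups: int = 7, per_group: int = 6) -> list:
    """Build test1a..test7f, assuming the pattern test<digit><letter>."""
    return [
        f"test{n}{ascii_lowercase[i]}"
        for n in range(1, groups + 1)
        for i in range(per_group)
    ]


names = workspace_names()
# 7 groups of 6 letters give 42 named workspaces; default makes 43.
print(len(names), names[0], names[5], names[-1])  # prints: 42 test1a test1f test7f
```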

Debug Output/Panic Output

I run terraform workspace list or terraform workspace select but it can't always find an existing workspace. 

See gist for output
https://gist.github.com/beetimelywork/d8ada2b065d41a60ec9e1a6d56fd89e3
https://gist.github.com/beetimelywork/60f4310e0119c2bfabdaf83498191d4c

Expected Behaviour

If I run terraform workspace list, I'd expect all 42 workspaces to be listed. If I run terraform workspace select for any of the 42 workspaces, I'd expect it to be selected.

Actual Behaviour

As per the gist, the behavior is intermittent and sometimes correctly lists/selects the appropriate workspaces and sometimes doesn't.

This is causing our Pipelines which run terraform workspace select to intermittently fail.

Steps to Reproduce

Create the 42 workspaces as per gist.

Using a PowerShell console:

Authenticate with Azure
terraform init
terraform workspace select or terraform workspace new
Repeat. Notice the output/behavior isn't always the same.

Important Factoids

No response

References

No response

tombuildsstuff commented 2 years ago

Transferring this to the Core repository since this is related to the AzureRM Backend and not the AzureRM Provider

apparentlymart commented 2 years ago

I'm not really familiar with this particular backend, since the Azure provider team typically maintains it, but this does seem similar to a problem we had with other backends in the past where we didn't realize that the list API had some upper limit on number of items returned after which you need to make further requests to load additional "pages" of information. I wonder if something like that is going on here too.

apparentlymart commented 2 years ago

Since I had a little more time to spend on this today I took a look to see if my theory above seemed plausible.

The portion of code relevant to what I was describing is the Workspaces method of the Azure backend implementation. It's responsible for doing whatever API calls are necessary for that particular backend and then returning a flat list of available workspaces.

In the case of Azure that seems to use an API operation called List Blobs. This operation does seem to have an upper limit on results, but the documentation says that the upper limit is 5,000 items and that it will use that limit as the default value if no explicit limit is set in the request.

The containers.ListBlobsInput object constructed in the implementation does not populate the MaxResults field at all, causing it to be nil when passed to containers.ListBlobs. That should cause the client layer to omit that argument altogether in the request, and so the current implementation of the backend seems like it would produce a correct result for up to 5,000 workspaces. That's considerably more than the 43 described in the writeup.
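
To make the paging concern concrete, here is a minimal Python sketch of a listing loop that keeps requesting pages until no continuation marker comes back. The `list_page` function is a hypothetical in-memory stand-in, not the Azure SDK or the Go client the backend uses; only the 5,000-item limit is taken from the discussion above.

```python
from typing import List, Optional, Tuple

PAGE_LIMIT = 5000  # default and maximum items per response, as discussed above


def list_page(
    blobs: List[str], marker: Optional[int], max_results: Optional[int] = None
) -> Tuple[List[str], Optional[int]]:
    """Hypothetical stand-in for a single list request against an in-memory container."""
    # An unset max_results falls back to the service-side limit.
    limit = min(max_results or PAGE_LIMIT, PAGE_LIMIT)
    start = marker or 0
    page = blobs[start:start + limit]
    # A marker is returned only when more items remain.
    next_marker = start + limit if start + limit < len(blobs) else None
    return page, next_marker


def list_all(blobs: List[str]) -> List[str]:
    """Follow the continuation marker until the listing is exhausted."""
    names: List[str] = []
    marker: Optional[int] = None
    while True:
        page, marker = list_page(blobs, marker)
        names.extend(page)
        if marker is None:
            return names
```

A caller that reads only the first page sees at most 5,000 names, while `list_all` returns every name regardless of container size.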

However, I do notice that the backend code subsequently does client-side filtering of the result to skip over any object that doesn't have the specified prefix. That seems odd since the request already specified the prefix, and so presumably the server should already have filtered out the items without that prefix. It makes me wonder whether a container with a large number of items that don't match the prefix could starve out those which do, and thus still cause a short read if one of the relevant items ends up at an offset greater than 5,000 prior to filtering. :thinking:
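
The starvation scenario can be simulated in a few lines of Python. This is a sketch of the hypothesis only: the blob names (`logs/...`, `state/...`) and the helper functions are invented for illustration, the listing is assumed to come back in name order, and nothing here shows that the real service behaves this way. It also does not model the intermittent behaviour reported in the issue.

```python
PAGE_LIMIT = 5000  # per-response limit discussed above


def first_page(blobs, prefix=""):
    """One simulated list request: filter by prefix server-side, then cut at the page limit."""
    matching = sorted(name for name in blobs if name.startswith(prefix))
    return matching[:PAGE_LIMIT]


def client_filter(page, prefix):
    """Client-side post-filtering, similar in spirit to what the backend does."""
    return [name for name in page if name.startswith(prefix)]


# Hypothetical container: 4,980 unrelated blobs that sort before 42 state blobs.
unrelated = [f"logs/{i:05d}.txt" for i in range(4980)]
states = [f"state/test{n}{s}" for n in range(1, 8) for s in "abcdef"]
container = unrelated + states

# If the prefix were not applied server-side, the page fills up with unrelated blobs.
unfiltered_then_client = client_filter(first_page(container), "state/")
# If the prefix is applied server-side, all state blobs fit in one page.
server_filtered = client_filter(first_page(container, "state/"), "state/")

print(len(unfiltered_then_client), len(server_filtered))  # prints: 20 42
```

In this simulation only the first 20 state blobs survive when filtering happens after the page is cut, which is the kind of short read described above.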

With all of that said then: it seems that this disproves my original guess, but I think it could help to understand why the backend code does post-filtering of a list that seems like it should already be filtered, and whether that underlying cause could lead to unrelated blobs in the container pushing us over the 5,000 item response limit.

I have gone as far as I can with this because any further investigation would require access to Azure, so I'll leave this here for someone on the Azure provider team to tackle, as usual.

SachinL9 commented 2 years ago

We are facing a similar issue (there are 47 workspaces).

Are there any updates on whether this is fixed, or is there a workaround?

mattsears18 commented 1 year ago

I'm also having this problem now. Only 3 of my 8 workspaces are listed.

terraform v1.2.3
darwin_arm64
aws provider v4.18.0

mattsears18 commented 1 year ago

Update: I was able to get all of my workspaces to show up by running terraform workspace new <workspace name> for each workspace.
