microsoft / AzureTRE

An accelerator to help organizations build Trusted Research Environments on Azure.
https://microsoft.github.io/AzureTRE
MIT License

Resource processor fails to deploy first workspace on fresh TRE deployment #3950

Closed jonnyry closed 1 month ago

jonnyry commented 1 month ago

I deployed a new TRE from current main, and the first workspace I attempted to create fails (I've noticed this on fresh deploys a couple of times now) - the resource processor fails to deploy the workspace with the following message:

1) Main step for 88350ad0-8701-4dee-9698-834c9af1c6c8
88350ad0-8701-4dee-9698-834c9af1c6c8: Error message: parameter "tre_id" is required ; Command executed: porter install "88350ad0-8701-4dee-9698-834c9af1c6c8" --reference XXXXXXXX.azurecr.io/tre-workspace-nwsde-data-engineering:v1.6.1 --force --credential-set arm_auth --credential-set aad_auth

Creation of subsequent workspaces all succeed.

Here are the logs for the failed run:

image

I'm pretty sure the get_porter_parameter_keys function is failing the first time around, and specifically on the porter explain line:

https://github.com/microsoft/AzureTRE/blob/ddbbffe70fc6a8fe5d0b430afc4c18116f7ff993/resource_processor/resources/commands.py#L107

It looks like az acr login has not been called at that point, so the registry server denies the request. However, porter install is still called even though the parameters were never built, and az acr login is called as part of that install step, which is why subsequent runs work.
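To illustrate the failure mode described above, here is a minimal sketch (not the actual AzureTRE code) of parsing the result of porter explain and refusing to carry on when it fails, rather than going ahead to porter install with no parameters. The function name `parse_porter_parameter_keys` and the `PorterExplainError` exception are hypothetical, and the JSON shape (a top-level `parameters` list of objects with a `name` field) is an assumption about the porter explain JSON output.

```python
import json
from typing import List


class PorterExplainError(RuntimeError):
    """Hypothetical error: `porter explain` did not produce usable output."""


def parse_porter_parameter_keys(returncode: int, stdout: str, stderr: str) -> List[str]:
    """Return the bundle's parameter names, or raise if explain failed."""
    # A non-zero exit code (e.g. the registry denied the request) should stop
    # the operation here instead of letting install run without parameters.
    if returncode != 0:
        raise PorterExplainError(f"porter explain failed: {stderr.strip()}")
    try:
        explain = json.loads(stdout)
    except json.JSONDecodeError as exc:
        raise PorterExplainError("porter explain returned non-JSON output") from exc
    # Assumed shape: {"parameters": [{"name": "tre_id", ...}, ...]}
    return [parameter["name"] for parameter in explain.get("parameters", [])]
```

With something like this, a denied registry request would surface as an explicit error on the first run rather than as a misleading "parameter is required" message from the later install.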

Looking at the commit history I can see the az login / az acr login were previously called before running porter explain:

image

Commit:

https://github.com/microsoft/AzureTRE/commit/c382f3daa041f337455ec47fef24eedad5ce55e6

Wondering if the az login & az acr login should have remained before calling porter explain?

jonnyry commented 1 month ago

You can recreate this bug without having to deploy a fresh TRE by:

marrobi commented 1 month ago

I can remember looking at this at the time. Weird that you have seen it, as that was a while ago; I don't think I've come across the issue, and our E2E PR tests would fail if it were happening. So I'm confused why you're seeing this now, and not in the tests.

Looking at the code, the login needs running once on RP start-up, and that is done here - https://github.com/microsoft/AzureTRE/blob/ddbbffe70fc6a8fe5d0b430afc4c18116f7ff993/core/terraform/resource_processor/vmss_porter/cloud-config.yaml#L91

Looking at your logs I think your actual error is Error message: parameter "tre_id" is required. Is this a custom bundle? If so, I think you are missing passing tre_id somewhere.

Danny-Cooke-CK commented 1 month ago

I've seen this too recently

tim-allen-ck commented 1 month ago

@jonnyry I think I've seen this before when deploying a workspace. As you say, subsequent deploys work; that's been our "workaround".

jonnyry commented 1 month ago

> @jonnyry I think I've seen this before when deploying a workspace. As you say, subsequent deploys work; that's been our "workaround".

Yes - also our workaround :-) Just thought I'd get it logged, as I've seen it several times now.

jonnyry commented 1 month ago

> Looking at the code, the login needs running once on RP start-up, and that is done here - https://github.com/microsoft/AzureTRE/blob/ddbbffe70fc6a8fe5d0b430afc4c18116f7ff993/core/terraform/resource_processor/vmss_porter/cloud-config.yaml#L91

I notice the az acr login is run on the VM itself rather than inside the resource processor docker container - is the az "session" shared inside the docker container?

  - az acr login --name ${docker_registry_server}
  - docker run -d -p 8080:8080 -v /var/run/docker.sock:/var/run/docker.sock
    --restart always --env-file .env
    --name resource_processor1
    --log-driver local
    ${docker_registry_server}/${resource_processor_vmss_porter_image_repository}:${resource_processor_vmss_porter_image_tag}

> Looking at your logs I think your actual error is Error message: parameter "tre_id" is required. Is this a custom bundle? If so, I think you are missing passing tre_id somewhere.

The logs in the issue description are for a custom bundle; however, it also happens for standard bundles. This is from a test I ran just now after resetting the cache & docker credentials inside the resource processor container:

1) Main step for 28b8b4b2-8840-4eac-89d9-ab6294ac1aa2
28b8b4b2-8840-4eac-89d9-ab6294ac1aa2: Error message: parameter "address_spaces" is required ; Command executed: porter install "28b8b4b2-8840-4eac-89d9-ab6294ac1aa2" --reference XXXXX.azurecr.io/tre-workspace-airlock-import-review:v0.12.16 --force --credential-set arm_auth --credential-set aad_auth

jonnyry commented 1 month ago

It looks like az login & az acr login are called when running a constructed porter command (install etc):

https://github.com/microsoft/AzureTRE/blob/ddbbffe70fc6a8fe5d0b430afc4c18116f7ff993/resource_processor/vmss_porter/runner.py#L100-L109

But not when calling porter explain, prior to the above code running:

https://github.com/microsoft/AzureTRE/blob/ddbbffe70fc6a8fe5d0b430afc4c18116f7ff993/resource_processor/resources/commands.py#L106-L107

marrobi commented 1 month ago

Docker is passed through to the container, so the creds should pass through. I'm sure I tested it.

But yes, the fix, if it is an issue, is to add the login commands to the explain command.
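A rough sketch of what that could look like, assuming the explain call is built as a single shell command string. This is not the actual change to commands.py: `build_explain_command` and its parameters are hypothetical names, and the exact az login arguments depend on how the resource processor authenticates.

```python
import shlex
from typing import List


def build_explain_command(registry_server: str, bundle_name: str, version: str,
                          login_commands: List[str]) -> str:
    """Return one shell command: the login steps first, then porter explain."""
    reference = f"{registry_server}/{bundle_name}:v{version}"
    explain = f"porter explain --reference {shlex.quote(reference)} --output json"
    # Chain with && so explain only runs when every login step has succeeded.
    return " && ".join([*login_commands, explain])


# Example values only: the registry, bundle version and login arguments
# below are placeholders, not taken from a real deployment.
registry = "myregistry.azurecr.io"
command = build_explain_command(
    registry,
    "tre-workspace-base",
    "1.0.0",
    ["az login --identity", f"az acr login --name {registry.split('.')[0]}"],
)
print(command)
```

Chaining with `&&` means a failed login stops the command before porter explain runs, so the failure shows up at the login step rather than as a denied registry request.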

jonnyry commented 1 month ago

OK just checking the creds on the VM and inside the resource processor container... the two are not the same, at least on my instance :-D

image
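For anyone wanting to repeat that comparison, here is a small sketch that diffs the registries with stored credentials in two Docker config files. It assumes the credentials live under the `auths` key of `~/.docker/config.json` on both the VM and inside the container, and it does not cover credential helpers; the function names are hypothetical.

```python
import json
from typing import Set


def registries_with_auth(docker_config_text: str) -> Set[str]:
    """Return the registry hosts that have an entry under "auths"."""
    config = json.loads(docker_config_text)
    return set(config.get("auths", {}).keys())


def missing_in_container(vm_config_text: str, container_config_text: str) -> Set[str]:
    """Registries the VM is logged in to but the container is not."""
    return registries_with_auth(vm_config_text) - registries_with_auth(container_config_text)
```

Feed it the contents of the config file from the VM and from inside the resource processor container; a non-empty result would suggest the az acr login run on the VM is not visible inside the container.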