microsoft / AzureTRE

An accelerator to help organizations build Trusted Research Environments on Azure.
https://microsoft.github.io/AzureTRE
MIT License

guac linuxvm status: "deployment failed" on install in workspace after upgrading bundle w/ make user_resource_bundle locally #3823

Closed m1p1h closed 9 months ago

m1p1h commented 9 months ago

I'm trying to test an upgrade to the linuxvm user resource by running the following make command locally:

make user_resource_bundle BUNDLE=guacamole-azure-linuxvm WORKSPACE_SERVICE=guacamole

The make script completes successfully. However, when I try to deploy a linuxVM in a workspace with the upgraded version I get a deployment_failed error:

: 4bd2d574-628c-4bff-8372-e26cc4dc876c: Error message: Unable to find image '***.azurecr.io/tre-service-guacamole-linuxvm@sha256:2d7aa9e5c8318941f02dd57e7975e29a502c9ae7242a9f24fee30260165646b8' locally exec /cnab/app/run: exec format error 2 errors occurred: * container exit code: 1, message: <nil>. fetching outputs failed: error copying outputs from container: Error response from daemon: Could not find the file /cnab/app/outputs in container a5c93eddd67a154087f7f979a6bc68ec31fbcc6d4222e7b91c29290bac7597e6 * required output hostname is missing and has no default ; Command executed: az cloud set --name AzureCloud && az login --identity -u bba7efea-eed7-4319-8695-24c61d9dc0c4 && az acr login --name ***acr && porter install "4bd2d574-628c-4bff-8372-e26cc4dc876c" --reference ***acr.azurecr.io/tre-service-guacamole-linuxvm:v0.8.0 --param arm_environment="public" --param arm_use_msi="true" --param azure_environment="AzureCloud" --param id="4bd2d574-628c-4bff-8372-e26cc4dc876c" --param os_image="Ubuntu 18.04" --param parent_service_id="ba82e7c8-f5ce-4abb-994c-86bfaeb501cf" --param shared_storage_access="True" --param tfstate_container_name="tfstate" --param tfstate_resource_group_name="rg-***-mgmt" --param tfstate_storage_account_name="***mgmtstore" --param tre_id="***" --param vm_size="2 CPU | 8GB RAM" --param workspace_id="2954681e-b9fd-4551-b15b-ae1cbc4ca9d2" --force --credential-set arm_auth --credential-set aad_auth

For this test, I haven't changed the linuxvm code apart from the version number in the porter.yaml file. The Porter bundle and image are definitely in the ACR, in this case v0.8.0. The odd thing is that a colleague can run the same make command locally, deploy the linuxvm CNAB with a later version number than the one I previously deployed into the same AzureTRE instance, and everything works: we can successfully deploy a linuxvm from that upgraded user resource.

I have the same permissions as they do. The only difference we can see is that I'm running the AzureTRE devcontainer on a Mac and they're on Windows, which shouldn't affect anything.

Any suggestions on what else we might want to consider?

m1p1h commented 9 months ago

I completely removed the existing TRE instance and ran 'make all' to start afresh. It got as far as deploying the shared firewall bundle and then gave the same error as before:

Error message: Unable to find image '***acr.azurecr.io/tre-shared-service-firewall@sha256:50ec34d66b6297f4de21ca7bf73bb9828791fc71f55fcbb8c41c0d580b986b5c' locally exec /cnab/app/run: exec format error 2 errors occurred: * container exit code: 1, message: <nil>. fetching outputs failed: error copying outputs from container: Error response from daemon: Could not find the file /cnab/app/outputs in container 1b2a0af2e5f4ea4da6287dd88bdbc5aea83947c7239f4560795564e4106d10a4 * required output porter-state is missing and has no default ; Command executed: az cloud set --name AzureCloud && az login --identity -u 0e87ef80-6ba8-4887-a998-d1f4d9131075 && az acr login --name ***nwsdedevacr && porter install "873e7177-d3a7-4c29-8c92-9f5cd8749bed" --reference ***devacr.azurecr.io/tre-shared-service-firewall:v1.1.5 --param arm_environment="public" --param arm_use_msi="true" --param id="873e7177-d3a7-4c29-8c92-9f5cd8749bed" --param microsoft_graph_fqdn="graph.microsoft.com" --param tfstate_container_name="tfstate" --param tfstate_resource_group_name="rg-***dev-mgmt" --param tfstate_storage_account_name="***devmgmtstore" --param tre_id="***dev" --force --credential-set arm_auth --credential-set aad_auth

I am running make all from the devcontainer and my dev machine has an M2 chip. I can build and publish bundles OK, but I'm unable to deploy them.

marrobi commented 9 months ago

Hi @m1p1h, I'll try to validate the first issue. User VM upgrades aren't something that happens very often, as there is a risk the VM will be replaced. What's the upgrade scenario?

For both issues, is there anything in the API logs? ( https://microsoft.github.io/AzureTRE/latest/troubleshooting-faq/app-insights-logs/ ) Also what release/branch are you deploying from?

Thanks.

m1p1h commented 9 months ago

Hi @marrobi, thanks for looking into this. I'm part of the nwsde dev team and I need a way to test changes locally within a running TRE instance (ideally without having to rebuild the whole thing every time). In the initial case, I'm looking to update the windowsvm and linuxvm user resources. But I think there's a bigger issue: any CNAB I build and deploy from my local machine hits the error above. The resource_processor appears to be unable to see the bundles in the ACR, even though I can see via the portal that they exist and that they are registered correctly in Cosmos (checked via the API). A colleague running the same make command with the same code doesn't experience this issue.

My dev machine is a Mac Air (M2 chip) running Sonoma (14.2.1) with Docker Desktop (4.26.1, Engine: 24.0.7, using Rosetta emulation).

I'm running AzureTRE release 0.16.0.

The logs don't give much else (see attached). query_data.csv

marrobi commented 9 months ago

OK, so the "Unable to find image ... locally" message is standard: Docker always shows this if an image does not exist locally, then pulls the image from the registry.

@jjgriff93 @martinpeck any comments on the Mac setup? (I run Windows and WSL so can't comment).

A separate tip though: if iterating locally, it's often possible to just deploy the terraform (as long as no VNet access is required), which means you can iterate faster. You need to ensure there is a deploy.sh script and that the .env file is correct, but you can then use make terraform-deploy

marrobi commented 9 months ago

@m1p1h those logs are all resource processor logs, not API logs. Can you check with AppRoleName set to API when searching the logs? Thanks.

m1p1h commented 9 months ago

Here are the logs for api and resource_processor... query_data(2).csv

marrobi commented 9 months ago

Doesn't give much away. Can you set debug on the API and try again? https://microsoft.github.io/AzureTRE/latest/troubleshooting-faq/debug-api/

The other thing would be to jump onto the resource processor and watch the logs as the bundle is installed, see "Logs" at https://microsoft.github.io/AzureTRE/latest/troubleshooting-faq/troubleshooting-rp/

jjgriff93 commented 9 months ago

Hello - I recall having an issue with the resource processor not picking up messages when it was bundled and deployed from my machine (M1), but I never got to the bottom of it; I generally used Codespaces (Linux) for working on the TRE. However, when I was experimenting with this I remember getting further when using QEMU and getting Docker to build linux/amd64 images, so you could give this a try.

You can do this by modifying the Dockerfile.tmpl of the bundle you're building and deploying, from:

FROM debian:bullseye-slim

to

FROM --platform=linux/amd64 debian:bullseye-slim

marrobi commented 9 months ago

Hmm, I think that is covered in build.sh:

ARCHITECTURE=$(uname -m)

if [ "${ARCHITECTURE}" == "arm64" ]; then
    DOCKER_BUILD_COMMAND="docker buildx build --platform linux/amd64"
else
    DOCKER_BUILD_COMMAND="docker build"
fi

@m1p1h what does uname -m return on your machine?

m1p1h commented 9 months ago

Returns arm64. I can also see that docker buildx is being used.

m1p1h commented 9 months ago

Although looking at the built images it does look like for some bundles the arch is still being built as arm64...

The resource processor does get built for the amd64 arch, but tre-shared-service-firewall (where the error happens) is being built as arm64:

sha256:e87f74ca090dfe5d43ac71e18fbffa13c7c69f79d565ec97344f1c631701a89b [] amd64
sha256:21ff037dcd01b258762df05946ff7d254a45292a60abf5c13395e7c6e29e6cbe [***acr.azurecr.io/microsoft/azuretre/resource-processor-vm-porter:0.7.1] amd64
sha256:a4ef98b87bbc78fed1a9c6e9f53dac014becb106882259567a9fe0cd1405d4cd [] amd64
sha256:6fff07427dc49e31a62d034e33635c680a0ad9349d3967cb8343e87827dff196 [***acr.azurecr.io/microsoft/azuretre/api:0.16.9] amd64
sha256:f4a266371ab5aa46a914d278dd79f43c037947c76c06e88afba82c5773ce2511 [azuretre/tre-shared-service-firewall:porter-1cd9bd2322c189592902d3951c82358c ***acr.azurecr.io/tre-shared-service-firewall:porter-a0b51da0df4008e46e7def3ae3a54624] arm64
sha256:f4a266371ab5aa46a914d278dd79f43c037947c76c06e88afba82c5773ce2511 [azuretre/tre-shared-service-firewall:porter-1cd9bd2322c189592902d3951c82358c nwsdedevacr.azurecr.io/tre-shared-service-firewall:porter-a0b51da0df4008e46e7def3ae3a54624] arm64
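A listing like the one above can be filtered so that only the mismatched images remain. The following is a minimal sketch, assuming each input line has the form `<id> [<repo tags>] <arch>` as shown; the function name `non_amd64_images` is hypothetical and not part of AzureTRE.

```shell
#!/usr/bin/env sh
# Sketch only: flag images whose architecture is not amd64.
# Assumes one image per line, with the architecture as the last field,
# e.g. "sha256:... [repo/name:tag] arm64".

# Print every non-empty input line whose last field is not "amd64".
non_amd64_images() {
    awk 'NF > 0 && $NF != "amd64"'
}

# Possible usage against the local Docker daemon (needs docker on PATH):
#   docker image ls -q | sort -u \
#     | xargs docker image inspect \
#         --format '{{.Id}} {{.RepoTags}} {{.Architecture}}' \
#     | non_amd64_images
```

Any line this prints is an image that would be expected to fail with "exec format error" on an amd64 host such as the resource processor VM.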

m1p1h commented 9 months ago

> Hmm, I think that is covered in build.sh:
>
> ARCHITECTURE=$(uname -m)
>
> if [ "${ARCHITECTURE}" == "arm64" ]; then
>     DOCKER_BUILD_COMMAND="docker buildx build --platform linux/amd64"
> else
>     DOCKER_BUILD_COMMAND="docker build"
> fi
>
> @m1p1h what does uname -m return on your machine?

In cli/scripts/build.sh I don't see this. Am I looking in the wrong place?

marrobi commented 9 months ago

I searched a completely different repo for a different project!

Maybe that's the solution we need... in /devops/scripts/bundle_runtime_image_build.sh

marrobi commented 9 months ago

@m1p1h let us know if that works. It would be great if you could do a PR; as I'm on Windows, I can't test.

However, I think I've found a related bug here - https://github.com/microsoft/AzureTRE/issues/3824

m1p1h commented 9 months ago

@marrobi will do. I did quickly try the code above in bundle_runtime_image_build.sh, but it still built some bundle images as arm64, resulting in the same error. That might be because 'uname -m' returns x86_64 in the devcontainer.

marrobi commented 9 months ago

Hmm, interesting. I wonder if there's a way to find the Docker architecture.

m1p1h commented 9 months ago

I think we can use docker info --format '{{ .Architecture }}', which will give aarch64 for Docker instances running on arm64.
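Combined with the earlier build.sh snippet, that suggestion might look like the sketch below. The function names `select_build_command` and `docker_architecture` are hypothetical, and the fallback to `uname -m` is an assumption for machines where the Docker daemon cannot be queried.

```shell
#!/usr/bin/env sh
# Sketch only: choose the build command from the Docker daemon's
# architecture rather than from `uname -m`, which was reported to
# return x86_64 inside the devcontainer.

# Map an architecture string to a build command.
# Both common ARM spellings select a cross-build for linux/amd64.
select_build_command() {
    case "$1" in
        arm64|aarch64) echo "docker buildx build --platform linux/amd64" ;;
        *)             echo "docker build" ;;
    esac
}

# Ask the daemon for its architecture; fall back to uname -m (assumption)
# if docker is missing or the daemon does not answer.
docker_architecture() {
    if detected_arch=$(docker info --format '{{ .Architecture }}' 2>/dev/null) \
        && [ -n "${detected_arch}" ]; then
        echo "${detected_arch}"
    else
        uname -m
    fi
}

DOCKER_BUILD_COMMAND=$(select_build_command "$(docker_architecture)")
echo "Using: ${DOCKER_BUILD_COMMAND}"
```

This only affects images built directly with docker; it does not change what `porter build` produces for the bundles.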

m1p1h commented 9 months ago

Just to confirm the change to the docker build would work, I hardcoded the build command in devops/scripts/bundle_runtime_image_build.sh to:

docker buildx build --platform linux/amd64 --build-arg BUILDKIT_INLINE_CACHE=1 \
  -t "${FULL_IMAGE_NAME_PREFIX}/${image_name}:${version}" \
  "${docker_cache[@]}" -f "${docker_file}" "${docker_context}"

But I still see the same issue where some bundles (tre-shared-service-firewall) are still built with an arm64 arch. This would suggest some bundle images are being built elsewhere? I notice in the Makefile there is a 'build_image' function defined, but it looks like that is used to build the api, resource processor and airlock processors only.

sha256:001f66786a65ed20a8353dd73c613b996c087de99a61acea83638b08d6d02385 [***acr.azurecr.io/microsoft/azuretre/airlock-processor:0.7.0] amd64
sha256:1c7196ff44bd2eec2bafe43732b68d57ec3cecd4191861f2c8fa37673aae2e94 [***acr.azurecr.io/microsoft/azuretre/resource-processor-vm-porter:0.7.1] amd64
sha256:1b4c2d41d42075c489cf80eac1796e1efdba4e6e058c400a04efb300acbe27da [***acr.azurecr.io/microsoft/azuretre/api:0.16.9] amd64
sha256:f4a266371ab5aa46a914d278dd79f43c037947c76c06e88afba82c5773ce2511 [azuretre/tre-shared-service-firewall:porter-1cd9bd2322c189592902d3951c82358c ***acr.azurecr.io/tre-shared-service-firewall:porter-a0b51da0df4008e46e7def3ae3a54624] arm64
sha256:f4a266371ab5aa46a914d278dd79f43c037947c76c06e88afba82c5773ce2511 [azuretre/tre-shared-service-firewall:porter-1cd9bd2322c189592902d3951c82358c ***acr.azurecr.io/tre-shared-service-firewall:porter-a0b51da0df4008e46e7def3ae3a54624] arm64

marrobi commented 9 months ago

Thinking about it, the --platform linux/amd64 does not need to be conditional; there's no reason it can't be used every time docker buildx build is run.

Ah, actually it's porter build that builds the bundles, not docker. I wonder if that has architecture options.

marrobi commented 9 months ago

Looks like the answer is that the platform needs to be added to each Porter bundle's Dockerfile - https://github.com/getporter/porter/issues/2021#issuecomment-1195738012

Can you try this with templates/shared_services/firewall/Dockerfile.tmpl ?

marrobi commented 9 months ago

Actually, this is caused by us having custom Dockerfile.tmpl files that don't specify platform.
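One way to apply that change to a template is sketched below. The helper name `pin_from_platform` is hypothetical, and the thread does not confirm that this alone resolves the deployment failure; the sketch only rewrites `FROM` lines to the form suggested earlier.

```shell
#!/usr/bin/env sh
# Sketch only: pin every FROM line read on stdin to linux/amd64.
# An existing --platform flag is replaced rather than duplicated;
# all other lines pass through unchanged.
pin_from_platform() {
    sed -E 's#^FROM[[:space:]]+(--platform=[^[:space:]]+[[:space:]]+)?#FROM --platform=linux/amd64 #'
}

# Possible usage on the firewall template mentioned above:
#   tmpl=templates/shared_services/firewall/Dockerfile.tmpl
#   pin_from_platform < "${tmpl}" > "${tmpl}.new" && mv "${tmpl}.new" "${tmpl}"
```

Writing to a temporary file and moving it into place avoids relying on `sed -i`, whose syntax differs between GNU and BSD (macOS) sed.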

m1p1h commented 9 months ago

@marrobi, could you give me access to create a branch?

marrobi commented 9 months ago

@m1p1h you need to create a fork. Then a PR back if all is good. Thanks.