microsoft / ga4gh-tes

C# implementation of the GA4GH TES API; provides distributed batch task execution on Microsoft Azure
MIT License

Invalid Container Settings injected when using private images #167

Closed by maxnewbould-asterinsights 1 year ago

maxnewbould-asterinsights commented 1 year ago

Deployed with the 4.0.0 release deployer, using PrivateNetworking=true in a hub-and-spoke model, with the vnets, subnets, ACR, and storage configured per private-coa.md. Workflows that use public container images run successfully.

When using an image from a private ACR (I verified permissions, ran az aks check-acr, and pulled the image using MSI from a debug pod launched with the same AADPodIdentity), TES reports the following:

Task 75fe9035_de878743a063451ca80bec15efadb1a3 failed. ExitCode: , BatchJobInfo: {"MoreThanOneActiveJobOrTaskFound":false,"ActiveJobWithMissingAutoPool":false,"AttemptNumber":1,"NodeAllocationFailed":false,"NodeErrorCode":null,"NodeErrorDetails":null,"JobState":0,"NodeState":0,"TaskState":3,"TaskExitCode":null,"TaskExecutionResult":1,"TaskStartTime":"2023-03-23T00:11:14.226296Z","TaskEndTime":"2023-03-23T00:11:14.701518Z","TaskFailureInformation":{"Category":0,"Code":"ContainerInvalidSettings","Details":[{"Name":"ContainerSettings","Value":"--rm -v /var/run/docker.sock:/var/run/docker.sock -v $AZ_BATCH_NODE_ROOT_DIR:$AZ_BATCH_NODE_ROOT_DIR "},{"Name":"Message","Value":"Duplicate mount point: /mnt/batch/tasks"}],"Message":"At least one value of specified task container settings is invalid"},"TaskContainerState":null,"TaskContainerError":null,"Pool":{"AutoPoolSpecification":null,"PoolId":"TES-FCBVLNGS-F2s_v2-U2WWYO7JPSYF4B5CWHOIPBPBQ5R5IWD3-H3SJ4XXN"}}
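For anyone triaging the same message, the relevant fields are easier to read once the JSON after "BatchJobInfo: " is parsed. Below is a minimal Python sketch. The helper name summarize_failure is hypothetical (it is not part of TES), and the sample string is an abbreviated copy of the message above that keeps only a few of its fields.

```python
import json


def summarize_failure(message: str) -> dict:
    """Extract the failure code, message, and details from a TES task message."""
    # The JSON payload follows the "BatchJobInfo: " marker in the message.
    marker = "BatchJobInfo: "
    payload = message[message.index(marker) + len(marker):]
    info = json.loads(payload)

    failure = info.get("TaskFailureInformation") or {}
    # "Details" is a list of {"Name": ..., "Value": ...} pairs; flatten it.
    details = {d["Name"]: d["Value"] for d in failure.get("Details") or []}
    return {
        "code": failure.get("Code"),
        "message": failure.get("Message"),
        "details": details,
        "pool_id": (info.get("Pool") or {}).get("PoolId"),
    }


# Abbreviated copy of the message reported above (most fields omitted).
sample = (
    'Task 75fe9035_de878743a063451ca80bec15efadb1a3 failed. ExitCode: , BatchJobInfo: '
    '{"TaskFailureInformation":{"Category":0,"Code":"ContainerInvalidSettings",'
    '"Details":[{"Name":"Message","Value":"Duplicate mount point: /mnt/batch/tasks"}],'
    '"Message":"At least one value of specified task container settings is invalid"},'
    '"Pool":{"AutoPoolSpecification":null,'
    '"PoolId":"TES-FCBVLNGS-F2s_v2-U2WWYO7JPSYF4B5CWHOIPBPBQ5R5IWD3-H3SJ4XXN"}}'
)

if __name__ == "__main__":
    for key, value in summarize_failure(sample).items():
        print(f"{key}: {value}")
```

In this case the useful parts are the Code (ContainerInvalidSettings) and the Message detail (Duplicate mount point: /mnt/batch/tasks).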

Steps to Reproduce

Follow private-coa.md, copy the MCR Ubuntu 22.04 image to the private ACR, and launch a workflow that uses the image from that ACR as its tasks' docker setting.

Expected behavior

The container should run the same way as it does when the image is public. I'm not sure why a duplicate mount is injected.
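One way to reason about the "Duplicate mount point" message is to list the mount targets in the reported ContainerSettings and compare them with paths that might already be mounted. The Python sketch below does that under two assumptions that the thread does not confirm: that $AZ_BATCH_NODE_ROOT_DIR resolves to /mnt/batch/tasks (inferred only from the error text), and that this path is already mounted into the container before the extra -v options are applied. The function names are hypothetical.

```python
import shlex


def mount_targets(options: str, env: dict) -> list:
    """Return the container-side target of every -v/--volume option."""
    tokens = shlex.split(options)
    targets = []
    for i, token in enumerate(tokens):
        if token in ("-v", "--volume") and i + 1 < len(tokens):
            spec = tokens[i + 1]
            # Naive variable expansion: substitute each known $NAME literally.
            for name, value in env.items():
                spec = spec.replace("$" + name, value)
            parts = spec.split(":")
            # "src:dst[:mode]" -> dst; a bare path is its own target.
            targets.append(parts[1] if len(parts) > 1 else parts[0])
    return targets


def duplicate_mounts(options: str, env: dict, premounted: list) -> list:
    """Return targets that collide with premounted paths or with each other."""
    seen = set(premounted)
    duplicates = []
    for target in mount_targets(options, env):
        if target in seen:
            duplicates.append(target)
        seen.add(target)
    return duplicates


# ContainerSettings value copied from the error message above.
settings = (
    "--rm -v /var/run/docker.sock:/var/run/docker.sock "
    "-v $AZ_BATCH_NODE_ROOT_DIR:$AZ_BATCH_NODE_ROOT_DIR "
)
# ASSUMPTION: value inferred from "Duplicate mount point: /mnt/batch/tasks".
env = {"AZ_BATCH_NODE_ROOT_DIR": "/mnt/batch/tasks"}
# ASSUMPTION: this path is already mounted before the options are applied.
premounted = ["/mnt/batch/tasks"]

if __name__ == "__main__":
    print(duplicate_mounts(settings, env, premounted))
```

If those assumptions hold, the explicit -v $AZ_BATCH_NODE_ROOT_DIR:$AZ_BATCH_NODE_ROOT_DIR option would be the colliding entry, but that still does not explain why it only fails with the private image.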

Deployment details:

Note that the AKS cluster can launch pods using the same private ACR, and a pod with the same AADPodBinding as TES can pull images without issue. Other NGS batch-scheduling tooling (Nextflow) runs its workflows on the same infrastructure without issue.

Thanks for your help and attention!

maxnewbould-asterinsights commented 1 year ago

Nit: private-coa.md still refers to CosmosDB (an artifact from an earlier CoA version).

bencehezso commented 1 year ago

@maxnewbould-m2gen I have the same problem. Have you been able to find a workaround until it is fixed?