microsoft / CromwellOnAzure

Microsoft Genomics implementation of the Broad Institute's Cromwell workflow engine on Azure
MIT License
135 stars 55 forks source link

Issues with using ACR docker images with broadinstitute/cromwell #363

Open jlester-msft opened 2 years ago

jlester-msft commented 2 years ago

When using an ACR url to host a docker image there are two interesting side effects:

  1. Call caching is apparently disabled by Cromwell, because it cannot get a hash for the ACR docker image, and the Cromwell docker image reports a warning
  2. metadata output no longer contains 'compressedDockerSize' key/value pair

Call caching behavior with ACR docker images

To reproduce this, run a WDL with a docker URL pointing to an ACR. You can then see warning output inside the broadinstitute/cromwell image warning about the registry not being supported.

According to this issue: https://github.com/broadinstitute/cromwell/issues/4171 and this one https://github.com/broadinstitute/cromwell/issues/4912 "Docker lookup failed" seems to indicate that Cromwell could not get the hash for the docker image and one was not provided. It appears that Cromwell will then not support call-caching on this task because it doesn't have a hash to verify the image.

To see the output, run the WDL with the ACR docker image.

You will find the warning on the line saying Docker lookup failed and a note about the specified ACR (wdltest.azurecr.io in the example below) not being supported. Using images from us.gcr.io seems to not cause this warning or call-caching issue.

2022-04-10 19:30:22,809 cromwell-system-akka.dispatchers.engine-dispatcher-19064 WARN  - BackendPreparationActor_for_4c15080c:VariantCalling.ValidateVCF:-1:1 [UUID(4c15080c)]: Docker lookup failed
java.lang.Exception: Registry wdltest.azurecr.io is not supported
        at cromwell.engine.workflow.WorkflowDockerLookupActor.cromwell$engine$workflow$WorkflowDockerLookupActor$$handleLookupFailure(WorkflowDockerLookupActor.scala:222)
        at cromwell.engine.workflow.WorkflowDockerLookupActor$$anonfun$3.applyOrElse(WorkflowDockerLookupActor.scala:88)
        at cromwell.engine.workflow.WorkflowDockerLookupActor$$anonfun$3.applyOrElse(WorkflowDockerLookupActor.scala:73)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
        at akka.actor.FSM.processEvent(FSM.scala:707)
        at akka.actor.FSM.processEvent$(FSM.scala:704)
        at cromwell.engine.workflow.WorkflowDockerLookupActor.akka$actor$LoggingFSM$$super$processEvent(WorkflowDockerLookupActor.scala:40)
        at akka.actor.LoggingFSM.processEvent(FSM.scala:847)
        at akka.actor.LoggingFSM.processEvent$(FSM.scala:829)
        at cromwell.engine.workflow.WorkflowDockerLookupActor.processEvent(WorkflowDockerLookupActor.scala:40)
        at akka.actor.FSM.akka$actor$FSM$$processMsg(FSM.scala:701)
        at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:695)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
        at cromwell.docker.DockerClientHelper$$anonfun$dockerResponseReceive$1.applyOrElse(DockerClientHelper.scala:16)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
        at akka.actor.Actor.aroundReceive(Actor.scala:539)
        at akka.actor.Actor.aroundReceive$(Actor.scala:537)
        at cromwell.engine.workflow.WorkflowDockerLookupActor.aroundReceive(WorkflowDockerLookupActor.scala:40)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:614)
        at akka.actor.ActorCell.invoke(ActorCell.scala:583)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
        at akka.dispatch.Mailbox.run(Mailbox.scala:229)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:241)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

It is unclear if call-caching can be enabled if you specify a hash with an ACR image. Some of the older issues on the Cromwell GitHub suggest that adding a hash won't re-enable call-caching.

Metadata missing compressedDockerSize

A minor side effect is that the metadata output would typically list the size of the Docker image pulled in the metadata. Because the registry is unsupported this appears to be missing. So instead of seeing a key/value pair like "compressedDockerSize": 1252162828, there will be no key named compressedDockerSize

Note when specifying a us.gcr.io image the compressedDockerSize is present in the metadata output and populated. And there is no warning shown in the docker logs. Even when the us.gcr.io image does not specify a hash. Suggesting that Cromwell is able to support us.gcr.io registry images but not azurecr.io ones.

Images used to test this behavior were:

docker: 'us.gcr.io/broad-gotc-prod/verify-bam-id:c1cba76e979904eb69c31520a0d7f5be63c72253-1553018888'
Reported size: "compressedDockerSize": 378822083,

docker: 'us.gcr.io/broad-gotc-prod/picard-cloud:2.23.8'
Reported size: "compressedDockerSize": 993441717,
jlester-msft commented 2 years ago

This appears to be a missing feature in Cromwell to support Cromwell on Azure, see https://github.com/broadinstitute/cromwell/blob/develop/docs/cromwell_features/CallCaching.md For more information on how to support docker lookup and the call caching behaviors.

https://github.com/broadinstitute/cromwell/commit/35cb3ab04b06cb10de25c944344b2a1d41bfcdf8 is an example set of changes required to support a new docker source.

jlester-msft commented 1 year ago

As an update for this issue, this is a known issue that only affects call caching (call caching isn't currently working on Azure). The error comes from the Broad Institute's Cromwell server JAR. The error should be more properly worded as Unable to compute docker checksum, Docker lookup failed.. as the only thing missing is the ability for Cromwell to communicate with the server to compute things like image sizes and image checksums. Otherwise properly configured ACRs work just fine.

The Broad is currently working on call-caching and this warning message should go away in future releases.

See here for more information on how to configure ACRs to work with Cromwell on Azure: https://github.com/microsoft/CromwellOnAzure/wiki/Customizing-Your-Instance#use-private-docker-containers-hosted-on-azure

ngambani commented 10 months ago

@jlester-msft is this something that requires any work at MSFT end or is the Broad tackling this?

jlester-msft commented 9 months ago

@ngambani as of broadinstitute/cromwell:87-b6d1f50 the java.lang.Exception: Registry mcr.microsoft.com is not supported warning is still being displayed. This probably needs some MSFT work to ensure that CoA standalone deployments get support for call-caching + ACR outside of Terra -- requiring some Broad Cromwell image changes + deployment changes 1) ACR registry support is enabled in Cromwell image when not using Terra 2) Moving CoA to use the Azure Blob Filesystem driver (requirement for call-caching to be enabled) 3) call-caching being enabled on standalone CoA deployments