microsoft / CromwellOnAzure

Microsoft Genomics implementation of the Broad Institute's Cromwell workflow engine on Azure
MIT License

Clearer error message when VM cannot be provisioned #531

Closed kcibul closed 1 year ago

kcibul commented 1 year ago

Problem: I had a failing task in my workflow that was asking for 2 TB of disk. There was no clear error message other than that the task had failed; the real problem was that the node pool couldn't provision a disk that big.

Solution: A clear and concise message, returned through the workflow metadata, saying that a node could not be provisioned with the requested runtime attributes (and ideally which attribute/quota was exceeded).

MattMcL4475 commented 1 year ago

Hi @kcibul, thanks for reporting this. Were you using a trigger file to start the workflow? If so, did the trigger file have any additional error message?

kcibul commented 1 year ago

No -- I was using Cromshell to launch the workflow (which is working great BTW!)

BMurri commented 1 year ago

@kcibul Did the unclear error message look similar to either of these two strings? If so, which one? If not, what (if anything) did you get?

kcibul commented 1 year ago

In the Cromwell metadata I see:

"failures": [
          {
            "message": "Task JointGenotyping.FinalGatherVcf:NA:1 failed for unknown reason: FailedOrError",
            "causedBy": []
          }
        ],
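
For reference, a minimal Python sketch (not CoA code; the server URL and workflow id below are placeholders) of pulling this failures array from Cromwell's standard REST metadata endpoint:

# Sketch only: fetch workflow metadata from a Cromwell server and print the
# failures tree. The URL and workflow id are placeholders, not values from CoA.
import requests

CROMWELL_URL = "http://localhost:8000"   # assumption: Cromwell's default port
WORKFLOW_ID = "your-workflow-id"         # placeholder

resp = requests.get(f"{CROMWELL_URL}/api/workflows/v1/{WORKFLOW_ID}/metadata")
resp.raise_for_status()
metadata = resp.json()

for failure in metadata.get("failures", []):
    print(failure.get("message"))
    for cause in failure.get("causedBy", []):
        print("  caused by:", cause.get("message"))
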
BMurri commented 1 year ago

Thank you. That confirms my reading of Cromwell's source code.

In the workflow's execution log, the other string should show up. $runStatus may (or may not) provide a little more insight, although it probably just says SYSTEM_ERROR. However, IMHO, that's not good enough.

The causedBy array should be where Cromwell surfaces additional details that CoA gathers from that task attempt. Unfortunately, TES provides error information through a FULL view of the task, while Cromwell only ever requests a MINIMAL view, so the error details cannot flow through. We'll discuss this with Cromwell so that when a task returns an error, they requery TES to get the error details and better fill out that entry in the failures array.
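
As a rough illustration (a sketch only; the TES endpoint and task id are placeholders and authentication is omitted), the per-attempt system_logs are only returned when the task is requested with the FULL view:

# Sketch only: query a GA4GH TES endpoint for one task with MINIMAL vs FULL view.
# TES_URL and TASK_ID are placeholders; CoA's TES normally also requires auth.
import requests

TES_URL = "https://your-coa-tes-endpoint/v1"   # placeholder
TASK_ID = "your-task-id"                       # placeholder

minimal = requests.get(f"{TES_URL}/tasks/{TASK_ID}", params={"view": "MINIMAL"}).json()
full = requests.get(f"{TES_URL}/tasks/{TASK_ID}", params={"view": "FULL"}).json()

print(minimal.get("state"))                # MINIMAL: little more than id and state
for attempt in full.get("logs") or []:     # FULL: per-attempt logs, incl. system_logs
    print(attempt.get("system_logs"))      # e.g. ["NodeAllocationFailed"]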

Along with that, we'll review what we return through that channel to make sure the information is actionable, or at least descriptive, and not overly generic.

jlester-msft commented 1 year ago

There seems to be an issue with Batch attempting to allocate Standard_H series machines using CoA. A toy example WDL failed several times trying to allocate Standard_H8, Standard_H16, Standard_H16m, and Standard_H16mr machines before eventually requesting a Standard_D32d_v4. Batch Explorer displayed a node allocation error of OverconstrainedAllocationRequestError when attempting to allocate Standard_H series machines. In addition to putting more details into the Cromwell logs, machine families that can't allocate nodes should be deprioritized.

See attached CosmosDB data (a short parsing sketch follows it below):

{
    "id": "d19d6514_e8bdbca30fe245938bf2d8f42da5b188",
    "state": "EXECUTOR_ERROR",
    "name": "test.hello",
    "description": "d19d6514-c32a-4979-9f49-166f3a18af7e:BackendJobDescriptorKey_CommandCallNode_test.hello:-1:1",
    "inputs": [
        {
            "name": "commandScript",
            "description": "test.hello.commandScript",
            "url": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/script",
            "path": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/script",
            "type": "FILE",
            "content": null
        }
    ],
    "outputs": [
        {
            "name": "rc",
            "description": "test.hello.rc",
            "url": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/rc",
            "path": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/rc",
            "type": "FILE"
        },
        {
            "name": "stdout",
            "description": "test.hello.stdout",
            "url": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/stdout",
            "path": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/stdout",
            "type": "FILE"
        },
        {
            "name": "stderr",
            "description": "test.hello.stderr",
            "url": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/stderr",
            "path": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/stderr",
            "type": "FILE"
        },
        {
            "name": "commandScript",
            "description": "test.hello.commandScript",
            "url": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/script",
            "path": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/script",
            "type": "FILE"
        }
    ],
    "resources": {
        "cpu_cores": null,
        "preemptible": true,
        "ram_gb": null,
        "disk_gb": 2000,
        "zones": null,
        "backend_parameters": {},
        "backend_parameters_strict": null
    },
    "executors": [
        {
            "image": "ubuntu@sha256:a3785f78ab8547ae2710c89e627783cfa7ee7824d3468cae6835c9f4eae23ff7",
            "command": [
                "/bin/bash",
                "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/script"
            ],
            "workdir": null,
            "stdin": null,
            "stdout": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/stdout",
            "stderr": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/stderr",
            "env": null
        }
    ],
    "volumes": null,
    "tags": null,
    "logs": [
        {
            "logs": [
                {
                    "start_time": null,
                    "end_time": null,
                    "stdout": null,
                    "stderr": null,
                    "exit_code": null
                }
            ],
            "metadata": {
                "Event: Requested new Batch job on pool": "2022-11-29T21:57:46.7725168+00:00",
                "Event: Batch job scheduled": "2022-11-29T21:57:46.9990815+00:00",
                "vm_size": "Standard_H16",
                "vm_family": "standardHFamily",
                "vm_low_priority": "true",
                "vm_price_per_hour_usd": "0.361",
                "vm_memory_in_gb": "112.0",
                "vm_number_of_cores": "16",
                "vm_resource_disk_size_in_gb": "2000.0",
                "vm_max_data_disk_count": "64",
                "failure_reason": "NodeAllocationFailed"
            },
            "start_time": "2022-11-29T21:57:46.9990815+00:00",
            "end_time": "2022-11-29T21:58:22.6357101+00:00",
            "outputs": null,
            "system_logs": [
                "NodeAllocationFailed"
            ]
        },
        {
            "logs": [
                {
                    "start_time": null,
                    "end_time": null,
                    "stdout": null,
                    "stderr": null,
                    "exit_code": null
                }
            ],
            "metadata": {
                "Event: Requested new Batch job on pool": "2022-11-29T21:58:31.1222811+00:00",
                "Event: Batch job scheduled": "2022-11-29T21:58:31.2780674+00:00",
                "vm_size": "Standard_H16r",
                "vm_family": "standardHFamily",
                "vm_low_priority": "true",
                "vm_price_per_hour_usd": "0.398",
                "vm_memory_in_gb": "112.0",
                "vm_number_of_cores": "16",
                "vm_resource_disk_size_in_gb": "2000.0",
                "vm_max_data_disk_count": "64",
                "failure_reason": "NodeAllocationFailed"
            },
            "start_time": "2022-11-29T21:58:31.2780674+00:00",
            "end_time": "2022-11-29T21:59:17.1292057+00:00",
            "outputs": null,
            "system_logs": [
                "NodeAllocationFailed"
            ]
        },
        {
            "logs": [
                {
                    "start_time": null,
                    "end_time": null,
                    "stdout": null,
                    "stderr": null,
                    "exit_code": null
                }
            ],
            "metadata": {
                "Event: Requested new Batch job on pool": "2022-11-29T21:59:24.3847307+00:00",
                "Event: Batch job scheduled": "2022-11-29T21:59:24.6183151+00:00",
                "vm_size": "Standard_H16m",
                "vm_family": "standardHFamily",
                "vm_low_priority": "true",
                "vm_price_per_hour_usd": "0.484",
                "vm_memory_in_gb": "224.0",
                "vm_number_of_cores": "16",
                "vm_resource_disk_size_in_gb": "2000.0",
                "vm_max_data_disk_count": "64",
                "failure_reason": "NodeAllocationFailed"
            },
            "start_time": "2022-11-29T21:59:24.6183151+00:00",
            "end_time": "2022-11-29T21:59:55.2500237+00:00",
            "outputs": null,
            "system_logs": [
                "NodeAllocationFailed"
            ]
        },
        {
            "logs": [
                {
                    "start_time": null,
                    "end_time": null,
                    "stdout": null,
                    "stderr": null,
                    "exit_code": null
                }
            ],
            "metadata": {
                "Event: Requested new Batch job on pool": "2022-11-29T22:00:05.5046757+00:00",
                "Event: Batch job scheduled": "2022-11-29T22:00:05.6563626+00:00",
                "vm_size": "Standard_H16mr",
                "vm_family": "standardHFamily",
                "vm_low_priority": "true",
                "vm_price_per_hour_usd": "0.533",
                "vm_memory_in_gb": "224.0",
                "vm_number_of_cores": "16",
                "vm_resource_disk_size_in_gb": "2000.0",
                "vm_max_data_disk_count": "64",
                "failure_reason": "NodeAllocationFailed"
            },
            "start_time": "2022-11-29T22:00:05.6563626+00:00",
            "end_time": "2022-11-29T22:01:21.9918015+00:00",
            "outputs": null,
            "system_logs": [
                "NodeAllocationFailed"
            ]
        }
    ],
    "creation_time": "2022-11-29T21:57:37.9993287+00:00",
    "error_count": 4,
    "is_cancel_requested": false,
    "end_time": "2022-11-29T22:01:21.9918208+00:00",
    "workflow_id": "d19d6514-c32a-4979-9f49-166f3a18af7e",
    "_partitionKey": "01",
    "_etag": "\"2804bbcf-0000-0100-0000-638681310000\"",
    "_rid": "cEpuAOZRfgnMgQEAAAAAAA==",
    "_self": "dbs/cEpuAA==/colls/cEpuAOZRfgk=/docs/cEpuAOZRfgnMgQEAAAAAAA==/",
    "_attachments": "attachments/",
    "_ts": 1669759281
}
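
As a side note, here is a small Python sketch (assuming the document above has been saved locally as task.json, a hypothetical filename) that summarizes the allocation attempts recorded in the TES task logs:

# Sketch only: summarize per-attempt VM size and failure reason from the
# CosmosDB TES task document above (assumed saved locally as task.json).
import json

with open("task.json") as f:
    task = json.load(f)

for attempt in task.get("logs", []):
    meta = attempt.get("metadata", {})
    print(meta.get("vm_size"), meta.get("vm_family"),
          meta.get("failure_reason"), attempt.get("system_logs"))

# With the data above this prints four Standard_H* attempts,
# each ending in NodeAllocationFailed.
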
BMurri commented 1 year ago

I have a PR ready for Cromwell that takes advantage of the fact (provided by @jlester-msft) that Standard_H8 fails to allocate, in order to set up a test that will fail. That PR returns the following through Cromshell's metadata subcommand with CoA's TES 3.2:

  "failures": [
    {
      "causedBy": [
        {
          "causedBy": [
            {
              "causedBy": [],
              "message": "NoVmSizeAvailable"
            },
            {
              "causedBy": [],
              "message": "No VM (out of 430) available with the required resources (vmsize: Standard_H8, preemptible: True) for task id dcdf4ed9_d3b8958f80ba45f5a8d127d1862e0320. There are no VM sizes that match the requirements. Review the task resources."
            }
          ],
          "message": "Task test.hello:NA:1 failed for unknown reason: SYSTEM_ERROR"
        }
      ],
      "message": "Workflow failed"
    }
  ],

which for the moment is a definite improvement. I'll work with the Cromwell team to see what improvements may be needed to get this exactly right, but at least it surfaces more actionable information.
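
For anyone consuming this programmatically, a short sketch (assuming the metadata JSON has been saved locally as metadata.json, a hypothetical filename) that flattens the nested causedBy entries into a readable list:

# Sketch only: walk the nested failures/causedBy tree from Cromwell metadata
# (assumed saved locally as metadata.json) and print every message, indented.
import json

def walk(failures, depth=0):
    # Recursively print each failure message, indenting nested causedBy entries.
    for failure in failures or []:
        print("  " * depth + failure.get("message", ""))
        walk(failure.get("causedBy"), depth + 1)

with open("metadata.json") as f:
    metadata = json.load(f)

walk(metadata.get("failures"))
# With the example above this prints "Workflow failed", then the SYSTEM_ERROR line,
# then "NoVmSizeAvailable" and the "No VM (out of 430) ..." detail.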

BMurri commented 1 year ago

PR is now @ https://github.com/broadinstitute/cromwell/pull/6980

BMurri commented 1 year ago

The PR was moved by the Cromwell team; it is now at https://github.com/broadinstitute/cromwell/pull/6980

BMurri commented 1 year ago

Fixed by broadinstitute/cromwell#6980, which should show up in Cromwell release 85 (not yet released). I'll update this issue once that version is released and CoA includes it.

BMurri commented 1 year ago

via https://github.com/microsoft/CromwellOnAzure/pull/589