Closed. kcibul closed this issue 1 year ago.
Hi @kcibul, thanks for reporting this. Were you using a trigger file to start the workflow? If so, did the trigger file have any additional error message?
No -- I was using Cromshell to launch the workflow (which is working great BTW!)
@kcibul Did the unclear error message look similar to either of these two strings? If so, which one? If not, what (if anything) did you get?
In the Cromwell metadata I see:
```json
"failures": [
  {
    "message": "Task JointGenotyping.FinalGatherVcf:NA:1 failed for unknown reason: FailedOrError",
    "causedBy": []
  }
],
```
Thank you. That confirms my reading of Cromwell's source code.
In the workflow's execution log, the other string should show up. `$runStatus` may (or may not) provide a little more insight, although it probably just says `SYSTEM_ERROR`. However, IMHO, that's not good enough.

The `causedBy` array should be the place where Cromwell surfaces the additional details that CoA gathers from that task attempt. Unfortunately, TES provides error information through a `FULL` view of the task, while Cromwell only ever requests a `MINIMAL` view, so the error details cannot flow through. We'll discuss this with the Cromwell team so that when a task returns an error, they re-query TES for the error details and better fill out that entry in the `failures` array.

Along with that, we'll review what we return there, to make sure that the information is actionable, or at least descriptive, and not overly generic.
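A rough sketch of the proposed fix, assuming the GA4GH TES `GET /v1/tasks/{id}?view=FULL` endpoint: once a task fails, fetch the FULL view and fold its `system_logs` into `causedBy` entries. The function name and the exact mapping are illustrative, not Cromwell's actual code:

```python
# Sketch (not Cromwell's actual implementation): after a task fails,
# re-query TES with view=FULL and surface its system_logs as causedBy
# entries in the Cromwell failure metadata.

def caused_by_from_full_task(full_task: dict) -> list:
    """Build a `causedBy` list from a FULL-view TES task document.

    Collects the system_logs of every task-level log entry (one per
    attempt) so that no error detail is silently dropped.
    """
    caused_by = []
    for task_log in full_task.get("logs") or []:
        for line in task_log.get("system_logs") or []:
            caused_by.append({"message": line, "causedBy": []})
    return caused_by
```

Applied to the CosmosDB document below, this would yield one `NodeAllocationFailed` entry per allocation attempt instead of an empty `causedBy` array.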
There seems to be an issue with Batch attempting to allocate Standard_H series machines using CoA. A toy example WDL failed several times trying to allocate different Standard_H8, Standard_H16, Standard_H16m, and Standard_H16mr machines before eventually requesting a Standard_D32d_v4. Batch Explorer displayed a Node Allocation Error of OverconstrainedAllocationRequestError when attempting to allocate Standard_H series machines. In addition to putting more details into the Cromwell logs, machine families that can't allocate nodes should be deprioritized.
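The deprioritization suggested above could look something like this sketch; `prioritize_vm_sizes` and the candidate-list shape are hypothetical, not CoA's actual scheduler API:

```python
# Hypothetical sketch of the suggested mitigation: when choosing a VM size,
# push families that recently hit NodeAllocationFailed to the back of the
# candidate list instead of retrying them first.

def prioritize_vm_sizes(candidates: list, failed_families: set) -> list:
    """Reorder candidates so that sizes from families with recent
    allocation failures are tried last; relative order is otherwise kept."""
    healthy = [c for c in candidates if c["vm_family"] not in failed_families]
    degraded = [c for c in candidates if c["vm_family"] in failed_families]
    return healthy + degraded
```

With `standardHFamily` marked as failing, a Standard_D size would be attempted before any Standard_H size, avoiding the four wasted allocation attempts seen in the data below.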
See attached CosmosDB data:
```json
{
  "id": "d19d6514_e8bdbca30fe245938bf2d8f42da5b188",
  "state": "EXECUTOR_ERROR",
  "name": "test.hello",
  "description": "d19d6514-c32a-4979-9f49-166f3a18af7e:BackendJobDescriptorKey_CommandCallNode_test.hello:-1:1",
  "inputs": [
    {
      "name": "commandScript",
      "description": "test.hello.commandScript",
      "url": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/script",
      "path": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/script",
      "type": "FILE",
      "content": null
    }
  ],
  "outputs": [
    {
      "name": "rc",
      "description": "test.hello.rc",
      "url": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/rc",
      "path": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/rc",
      "type": "FILE"
    },
    {
      "name": "stdout",
      "description": "test.hello.stdout",
      "url": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/stdout",
      "path": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/stdout",
      "type": "FILE"
    },
    {
      "name": "stderr",
      "description": "test.hello.stderr",
      "url": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/stderr",
      "path": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/stderr",
      "type": "FILE"
    },
    {
      "name": "commandScript",
      "description": "test.hello.commandScript",
      "url": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/script",
      "path": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/script",
      "type": "FILE"
    }
  ],
  "resources": {
    "cpu_cores": null,
    "preemptible": true,
    "ram_gb": null,
    "disk_gb": 2000,
    "zones": null,
    "backend_parameters": {},
    "backend_parameters_strict": null
  },
  "executors": [
    {
      "image": "ubuntu@sha256:a3785f78ab8547ae2710c89e627783cfa7ee7824d3468cae6835c9f4eae23ff7",
      "command": [
        "/bin/bash",
        "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/script"
      ],
      "workdir": null,
      "stdin": null,
      "stdout": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/stdout",
      "stderr": "/cromwell-executions/test/d19d6514-c32a-4979-9f49-166f3a18af7e/call-hello/execution/stderr",
      "env": null
    }
  ],
  "volumes": null,
  "tags": null,
  "logs": [
    {
      "logs": [
        {
          "start_time": null,
          "end_time": null,
          "stdout": null,
          "stderr": null,
          "exit_code": null
        }
      ],
      "metadata": {
        "Event: Requested new Batch job on pool": "2022-11-29T21:57:46.7725168+00:00",
        "Event: Batch job scheduled": "2022-11-29T21:57:46.9990815+00:00",
        "vm_size": "Standard_H16",
        "vm_family": "standardHFamily",
        "vm_low_priority": "true",
        "vm_price_per_hour_usd": "0.361",
        "vm_memory_in_gb": "112.0",
        "vm_number_of_cores": "16",
        "vm_resource_disk_size_in_gb": "2000.0",
        "vm_max_data_disk_count": "64",
        "failure_reason": "NodeAllocationFailed"
      },
      "start_time": "2022-11-29T21:57:46.9990815+00:00",
      "end_time": "2022-11-29T21:58:22.6357101+00:00",
      "outputs": null,
      "system_logs": [
        "NodeAllocationFailed"
      ]
    },
    {
      "logs": [
        {
          "start_time": null,
          "end_time": null,
          "stdout": null,
          "stderr": null,
          "exit_code": null
        }
      ],
      "metadata": {
        "Event: Requested new Batch job on pool": "2022-11-29T21:58:31.1222811+00:00",
        "Event: Batch job scheduled": "2022-11-29T21:58:31.2780674+00:00",
        "vm_size": "Standard_H16r",
        "vm_family": "standardHFamily",
        "vm_low_priority": "true",
        "vm_price_per_hour_usd": "0.398",
        "vm_memory_in_gb": "112.0",
        "vm_number_of_cores": "16",
        "vm_resource_disk_size_in_gb": "2000.0",
        "vm_max_data_disk_count": "64",
        "failure_reason": "NodeAllocationFailed"
      },
      "start_time": "2022-11-29T21:58:31.2780674+00:00",
      "end_time": "2022-11-29T21:59:17.1292057+00:00",
      "outputs": null,
      "system_logs": [
        "NodeAllocationFailed"
      ]
    },
    {
      "logs": [
        {
          "start_time": null,
          "end_time": null,
          "stdout": null,
          "stderr": null,
          "exit_code": null
        }
      ],
      "metadata": {
        "Event: Requested new Batch job on pool": "2022-11-29T21:59:24.3847307+00:00",
        "Event: Batch job scheduled": "2022-11-29T21:59:24.6183151+00:00",
        "vm_size": "Standard_H16m",
        "vm_family": "standardHFamily",
        "vm_low_priority": "true",
        "vm_price_per_hour_usd": "0.484",
        "vm_memory_in_gb": "224.0",
        "vm_number_of_cores": "16",
        "vm_resource_disk_size_in_gb": "2000.0",
        "vm_max_data_disk_count": "64",
        "failure_reason": "NodeAllocationFailed"
      },
      "start_time": "2022-11-29T21:59:24.6183151+00:00",
      "end_time": "2022-11-29T21:59:55.2500237+00:00",
      "outputs": null,
      "system_logs": [
        "NodeAllocationFailed"
      ]
    },
    {
      "logs": [
        {
          "start_time": null,
          "end_time": null,
          "stdout": null,
          "stderr": null,
          "exit_code": null
        }
      ],
      "metadata": {
        "Event: Requested new Batch job on pool": "2022-11-29T22:00:05.5046757+00:00",
        "Event: Batch job scheduled": "2022-11-29T22:00:05.6563626+00:00",
        "vm_size": "Standard_H16mr",
        "vm_family": "standardHFamily",
        "vm_low_priority": "true",
        "vm_price_per_hour_usd": "0.533",
        "vm_memory_in_gb": "224.0",
        "vm_number_of_cores": "16",
        "vm_resource_disk_size_in_gb": "2000.0",
        "vm_max_data_disk_count": "64",
        "failure_reason": "NodeAllocationFailed"
      },
      "start_time": "2022-11-29T22:00:05.6563626+00:00",
      "end_time": "2022-11-29T22:01:21.9918015+00:00",
      "outputs": null,
      "system_logs": [
        "NodeAllocationFailed"
      ]
    }
  ],
  "creation_time": "2022-11-29T21:57:37.9993287+00:00",
  "error_count": 4,
  "is_cancel_requested": false,
  "end_time": "2022-11-29T22:01:21.9918208+00:00",
  "workflow_id": "d19d6514-c32a-4979-9f49-166f3a18af7e",
  "_partitionKey": "01",
  "_etag": "\"2804bbcf-0000-0100-0000-638681310000\"",
  "_rid": "cEpuAOZRfgnMgQEAAAAAAA==",
  "_self": "dbs/cEpuAA==/colls/cEpuAOZRfgk=/docs/cEpuAOZRfgnMgQEAAAAAAA==/",
  "_attachments": "attachments/",
  "_ts": 1669759281
}
```
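For illustration, the attempt history buried in a document like the one above can be reduced to a short digest of what was tried and why it failed; `summarize_attempts` is a hypothetical helper, not part of CoA:

```python
# Sketch: summarize the allocation attempts recorded in a CoA task document,
# pairing each attempted vm_size with its failure_reason. This is the kind
# of digest that could be surfaced instead of a bare "FailedOrError".

def summarize_attempts(task_doc: dict) -> list:
    """Return (vm_size, failure_reason) tuples, one per logged attempt."""
    summary = []
    for log in task_doc.get("logs") or []:
        meta = log.get("metadata") or {}
        summary.append((meta.get("vm_size"), meta.get("failure_reason")))
    return summary
```

On the document above this yields four tuples, `("Standard_H16", "NodeAllocationFailed")` through `("Standard_H16mr", "NodeAllocationFailed")`.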
I have a PR ready for Cromwell that takes advantage of the fact (provided by @jlester-msft) that `Standard_H8` fails to be allocated, using that to set up a test that is expected to fail. That PR returns the following through Cromshell's `metadata` subcommand with CoA's TES 3.2:
```json
"failures": [
  {
    "causedBy": [
      {
        "causedBy": [
          {
            "causedBy": [],
            "message": "NoVmSizeAvailable"
          },
          {
            "causedBy": [],
            "message": "No VM (out of 430) available with the required resources (vmsize: Standard_H8, preemptible: True) for task id dcdf4ed9_d3b8958f80ba45f5a8d127d1862e0320. There are no VM sizes that match the requirements. Review the task resources."
          }
        ],
        "message": "Task test.hello:NA:1 failed for unknown reason: SYSTEM_ERROR"
      }
    ],
    "message": "Workflow failed"
  }
],
```
which for the moment is a definite improvement. I'll work with the Cromwell team to see what improvements may be needed to get this exactly right, but at least it surfaces more actionable information.
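Since Cromwell nests failures as a `causedBy` tree, the actionable leaf messages (like `NoVmSizeAvailable` and the detailed VM-size explanation) can be pulled out of the generic wrappers with a small recursive walk; this helper is illustrative, not part of Cromshell:

```python
# Sketch: walk a Cromwell `failures` tree and collect the leaf messages,
# which carry the specific error detail, skipping the generic wrapper
# messages like "Workflow failed".

def leaf_messages(failures: list) -> list:
    out = []
    for failure in failures:
        children = failure.get("causedBy") or []
        if children:
            out.extend(leaf_messages(children))  # descend into nested causes
        else:
            out.append(failure.get("message"))   # leaf: keep the detail
    return out
```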
The PR was moved by the Cromwell team to here: https://github.com/broadinstitute/cromwell/pull/6980
Fixed by broadinstitute/cromwell#6980, which should ship in Cromwell release 85 (not yet released). I'll update this issue once that version is released and CoA includes it.
Problem: I had a failing task in my workflow that was asking for 2 TB of disk. There was no clear error message other than that a task had failed. The real problem was that the node pool couldn't provision a disk that big.
Solution: A clear and concise message, returned through the workflow metadata, saying that a node could not be provisioned with the requested runtime attributes (and ideally which attribute/quota was exceeded).
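A minimal sketch of what that message could look like, assuming the TES `resources` field names shown in the data above; the function and wording are illustrative, not an agreed design:

```python
# Sketch: compose a clear, user-facing message from the raw failure_reason
# plus the runtime attributes the task actually requested (TES `resources`
# field names: disk_gb, cpu_cores, ram_gb).

def provision_failure_message(resources: dict, failure_reason: str) -> str:
    parts = []
    if resources.get("disk_gb") is not None:
        parts.append(f"disk_gb={resources['disk_gb']}")
    if resources.get("cpu_cores") is not None:
        parts.append(f"cpu_cores={resources['cpu_cores']}")
    if resources.get("ram_gb") is not None:
        parts.append(f"ram_gb={resources['ram_gb']}")
    attrs = ", ".join(parts) or "no explicit resource requests"
    return (f"Could not provision a node with the requested runtime "
            f"attributes ({attrs}): {failure_reason}")
```

For the 2 TB case above, this would produce something like "Could not provision a node with the requested runtime attributes (disk_gb=2000): NodeAllocationFailed", which points directly at the offending attribute.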