sharkinsspatial opened this issue 1 year ago
Sure thing! This is an unexpected one...
severity: "ERROR"
textPayload: "Workflow failed. Causes: S13:Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_000/ReshufflePerKey/GroupByKey/Read+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_000/ReshufflePerKey/GroupByKey/GroupByWindow+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_000/ReshufflePerKey/FlatMap(restore_timestamps)+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_000/RemoveRandomKeys+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/finalize+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_001/AddRandomKeys+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_001/ReshufflePerKey/Map(reify_timestamps)+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_001/ReshufflePerKey/GroupByKey/Reify+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_001/ReshufflePerKey/GroupByKey/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. If the logs only contain generic timeout errors related to accessing external resources, such as MongoDB, verify that the worker service account has permission to access the resource's subnetwork. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:
a6170692e70616e67656f2d66-12161716-e74o-harness-x9x4
Root cause: The worker lost contact with the service.,
a6170692e70616e67656f2d66-12161716-e74o-harness-x9x4
Root cause: The worker lost contact with the service.,
a6170692e70616e67656f2d66-12161716-e74o-harness-kd6j
Root cause: The worker lost contact with the service.,
a6170692e70616e67656f2d66-12161716-e74o-harness-kd6j
Root cause: The worker lost contact with the service."
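The message above points at earlier log entries for the cause of each of the 4 attempts. For anyone digging further, here is a minimal sketch of how to pull those worker-level errors out of Cloud Logging, assuming you have access to the GCP project; `JOB_ID` and `YOUR_PROJECT` are placeholders for the actual Dataflow job ID and project:

```bash
# List error-level entries for the failed Dataflow job (most recent first).
# Substitute the real job ID and project before running.
gcloud logging read \
  'resource.type="dataflow_step" AND resource.labels.job_id="JOB_ID" AND severity>=ERROR' \
  --project=YOUR_PROJECT \
  --limit=50 \
  --format="value(timestamp, textPayload)"
```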
I wonder if it was just a one-off Dataflow internal issue and whether re-triggering the run would resolve it. These kinds of obscure start-up errors can come up when we change our sdk_container_image,
but we haven't done that recently. 🤔
xref https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/203 (in which the same recipe is being discussed)
@sharkinsspatial I suggest opening a blank-commit PR (i.e. `git commit --allow-empty`) against this repo,
so we can trigger a fresh production run and see whether this was indeed just a random Dataflow error.
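For concreteness, a minimal sketch of that empty-commit PR; the branch name and commit message are just placeholders:

```bash
# Create a branch containing only an empty commit, then push it and open a PR
# so a fresh production run gets triggered.
git checkout -b retrigger-production-run
git commit --allow-empty -m "Empty commit to re-trigger production run"
git push origin retrigger-production-run
# Then open a PR from this branch against the feedstock's default branch.
```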
@cisaacstern When you have a moment, could you post the failed recipe run logs from https://pangeo-forge.org/dashboard/recipe-run/1437?feedstock_id=94 here?