sharkinsspatial opened this issue 1 year ago
Sure thing! This is an unexpected one...
severity: "ERROR"
textPayload: "Workflow failed. Causes: S13:Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_000/ReshufflePerKey/GroupByKey/Read+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_000/ReshufflePerKey/GroupByKey/GroupByWindow+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_000/ReshufflePerKey/FlatMap(restore_timestamps)+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_000/RemoveRandomKeys+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/finalize+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_001/AddRandomKeys+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_001/ReshufflePerKey/Map(reify_timestamps)+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_001/ReshufflePerKey/GroupByKey/Reify+Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/Reshuffle_001/ReshufflePerKey/GroupByKey/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. If the logs only contain generic timeout errors related to accessing external resources, such as MongoDB, verify that the worker service account has permission to access the resource's subnetwork. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:
a6170692e70616e67656f2d66-12161716-e74o-harness-x9x4
Root cause: The worker lost contact with the service.,
a6170692e70616e67656f2d66-12161716-e74o-harness-x9x4
Root cause: The worker lost contact with the service.,
a6170692e70616e67656f2d66-12161716-e74o-harness-kd6j
Root cause: The worker lost contact with the service.,
a6170692e70616e67656f2d66-12161716-e74o-harness-kd6j
Root cause: The worker lost contact with the service."
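The message above points at earlier log entries for the cause of each of the 4 attempts. For anyone digging further, here is a minimal sketch of how to pull those worker-level errors out of Cloud Logging, assuming you have access to the GCP project; `JOB_ID` and `YOUR_PROJECT` are placeholders for the actual Dataflow job ID and project:

```bash
# List error-level entries for the failed Dataflow job (most recent first).
# Substitute the real job ID and project before running.
gcloud logging read \
  'resource.type="dataflow_step" AND resource.labels.job_id="JOB_ID" AND severity>=ERROR' \
  --project=YOUR_PROJECT \
  --limit=50 \
  --format="value(timestamp, textPayload)"
```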
I wonder if it was just a one-off Dataflow internal issue and whether re-triggering the run would resolve it. These kinds of obscure start-up errors can come up when we change our sdk_container_image,
but we haven't done that recently. 🤔
xref https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/203 (in which the same recipe is being discussed)
@sharkinsspatial I suggest opening a blank-commit PR (i.e. `git commit --allow-empty`) against this repo,
so we can trigger a fresh production run and see whether this was indeed just a random Dataflow error.
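For concreteness, a minimal sketch of that empty-commit PR; the branch name and commit message are just placeholders:

```bash
# Create a branch containing only an empty commit, then push it and open a PR
# so a fresh production run gets triggered.
git checkout -b retrigger-production-run
git commit --allow-empty -m "Empty commit to re-trigger production run"
git push origin retrigger-production-run
# Then open a PR from this branch against the feedstock's default branch.
```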
@cisaacstern When you have a moment, could you post the failed recipe run logs from https://pangeo-forge.org/dashboard/recipe-run/1437?feedstock_id=94 here?