Closed by alxmrs 6 months ago
I'm testing this PR end-to-end with a script that uses this file pattern for a GRIB 2 dataset (via the cfgrib backend in xarray), deployed on GCP's Dataflow. Right now I'm getting what appears to be a deadlock. My pipeline ends with this error:
```
Root cause: The worker lost contact with the service.
```
Stack traces in the logs show threads waiting to acquire a lock, though it's unclear whether this is a genuine deadlock or just a big dataset taking a long time to process.
This looks to me like an instance of https://github.com/pydata/xarray/issues/4591. For now, I'm going to experiment with changing the scheduler to use a single thread in the `compute` call inside `_open_chunks()`.
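For concreteness, here's a minimal sketch of the change I have in mind, assuming `_open_chunks()` ends in a dask `compute` call; the function signature, URI, and `chunks` argument below are illustrative, not the PR's actual values:

```python
import xarray as xr

def _open_chunks(uri: str) -> xr.Dataset:
    """Open a GRIB 2 file with cfgrib and eagerly load it on one thread."""
    ds = xr.open_dataset(uri, engine='cfgrib', chunks={'time': 1})
    # scheduler='single-threaded' tells dask to evaluate the task graph
    # synchronously in the calling thread, sidestepping the threaded
    # scheduler's locks that may be deadlocking on Dataflow workers.
    return ds.compute(scheduler='single-threaded')
```

If forcing a single-threaded scheduler makes the hang go away, that would implicate lock contention in the default threaded scheduler rather than dataset size.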
I should mention: the Dataflow diagnostics for the report above show unresponsive threads, which makes a deadlock scenario more plausible.
```
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "keepalive watchdog timeout"
	debug_error_string = "{"created":"@1630579134.284653312","description":"Error received from peer ipv6:[::1]:12371","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"keepalive watchdog timeout","grpc_status":14}"
>
	at _next (/usr/local/lib/python3.8/site-packages/grpc/_channel.py:803)
	at __next__ (/usr/local/lib/python3.8/site-packages/grpc/_channel.py:416)
	at run (/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py:251)
	at main (/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py:182)
```
Deadlocking seems plausible, but this is different from pydata/xarray#4591, which describes a serialization failure.
This has gone stale.
Fixes #29 and #38.