jbusecke opened 5 months ago
I am wondering if my workers had any local storage to begin with:
The n1 machine family seems to generally have local SSD access?
But I am wondering if that needs to be specified as an option?
I think you also have to set `load=True`. It's looking for a local file that isn't there.
Just opened https://github.com/pangeo-forge/pangeo-forge-runner/pull/183 to test whether I can set a larger HDD on the Dataflow workers.
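For context, worker disk size on Dataflow is controlled by Apache Beam's `--disk_size_gb` pipeline option. An illustrative invocation (the recipe script name, project, and region here are placeholders, and how pangeo-forge-runner forwards these options may differ):

```shell
# Illustrative Apache Beam submission to the Dataflow runner.
# --disk_size_gb sets the persistent disk size (in GB) of each worker.
python my_recipe.py \
  --runner=DataflowRunner \
  --project=my-gcp-project \
  --region=us-central1 \
  --machine_type=n1-highmem-2 \
  --disk_size_gb=100
```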
OK, so that seems to generally work, but it might be useful to somehow check whether the current worker has any storage attached. I am not sure this is possible in general, but it would certainly improve the usability of that feature.
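As a starting point for such a check, a generic probe of the worker's local scratch space can be done with the standard library alone (this is not a pangeo-forge API, just a minimal sketch one could run inside a worker):

```python
import shutil
import tempfile


def local_scratch_report(path=None):
    """Report whether the current machine has writable local scratch space.

    Checks the given path (default: the system temp dir) for free space
    and write access. Generic stdlib probe, not a pangeo-forge API.
    """
    path = path or tempfile.gettempdir()
    usage = shutil.disk_usage(path)

    # Probe writability by creating and immediately removing a temp file.
    writable = True
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError:
        writable = False

    return {
        "path": path,
        "free_gb": usage.free / 1e9,
        "total_gb": usage.total / 1e9,
        "writable": writable,
    }


report = local_scratch_report()
print(report)
```

Running this at worker startup (or logging it from the first transform) would at least make "no local storage attached" an explicit, visible condition rather than a downstream file-not-found error.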
> I think you also have to set `load=True`. It's looking for a local file that isn't there.
Should this be an error, rather than a warning? I did not see this anywhere in the logs on dataflow.
I guess this depends on where this is run.
> but it might be useful to somehow check if the current worker has any storage attached?
If the worker has no storage attached, this would fail on `OpenWithXarray`. Instead, it is failing on a combine step, after the fragment has been moved to another worker.
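That failure mode can be mimicked with plain Python: a fragment that only holds a *path* to a worker-local file breaks once that path stops resolving (e.g. after being shipped to a different worker), whereas loading the bytes up front survives the move. A toy sketch, not pangeo-forge code:

```python
import os
import tempfile

# Simulate a worker-local copy of a source file.
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".nc")
tmp.write(b"fake dataset bytes")
tmp.close()

# Lazy fragment: keeps only the local path (analogous to load=False).
lazy_fragment = {"path": tmp.name}

# Eager fragment: materializes the contents (analogous to load=True).
with open(tmp.name, "rb") as f:
    eager_fragment = {"data": f.read()}

# Simulate the fragment being shipped to another worker, where the
# original worker-local file does not exist.
os.remove(tmp.name)

# The eager fragment still carries its data...
assert eager_fragment["data"] == b"fake dataset bytes"

# ...but dereferencing the lazy fragment's path now fails.
assert not os.path.exists(lazy_fragment["path"])
```

This is consistent with the suggestion above: with `load=True` the data is materialized before the fragment leaves the worker, so the combine step no longer depends on a path that only exists on the machine that did the copy.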
So OK, I now ran another test. For the above test case, I am doing the following:
And I am getting these sort of error traces:
Sorry for the ugly screenshot, but I am not even sure what could be happening here...
I am fairly confident I can exclude workers OOMing: the memory usage is very low, and each worker's memory could hold the entire dataset (all files).
As suggested by @rabernat in the meeting earlier today, I ran a test of my (M)RE (see #715) with `copy_to_local` set to `True`. This failed running on Google Dataflow with several errors similar to this (from the Dataflow workflow logs of this job):
So this is not a fix for the issue in #715 yet.
I am not entirely sure how I should go about debugging this further. Any suggestions welcome.