jbusecke opened 5 months ago
Getting some errors like this:
```
FileNotFoundError: [Errno 2] No such file or directory: '//leap-scratch/data-library/feedstocks/cache_concurrent/9530739710fbcf2b76dfc53b9015733e-https_huggingface.co_datasets_leap_climsim_low-res_resolve_main_train_0001-10_e3sm-mmf.mli.0001-10-31-45600.nc' [while running 'Create|CheckpointFileTransfer|OpenURLWithFSSpec|OpenWithXarray|ExpandTimeDimAndAddMetadata|StoreToZarr|InjectAttrs|ConsolidateDimensionCoordinates|ConsolidateMetadata|Copy/OpenWithXarray/Open with Xarray-ptransform-69']
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1435, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 640, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/tmp/49c62bd385aca3688d1a12714f2750fe8c1ff62820958899edc077a7a1b05cccc61vc7xo/lib/python3.10/site-packages/apache_beam/transforms/core.py", line 2046, in <lambda>
  File "/tmp/49c62bd385aca3688d1a12714f2750fe8c1ff62820958899edc077a7a1b05cccc61vc7xo/lib/python3.10/site-packages/pangeo_forge_recipes/transforms.py", line 321, in <lambda>
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/openers.py", line 233, in open_with_xarray
    _copy_btw_filesystems(url_or_file_obj, target_opener)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/storage.py", line 32, in _copy_btw_filesystems
    with input_opener as source:
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/core.py", line 105, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/spec.py", line 1298, in open
    f = self._open(
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/implementations/local.py", line 191, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/implementations/local.py", line 355, in __init__
    self._open()
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/implementations/local.py", line 360, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '//leap-scratch/data-library/feedstocks/cache_concurrent/9530739710fbcf2b76dfc53b9015733e-https_huggingface.co_datasets_leap_climsim_low-res_resolve_main_train_0001-10_e3sm-mmf.mli.0001-10-31-45600.nc'
```
I think this is because I only provide a URL string, not a `CacheFSSpecTarget` object, to the stage.

@moradology maybe we should not allow string input?
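A minimal sketch of what disallowing bare strings could look like (`ensure_cache_target` is a hypothetical helper, not part of the pangeo-forge-recipes API):

```python
from pangeo_forge_recipes.storage import CacheFSSpecTarget

def ensure_cache_target(cache):
    """Hypothetical guard: accept a CacheFSSpecTarget, or coerce a URL explicitly."""
    if isinstance(cache, CacheFSSpecTarget):
        return cache
    if isinstance(cache, str):
        # Coercing here makes the filesystem/protocol inference explicit,
        # instead of letting a bare string slip through to the workers.
        return CacheFSSpecTarget.from_url(cache)
    raise TypeError(f"Expected CacheFSSpecTarget or URL string, got {type(cache)!r}")
```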
Not a bad idea. Still, I wonder what's going wrong. Later in the process (in the `ParDo`) a `CacheFSSpecTarget` is required (https://github.com/pangeo-forge/pangeo-forge-recipes/pull/750/files#diff-8bac120398898793cd4f9daf94551b1f3d3f1867bed8a68b14cceed49d6dc30fR152), but it should be created here in the outer transform: https://github.com/pangeo-forge/pangeo-forge-recipes/pull/750/files#diff-8bac120398898793cd4f9daf94551b1f3d3f1867bed8a68b14cceed49d6dc30fR205-R208
Perhaps relevant that it opens with `//` rather than `gs://`? Maybe it is not creating the target appropriately? Actually, yeah: a closer look at this trace shows that it is trying to use the local filesystem rather than Google Cloud Storage, as is clearly desired.
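This is easy to reproduce in isolation: fsspec picks the filesystem implementation from the URL protocol, and a protocol-less path falls back to the local filesystem. A quick sketch (paths are illustrative; `gs://` resolution requires gcsfs to be installed):

```python
import fsspec

# With an explicit protocol, fsspec resolves the right implementation.
fs, _ = fsspec.core.url_to_fs("gs://leap-scratch/some/file.nc")
print(type(fs).__name__)  # GCSFileSystem

# A protocol-less path like the '//leap-scratch/...' in the traceback
# silently resolves to the local filesystem, hence the Errno 2.
fs, _ = fsspec.core.url_to_fs("//leap-scratch/some/file.nc")
print(type(fs).__name__)  # LocalFileSystem
```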
So weird that this is happening in only some elements! They seem to be reproducible (non-random with respect to filenames), though! I ran the recipe again and it failed on many of the same files. Will investigate further later today.
Oh shoot! Wrapping the URL in a `CacheFSSpecTarget` fixed it! Will increase the concurrency again and test with the full dataset.
Note that I did not use `CacheFSSpecTarget.from_url()` but did this instead:

```python
import gcsfs
from pangeo_forge_recipes.storage import CacheFSSpecTarget

cache_target = CacheFSSpecTarget(
    fs=gcsfs.GCSFileSystem(),
    root_path="gs://leap-scratch/data-library/feedstocks/cache_concurrent",
)
```
This finding is super relevant for the upstream PR. I'll see if I can't drum up a test case to reveal the unexpected behavior with `.from_url` (I'm guessing).
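A rough sketch of the kind of regression test that might expose it, assuming the bug is that `from_url` loses the protocol when building the cache target (the bucket name is made up):

```python
import gcsfs
from pangeo_forge_recipes.storage import CacheFSSpecTarget

def test_from_url_preserves_gcs_protocol():
    target = CacheFSSpecTarget.from_url("gs://some-bucket/cache")
    # The cache should live on GCS, not the local filesystem...
    assert isinstance(target.fs, gcsfs.GCSFileSystem)
    # ...and the root path should still point into the bucket.
    assert "some-bucket" in str(target.root_path)
```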
OK, so I was able to run a complete low-res mli here, with the https-sync patch activated for both the caching and the fsspec-based opening, but I want the download to be faster.

Disabling the https-sync patch and setting concurrency to 20 gives me a bunch of these:
```
Name (https) already in the registry and clobber is False [while running 'Create|CheckpointFileTransfer|OpenURLWithFSSpec|OpenWithXarray|ExpandTimeDimAndAddMetadata|StoreToZarr|InjectAttrs|ConsolidateDimensionCoordinates|ConsolidateMetadata|Copy/OpenURLWithFSSpec/MapWithConcurrencyLimit/open_url-ptransform-68']
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1435, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 640, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/tmp/e3f79e737a8ab2d7b8203218342b4dd2085573636ed42230ccb55a58d8a96f4ep4wjutk7/lib/python3.10/site-packages/apache_beam/transforms/core.py", line 2046, in <lambda>
  File "/tmp/e3f79e737a8ab2d7b8203218342b4dd2085573636ed42230ccb55a58d8a96f4ep4wjutk7/lib/python3.10/site-packages/pangeo_forge_recipes/transforms.py", line 123, in <lambda>
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/openers.py", line 36, in open_url
    open_file = _get_opener(url, secrets, fsspec_sync_patch, **kw)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/storage.py", line 234, in _get_opener
    SyncHTTPFileSystem.overwrite_async_registration()
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/httpfs_sync/core.py", line 403, in overwrite_async_registration
    register_implementation("https", cls)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/registry.py", line 53, in register_implementation
    raise ValueError(
ValueError: Name (https) already in the registry and clobber is False
```
Wondering if this goes away if I reduce the concurrency.
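For reference, the bottom of that trace shows `overwrite_async_registration` calling `register_implementation("https", cls)` without `clobber`, so a reused worker that registers twice will raise. A sketch of an idempotent alternative (not a tested patch to httpfs_sync):

```python
from fsspec.registry import register_implementation
from httpfs_sync.core import SyncHTTPFileSystem

# clobber=True lets a reused Beam worker re-register "https" without raising
# "Name (https) already in the registry and clobber is False".
register_implementation("https", SyncHTTPFileSystem, clobber=True)
```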
> This finding is super relevant for the upstream PR. I'll see if I can't drum up a test case to reveal the unexpected behavior with `.from_url` (I'm guessing).
@moradology should we track this in a separate issue? Just asking since I expect to close this PR soon.
Yoinks, I am all of a sudden getting a lot of failed transfers (for the mlo dataset). Not entirely sure if I am getting rate limited because I just downloaded 800 GB of data in short succession, or if one of the many alterations here broke something.

Have submitted a job with reduced concurrency for now, and will wait until tomorrow to continue.
I just tried to pin the exact commit hash in the requirements and increase concurrency (all files are cached right now).
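For reference, pinning to a specific commit in a requirements file looks like this (`<commit-hash>` is a placeholder; the actual hash isn't shown in the thread):

```
# requirements.txt
pangeo-forge-recipes @ git+https://github.com/pangeo-forge/pangeo-forge-recipes.git@<commit-hash>
```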
Testing https://github.com/pangeo-forge/pangeo-forge-recipes/pull/750.
I did the following here:

Todo:

- `leap-scratch/data-library/feedstocks/cache_concurrent/000b7ecb864a18a4a2b56492d8cf35d4-https_huggingface.co_datasets_leap_climsim_low-res_resolve_main_train_0001-05_e3sm-mmf.mli.0001-05-11-68400.nc` - Confirmed