leap-stc / climsim_feedstock


Test `CheckpointFileTransfer` from recipes PR #750

Open jbusecke opened 5 months ago

jbusecke commented 5 months ago

Testing https://github.com/pangeo-forge/pangeo-forge-recipes/pull/750.

I did the following here:

Todo:

jbusecke commented 5 months ago

Getting some errors like this:

FileNotFoundError: [Errno 2] No such file or directory: '//leap-scratch/data-library/feedstocks/cache_concurrent/9530739710fbcf2b76dfc53b9015733e-https_huggingface.co_datasets_leap_climsim_low-res_resolve_main_train_0001-10_e3sm-mmf.mli.0001-10-31-45600.nc' [while running 'Create|CheckpointFileTransfer|OpenURLWithFSSpec|OpenWithXarray|ExpandTimeDimAndAddMetadata|StoreToZarr|InjectAttrs|ConsolidateDimensionCoordinates|ConsolidateMetadata|Copy/OpenWithXarray/Open with Xarray-ptransform-69']
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1435, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 640, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/tmp/49c62bd385aca3688d1a12714f2750fe8c1ff62820958899edc077a7a1b05cccc61vc7xo/lib/python3.10/site-packages/apache_beam/transforms/core.py", line 2046, in <lambda>
  File "/tmp/49c62bd385aca3688d1a12714f2750fe8c1ff62820958899edc077a7a1b05cccc61vc7xo/lib/python3.10/site-packages/pangeo_forge_recipes/transforms.py", line 321, in <lambda>
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/openers.py", line 233, in open_with_xarray
    _copy_btw_filesystems(url_or_file_obj, target_opener)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/storage.py", line 32, in _copy_btw_filesystems
    with input_opener as source:
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/core.py", line 105, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/spec.py", line 1298, in open
    f = self._open(
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/implementations/local.py", line 191, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/implementations/local.py", line 355, in __init__
    self._open()
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/implementations/local.py", line 360, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '//leap-scratch/data-library/feedstocks/cache_concurrent/9530739710fbcf2b76dfc53b9015733e-https_huggingface.co_datasets_leap_climsim_low-res_resolve_main_train_0001-10_e3sm-mmf.mli.0001-10-31-45600.nc'

I think this is because I only provide a URL, not a `CacheFSSpecTarget` object, to the stage. @moradology maybe we should not allow string input?
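
If string input stays allowed, one option is a guard like the sketch below (hypothetical, not in the PR; `ensure_cache_target` is a name I made up) so this fails fast instead of silently falling back to the local filesystem:

```python
from typing import Union

from pangeo_forge_recipes.storage import CacheFSSpecTarget


def ensure_cache_target(cache: Union[str, CacheFSSpecTarget]) -> CacheFSSpecTarget:
    # Hypothetical guard: a plain string whose scheme gets dropped resolves to
    # fsspec's LocalFileSystem, producing the FileNotFoundError above.
    if isinstance(cache, str):
        raise TypeError(
            "Pass a CacheFSSpecTarget, not a URL string, as the cache location."
        )
    return cache
```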

moradology commented 5 months ago

Not a bad idea. Still, I wonder what's going wrong. Later in the process (in the ParDo) a CacheFSSpecTarget is required (https://github.com/pangeo-forge/pangeo-forge-recipes/pull/750/files#diff-8bac120398898793cd4f9daf94551b1f3d3f1867bed8a68b14cceed49d6dc30fR152), but it should be created here in the outer transform: https://github.com/pangeo-forge/pangeo-forge-recipes/pull/750/files#diff-8bac120398898793cd4f9daf94551b1f3d3f1867bed8a68b14cceed49d6dc30fR205-R208

Perhaps it's relevant that it opens with `//` rather than `gs://`? Maybe it is not creating the target appropriately? Actually, yeah. A closer look at this trace shows that it is trying to use the local filesystem rather than Google Cloud Storage, which is clearly what we want here.
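
This is easy to check with fsspec's own protocol parsing (a minimal illustration, not code from the PR): once the scheme is gone, fsspec falls back to the local filesystem.

```python
from fsspec.utils import get_protocol

print(get_protocol("gs://leap-scratch/data-library/feedstocks/cache_concurrent"))  # gs
print(get_protocol("//leap-scratch/data-library/feedstocks/cache_concurrent"))  # file
```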

jbusecke commented 5 months ago

So weird that this is happening for only some elements! The failures do seem to be reproducible (non-random with respect to filenames), though: I ran the recipe again and it failed on many of the same files. Will investigate further later today.
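
As a first sanity check, I can list what actually landed in the cache and compare against the inputs (hypothetical debugging snippet; the prefix is taken from the log above):

```python
import gcsfs

fs = gcsfs.GCSFileSystem()
# List the cache prefix to see which inputs were actually transferred:
cached = fs.ls("leap-scratch/data-library/feedstocks/cache_concurrent")
print(f"{len(cached)} files in cache")
```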

jbusecke commented 5 months ago

Oh shoot! Wrapping the URL in a `CacheFSSpecTarget` fixed it! Will increase the concurrency again and test with the full dataset.

jbusecke commented 5 months ago

Note that I did not use `CacheFSSpecTarget.from_url()` but did this instead:

import gcsfs
from pangeo_forge_recipes.storage import CacheFSSpecTarget

# Build the cache target explicitly so the GCS filesystem is used:
cache_target = CacheFSSpecTarget(
    fs=gcsfs.GCSFileSystem(),
    root_path="gs://leap-scratch/data-library/feedstocks/cache_concurrent",
)
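
For comparison, the constructor I skipped would look like this; my understanding is that `from_url` derives the filesystem and root path from the URL via fsspec, which is presumably where things go sideways:

```python
cache_target = CacheFSSpecTarget.from_url(
    "gs://leap-scratch/data-library/feedstocks/cache_concurrent"
)
```
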
moradology commented 5 months ago

This finding is super relevant for the upstream PR. I'll see if I can't drum up a test case to reveal the unexpected behavior with `.from_url` (I'm guessing).

jbusecke commented 5 months ago

OK, so I was able to run a complete lowres-mli build here, with the https-sync patch activated for both the caching and the OpenURLWithFSSpec stages, but I want the download to be faster.

Disabling the https-sync patch and setting concurrency to 20 gives me a bunch of these:

Name (https) already in the registry and clobber is False [while running 'Create|CheckpointFileTransfer|OpenURLWithFSSpec|OpenWithXarray|ExpandTimeDimAndAddMetadata|StoreToZarr|InjectAttrs|ConsolidateDimensionCoordinates|ConsolidateMetadata|Copy/OpenURLWithFSSpec/MapWithConcurrencyLimit/open_url-ptransform-68']
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1435, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 640, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/tmp/e3f79e737a8ab2d7b8203218342b4dd2085573636ed42230ccb55a58d8a96f4ep4wjutk7/lib/python3.10/site-packages/apache_beam/transforms/core.py", line 2046, in <lambda>
  File "/tmp/e3f79e737a8ab2d7b8203218342b4dd2085573636ed42230ccb55a58d8a96f4ep4wjutk7/lib/python3.10/site-packages/pangeo_forge_recipes/transforms.py", line 123, in <lambda>
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/openers.py", line 36, in open_url
    open_file = _get_opener(url, secrets, fsspec_sync_patch, **kw)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/storage.py", line 234, in _get_opener
    SyncHTTPFileSystem.overwrite_async_registration()
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/httpfs_sync/core.py", line 403, in overwrite_async_registration
    register_implementation("https", cls)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/registry.py", line 53, in register_implementation
    raise ValueError(
ValueError: Name (https) already in the registry and clobber is False

Wondering if this goes away if I reduce the concurrency.
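
If this is a repeat-registration race on a worker rather than concurrency per se, one workaround sketch (untested here, using the `clobber` flag visible in the fsspec frame above) is to make the registration idempotent:

```python
from fsspec.registry import register_implementation
from httpfs_sync.core import SyncHTTPFileSystem

# clobber=True replaces an existing "https" registration instead of raising
# "Name (https) already in the registry and clobber is False".
register_implementation("https", SyncHTTPFileSystem, clobber=True)
```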

jbusecke commented 5 months ago

> This finding is super relevant for the upstream PR. I'll see if I can't drum up a test case to reveal the unexpected behavior with `.from_url` (I'm guessing).

@moradology should we track this in a separate issue? Just asking since I expect to close this PR soon.

moradology commented 5 months ago

Issue up here: https://github.com/pangeo-forge/pangeo-forge-recipes/issues/752

jbusecke commented 5 months ago

Yoinks, I am all of a sudden getting a lot of failed transfers (for the mlo dataset). Not entirely sure if I am getting rate limited because I just downloaded 800 GB of data in short succession, or if one of the many alterations here broke something.

I have submitted a job with reduced concurrency for now and will wait until tomorrow to continue.
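
For reference, the knob I am turning is the opener's concurrency limit, which `MapWithConcurrencyLimit` (visible in the trace above) enforces. A sketch, assuming the released signature; the exact arguments on the PR branch may differ, and `cache_target` is the object defined earlier:

```python
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec

open_urls = OpenURLWithFSSpec(cache=cache_target, max_concurrency=5)
```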

jbusecke commented 5 months ago

I just tried pinning the exact commit hash in the requirements and increasing the concurrency (all files are cached right now).
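
Concretely, the pin in the requirements file looks something like this (`<commit-sha>` is a placeholder, not the real hash):

```
pangeo-forge-recipes @ git+https://github.com/pangeo-forge/pangeo-forge-recipes.git@<commit-sha>
```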