leap-stc / data-management

Collection of code to manually populate the persistent cloud bucket with data
https://catalog.leap.columbia.edu/
Apache License 2.0

ClimSim production deployments failing with caching-related errors #36

Open cisaacstern opened 11 months ago

cisaacstern commented 11 months ago

#35 successfully deployed production runs for both ClimSim recipes:

(Screenshot: Screen Shot 2023-08-01 at 1 27 34 PM)

Both of these jobs failed with caching-related errors:

```console
# mli pipeline
RuntimeError: aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host huggingface.co:443 ssl:default [Network is unreachable] [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0006-05/E3SM-MMF.mli.0006-05-30-36000.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0006-05/E3SM-MMF.mli.0006-05-30-36000.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0006-05/E3SM-MMF.mli.0006-05-30-36000.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0002-02/E3SM-MMF.mli.0002-02-08-81600.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0008-03/E3SM-MMF.mli.0008-03-05-24000.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0006-05/E3SM-MMF.mli.0006-05-30-36000.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']

# mlo pipeline
RuntimeError: aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host huggingface.co:443 ssl:default [Network is unreachable] [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0003-06/E3SM-MMF.mlo.0003-06-28-80400.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0003-06/E3SM-MMF.mlo.0003-06-28-80400.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0003-06/E3SM-MMF.mlo.0003-06-28-80400.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0007-10/E3SM-MMF.mlo.0007-10-10-03600.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0005-05/E3SM-MMF.mlo.0005-05-11-06000.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0003-06/E3SM-MMF.mlo.0003-06-28-80400.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
```

> (There is also this error; I need to double check which pipeline it's associated with...)
>
> ```console
> aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
> ```

AFAICT, all of the URLs listed as FileNotFound are in fact available from Hugging Face:

```python
import requests

# This code runs without an error.
# I've also manually downloaded a few of these files, which worked fine.

for url in urls:  # here, urls is the list of FileNotFound URLs from the errors above
    r = requests.head(url)
    if r.status_code != 302:  # HTTP 302 means 'Found'
        raise FileNotFoundError(url)
```

I therefore take these errors to be a symptom of rate-limiting by hugging face, because Dataflow scaled a cluster of 500-800 workers for each of these jobs.

The best solution for rate limiting I've seen so far would be to implement some version of the PTransform linked by @alxmrs in https://github.com/pangeo-forge/pangeo-forge-recipes/issues/389#issuecomment-1240150431. That will take development work, though, further delaying this job.
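For illustration only, here is a minimal sketch of the general idea behind client-side throttling. It is not the PTransform from the linked comment: the `RateLimiter` class and `throttled_map` helper are hypothetical names, and the limiter only caps calls within a single process. A Dataflow job spreads work across many workers, so a real solution would still need to budget requests per worker or coordinate through shared state.

```python
import threading
import time


class RateLimiter:
    """Allow at most `max_calls` acquisitions per `period` seconds in this process."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock  # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._lock = threading.Lock()
        self._timestamps = []  # times of recent calls, oldest first

    def acquire(self):
        """Block until another call is allowed, then record it."""
        # The lock is held while sleeping, so waiting threads queue up behind it.
        with self._lock:
            while True:
                now = self._clock()
                # Forget calls that have left the sliding window.
                self._timestamps = [
                    t for t in self._timestamps if now - t < self.period
                ]
                if len(self._timestamps) < self.max_calls:
                    self._timestamps.append(now)
                    return
                # Wait until the oldest recorded call expires.
                self._sleep(self.period - (now - self._timestamps[0]))


def throttled_map(func, items, limiter):
    """Call func(item) for each item, never faster than the limiter allows."""
    results = []
    for item in items:
        limiter.acquire()
        results.append(func(item))
    return results
```

Something like `throttled_map(requests.head, urls, RateLimiter(max_calls=10, period=1.0))` would then cap the URL check at roughly ten requests per second; the numbers are placeholders, not known Hugging Face limits.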

To unblock this, I'll try setting `max_num_workers` to a more modest number, maybe 50 to start. If this works for caching, we can cancel the job after caching is complete and then restart it with more workers, since once caching is complete we should not have networking issues accessing the data. This is a bit awkward, but I believe it's the fastest path to get this data built ASAP. Assuming this works, I'll revisit the RateLimit transform as my next work item.
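As a rough sketch of that two-phase plan, the snippet below generates one config per phase. It assumes the jobs are deployed with pangeo-forge-runner, that its Dataflow bakery reads a `max_num_workers` setting from a JSON config under a `DataflowBakery` key, and that 500 is an acceptable cap for the second phase; all three are assumptions rather than details confirmed in this thread.

```python
import json


def dataflow_config(max_num_workers):
    """Build a config dict that caps Dataflow autoscaling.

    The "DataflowBakery" / "max_num_workers" keys are assumed to match
    the runner's config schema; verify against the installed version.
    """
    return {"DataflowBakery": {"max_num_workers": max_num_workers}}


# Phase 1: small cluster while source files are cached from Hugging Face.
caching_json = json.dumps(dataflow_config(50), indent=2)

# Phase 2: placeholder larger cap, used only after caching has completed.
build_json = json.dumps(dataflow_config(500), indent=2)

print(caching_json)
print(build_json)
```

Each JSON string would be saved to its own file and passed to the deployment for the corresponding phase.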

jbusecke commented 11 months ago

Very interesting! I actually ran into the same issues with a CMIP6 recipe of only 4 files, even when running these with only a single worker (all of these were using the local bakery and pgf-runner).

> I therefore take these errors to be a symptom of rate-limiting by hugging face, because Dataflow scaled a cluster of 500-800 workers for each of these jobs.

So maybe that is not the root cause after all? It might still be a compounded issue, but to me this smells like a more general issue (maybe a version problem with fsspec?).

jbusecke commented 11 months ago

On a different note, I am not sure if we ever need 1000 workers (I think our data ingestion does not have to scale out as beastly as our analysis for instance)! So maybe we can have a more sensible global config option?

jbusecke commented 9 months ago

Just checking in here @cisaacstern. I am setting up a [project]() to keep track of things.

I decided to add a family of tags, `blocked: ...`, so we can quickly identify which components are actually blocking a particular issue/PR.

Am I correct in setting this to blocked by dataflow at the moment?